ACS Interconnect Thrust: Information

Currently, interconnects consume over ¼ of the rack space and a sizeable portion of the power required by modern supercomputers. Interconnect performance is the limiting bottleneck for supercomputers running communication-intensive problems, which are precisely the workloads these systems are built to handle. Such problems are best served by very large interconnects with very high switching speeds and small packets.
Because so few supercomputer interconnects are built (3 to 5 per year), the interconnect research community partners with the telecommunications industry to achieve critical mass. Although the telecommunications industry prefers larger packets and slower switches, there is significant overlap.
The goal of the interconnect thrust is to move the research community toward low-latency, small-packet, very large interconnects. A second, recently initiated goal is to coordinate the interconnect thrust with DOE's interconnect research directions, which focus more on point-to-point links.

DOS All-Optical Interconnect, UC Davis

The DOS switch exploits the wavelength-routing characteristics of an arrayed waveguide grating router (AWGR) based switch fabric.  This type of switch fabric is highly scalable, exhibits very low latency, and performs well when heavily loaded.  The challenges with DOS are to scale the control seamlessly as the switch grows and to find a feasible pathway to production.  UC Davis has demonstrated the capacity to design and fabricate all elements of the effort, including microfabrication, integration, hardware testing, device simulation, and architectural simulation, and has produced promising results so far.  After 10 months of work, they have completed the following:

  •   The system architecture, including flow control, arbitration, optical channel adapters, packet formats, and interoperability with InfiniBand and Rocket I/O.
  •   A cycle-accurate simulation model.
  •   Simulation studies of latency and bandwidth.
  •   Physical-layer 8x8 switching at line rates up to 40 Gb/s.
  •   An optical channel adapter based on Rocket I/O.
  •   An FPGA-based remote shared memory access application.
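
The AWGR at the heart of DOS is a passive wavelength router: under one common convention, a signal entering input port i of an N x N device on wavelength index k exits output port (i + k) mod N, so a packet is steered simply by choosing its transmit wavelength. The short Python sketch below illustrates only this cyclic routing property; it is not the DOS control logic, and the routing convention shown is an assumption about orientation, not the UC Davis design.

    # Illustrative sketch of AWGR cyclic wavelength routing (assumed convention:
    # output = (input + wavelength) mod N); not the actual DOS control logic.
    N = 8  # 8x8 fabric, matching the demonstrated physical layer

    def awgr_output(input_port: int, wavelength: int) -> int:
        """Output port reached from input_port when transmitting on this wavelength index."""
        return (input_port + wavelength) % N

    def wavelength_for(input_port: int, output_port: int) -> int:
        """Wavelength an input must use to reach the desired output."""
        return (output_port - input_port) % N

    # Example: a permutation of destinations is routed contention-free in one slot.
    destinations = [3, 0, 7, 5, 1, 6, 2, 4]          # desired output for each input
    lambdas = [wavelength_for(i, d) for i, d in enumerate(destinations)]
    assert [awgr_output(i, w) for i, w in enumerate(lambdas)] == destinations
    assert len(set(destinations)) == N               # distinct outputs -> no contention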

(U//FOUO) The principals for this project are S. J. Ben Yoo and Venkatesh Akella of UC Davis.

OSMOSIS, Photonic Controls Inc.

The OSMOSIS project started in 2001 in collaboration with the DOE, with work carried out by Corning (optical switch) and IBM Zurich (electrical controller). Photonic Controls spun off from Corning and is continuing the work. 
Currently, Photonic Controls is integrating an existing photonic switch with an existing electronic controller. The photonic switch is a highly scalable 64x64 prototype crossbar design. As seen in Figure 1, an electronic controller operates the optical switches for the prototype; it is presently a 16x16 controller constructed from 43 FPGAs. Work is underway to integrate the controller with the switch to create a 16x16 prototype.  Though the controller could be enlarged to cover the full 64x64 switch, the smaller size should be adequate for the planned bench testing.  The FPGA code is being ported to current-generation FPGAs, which are significantly more capable than the parts originally used by IBM.  Once integrated, the device will be the most capable switching device of its kind yet constructed. The goal is to bench test the integrated device to assess its performance, its scalability, and its potential for further development. Figure 2 illustrates the finished architecture, complete with a yet-to-be-developed host channel adapter (HCA).
This project previously constructed a 64x64 switch and demonstrated optical switching at 40 Gb/s per port. A separate 16x16 electrical controller and a simulation model have also been built.  The current work aims to integrate the switch and controller, and has so far produced a system integration plan along with control channel, test channel, and test logging demonstrations.  One new direction was to port working portions of IBM's FPGA code to current FPGA technology and to design a new management processor adapter; both items have been completed and demonstrated.
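
The centralized arbiter shown in Figures 1 and 2 must compute, every timeslot, a conflict-free matching between inputs requesting service and the outputs they want. The sketch below is a deliberately simple greedy round-robin matcher, written in Python only to illustrate the arbitration problem; it is not IBM's or Photonic Controls' actual algorithm, and the 16x16 size merely mirrors the prototype controller dimension.

    # Toy centralized crossbar arbiter: one greedy, round-robin matching per timeslot.
    # Illustrative only; the real OSMOSIS arbiter uses a more sophisticated iterative matcher.
    N = 16  # prototype controller size

    def arbitrate(requests, start):
        """requests[i] is the set of outputs input i wants; start rotates priority.
        Returns a dict {input: granted_output} with no output granted twice."""
        taken, grants = set(), {}
        for offset in range(N):                      # round-robin over inputs
            i = (start + offset) % N
            for out in sorted(requests.get(i, ())):  # first free requested output wins
                if out not in taken:
                    grants[i] = out
                    taken.add(out)
                    break
        return grants

    # Example timeslot: inputs 0 and 1 both want output 5; only one can be granted.
    slot_requests = {0: {5, 7}, 1: {5}, 2: {0}}
    print(arbitrate(slot_requests, start=0))   # {0: 5, 2: 0}; input 1 must retry next slot
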
The principal for this work is Ron Karfelt, President of Photonic Controls.


Figure 1: OSMOSIS demonstrator single stage switch system diagram.



Figure 2:  Major elements of the OSMOSIS interconnect including the HCA, amplifying multiplexers,
broadcast splitters, optical switch modules, and the centralized arbiter.


Optical Network Adapter, Columbia University/UMBC

This effort has explored and constructed an optical network interface card that is transparent to the compute nodes it serves and capable of scaling through wavelength division multiplexing (WDM). Such a network adapter is completely agnostic to the details of the underlying photonic interconnect. The adapter is also implemented with FPGAs, allowing architectural exploration, design validation, and experimental testing of a variety of computing systems and protocols.
So far, the researchers have constructed a testbed for end-to-end testing of their device. They have developed an all-optical interface capable of transparently formatting PCIe serial data streams into high-bandwidth WDM packets operating at up to 10 Gb/s over 4 WDM channels. They have also demonstrated PCIe-based end-to-end transmission from a remote endpoint, across the interface, to a host computer at these speeds.
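
Conceptually, the adapter stripes a serial PCIe byte stream across parallel WDM channels and reassembles it at the far end. The Python fragment below is a purely illustrative model of that striping (fixed-size segments dealt round-robin over 4 channels); the channel count matches the demonstration, but the segment size and framing are assumptions, not the Columbia/UMBC design.

    # Toy model of striping a serial stream over 4 WDM channels and reassembling it.
    # Segment size and round-robin ordering are illustrative assumptions.
    CHANNELS = 4
    SEG = 16  # bytes per segment

    def stripe(stream: bytes):
        """Split the stream into segments and deal them round-robin to the channels."""
        lanes = [[] for _ in range(CHANNELS)]
        for n, off in enumerate(range(0, len(stream), SEG)):
            lanes[n % CHANNELS].append(stream[off:off + SEG])
        return lanes

    def reassemble(lanes):
        """Interleave segments back into the original byte order."""
        out, n = bytearray(), 0
        while any(lanes):
            lane = lanes[n % CHANNELS]
            if lane:
                out += lane.pop(0)
            n += 1
        return bytes(out)

    payload = bytes(range(256)) * 3
    assert reassemble(stripe(payload)) == payload
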
This last year, the research team has accomplished the following:
  • Optimized their FPGA evaluation board for burst-mode operation.
  • Integrated their burst-mode link layer design with an InfiniBand (IB) link layer design.
  • Explored modifications to an IB switch to improve addressing, flow control, and resilience.
  • Completed a hybrid adapter operating at 10 Gb/s using four 2.5 Gb/s links, thus demonstrating WDM transmission through their adapter.
  • Integrated their adapter with an optical network testbed featuring topological and protocol flexibility.  The testbed is capable of evaluating all-optical WDM-based interconnects.

These adapters are critical to the integration of optical packet-switched networks into real-world computing systems.  The Data Vortex, for example, would require an adapter like this before it could be used in an architecture or advanced prototype.  The principals for this work are Keren Bergman of Columbia University and Gary Carter of UMBC.


Data Vortex, GA Tech/Columbia University

The Data Vortex, as depicted in Figure 3, is a multi-stage all-optical packet switch with distributed control. The architecture is built from very simple binary switching nodes that are designed to be transparent to wavelength division multiplexing (WDM) and signal modulation. A Data Vortex is defined by the height of its cylinders, by the angle (the number of node groups on each cylinder), and by the number of cylinders; Figure 3 illustrates a vortex with a height of 4, an angle of 3, and a cylinder count of 3. Latencies through the switch are near time of flight. The input ports are on the outermost cylinder of the vortex.  Data submitted through these ports progresses through the interior nodes and rings until it reaches the destination ports at the innermost cylinder. If a pathway is blocked at some point, the packet is deflected and remains on its existing cylinder. Since the deflection length is engineered to equal the packet size, no electrical buffering is necessary. This, coupled with efficient distributed control, makes the Data Vortex very power efficient.
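
A toy model of the deflection behavior is sketched below in Python: a packet tries to move one cylinder inward at each hop, and if the inward path is busy it is deflected around its current cylinder and tries again. The blocking probability and hop counting are illustrative assumptions; this is not the actual Data Vortex routing logic, which steers on destination address bits, but it shows why deflection trades extra hops for the absence of electrical buffers.

    # Toy deflection-routing model for a Data Vortex-like topology.
    # height x angle x cylinders = 4 x 3 x 3 = 36 nodes, as in Figure 3.
    import random

    HEIGHT, ANGLE, CYLINDERS = 4, 3, 3
    P_BLOCKED = 0.3   # assumed probability that the inward path is busy at a hop

    def route_one_packet(rng):
        """Return the number of hops needed to reach the innermost cylinder."""
        cylinder, hops = 0, 0          # start on the outermost cylinder
        while cylinder < CYLINDERS - 1:
            hops += 1
            if rng.random() < P_BLOCKED:
                continue               # deflected: stay on this cylinder, try again next hop
            cylinder += 1              # inward path free: descend one cylinder
        return hops

    rng = random.Random(1)
    hops = [route_one_packet(rng) for _ in range(10000)]
    print("nodes:", HEIGHT * ANGLE * CYLINDERS)
    print("average hops to innermost cylinder:", sum(hops) / len(hops))
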
So far, the researchers have shown transmission at 1.2 Tb/s per port over 33 channels of 40 Gb/s each through a Data Vortex link, and have demonstrated deflection on a 20 Gb/s channel. They have constructed a 12-port prototype and demonstrated transmission at 2.5 Gb/s per channel over 4 channels, for an aggregate of 10 Gb/s per port. They have constructed a Data Vortex simulator and used it to evaluate architectural performance and to explore new vortex configurations.  They have also developed a customized FPGA-based electrical subsystem, used it to emulate memory units, and demonstrated end-to-end performance through the prototype vortex at 2.5 Gb/s; using 8 channels on this subsystem, they have demonstrated a PCIe-based interface operating at 20 Gb/s.
This last year, the researchers have demonstrated low-latency distributed control averaging 110 ns across the 12x12 prototype, full end-to-end processor-to-memory transactions at 10 Gb/s, and a cycle-accurate simulator that they have used to explore architectural variations such as hierarchical topologies and express lanes.
The principals for this work are Kevin Martin, David Keezer, and Scott Wills of GA Tech, and Keren Bergman of Columbia University.


Figure 3:  A 12x12 Data Vortex with a height of 4, an angle of 3, and a cylinder
count of 3.

 

Power Efficiency Thrust


Stanford University – Power Efficient Supercomputing

Dr. William Dally at Stanford University is investigating low-power data transmission and storage, power efficient micro-architecture, and programming system power optimization under a BAA contract that was awarded at the end of 2008.  He and his team have started exploring the design of energy efficient channel circuits using ground-referenced, low-swing, differential signaling and double-edged clock repeaters.  They are also investigating a new SRAM design that uses the memory cells themselves to regenerate ground-referenced, low-swing differential write signals, reducing write energy by 66% compared to a conventional SRAM. 
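
The appeal of low-swing signaling can be seen from a first-order energy estimate: driving a wire of capacitance C to a reduced swing Vswing from a supply Vdd draws roughly C x Vswing x Vdd per transition, versus C x Vdd^2 for a full-swing driver. The numbers in the Python sketch below are illustrative assumptions, not Stanford's measured results.

    # First-order energy-per-transition comparison for full-swing vs. low-swing signaling.
    # Capacitance and voltages are assumed example values, not measured data.
    C = 200e-15       # wire capacitance: 200 fF
    VDD = 0.9         # supply voltage (V)
    VSWING = 0.2      # reduced signal swing (V)

    e_full = C * VDD * VDD        # ~162 fJ per transition
    e_low = C * VSWING * VDD      # ~36 fJ per transition
    print(f"full swing: {e_full*1e15:.0f} fJ, low swing: {e_low*1e15:.0f} fJ, "
          f"reduction: {100*(1 - e_low/e_full):.0f}%")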

Highlights of work accomplished this year include:
  • Development of a new self-calibrating, low-swing, capacitively coupled channel circuit that reduces data movement power compared to previous techniques while providing more tolerance to process variation.
  • Development of a cycle-accurate, execution-driven simulator for a multicore chip that will be used for simulations of energy-efficient mechanisms.
  • Commencement of a study of alternative active message implementations, comparing the alternatives in terms of power and performance.
  • Exploration of alternative memory hierarchies and development of a fully associative cache directory structure based on hash tables.


Pennsylvania State University – Ultra Low Power Computing

A new effort with Penn State was awarded under the BAA to investigate ultra-low voltage/power computing.  Dr. Vijaykrishnan Narayanan and Dr. Suman Datta have begun the co-exploration of the design, nanofabrication, and assembly of nanoscale quantum devices and their integration with novel logic architectures, in an effort to realize robust, ultra-low power computational systems.  They have demonstrated several generations of energy-efficient device architectures by addressing the fundamental limitations to supply scaling in traditional CMOS structures, and have devised circuits capable of operating at 250 mV or less. The focus of this new effort is to reduce the supply voltage further, to 50 mV.
Ultra-low voltage operation in the single- or few-electron regime requires a complete rethink of traditional circuit design to avoid high fan-out structures and contact-related parasitic loss. It also requires a synergistic system architecture suitable for implementing such circuit structures. This effort will explore a novel binary decision diagram (BDD) based reconfigurable logic architecture that uses two novel nanoscale quantum building blocks: tri-gate nanowire FETs and split-gate quantum nanodots built on III-V compound semiconductor quantum wells. At each decision node in the BDD construct, they propose to implement a nanodot with a split-gate architecture to confine the charge packet and to control the conductance of each exiting arm of the node using the Coulomb blockade phenomenon. Since the path-switching operation at a decision node is realized by "passive" transmission of messenger electrons through one of the arms, this implementation operates at ultra-low power without requiring large voltage gain, precise input/output voltage matching, or large current drivability in the node devices.  Key contributions expected from this research are:
1) Simultaneous development and optimization of nanoscale quantum devices to implement a system-level logic architecture operating with a record power-delay product approaching the thermodynamic (Landauer) limit of kT·ln(2) per bit.
2) Design of a novel reconfigurable BDD-based logic architecture for achieving reliable ultra-low power operation.
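
In a BDD-based architecture, computing a function amounts to steering a "messenger" from the root decision node down one arm per node until a terminal 0 or 1 is reached, which is why no voltage gain is needed at the nodes. The Python fragment below evaluates a two-input XOR expressed as a BDD; it is a conceptual illustration of the decision-diagram structure only, not a model of the nanodot devices.

    # Evaluate a binary decision diagram (BDD) by walking one arm per decision node.
    # Nodes are (variable_name, low_child, high_child); leaves are 0 or 1.
    XOR_BDD = ("a",
               ("b", 0, 1),   # a = 0: result is b
               ("b", 1, 0))   # a = 1: result is NOT b

    def evaluate(node, assignment):
        """Follow the low arm when the tested variable is 0, the high arm when it is 1."""
        while not isinstance(node, int):
            var, low, high = node
            node = high if assignment[var] else low
        return node

    for a in (0, 1):
        for b in (0, 1):
            assert evaluate(XOR_BDD, {"a": a, "b": b}) == (a ^ b)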

The results from this research will foster new research directions at the intersection of materials science, electrical engineering, and computer engineering, extending mesoscopic devices to real applications. The research on split-gate nanodots and wrapped-gate nanowire FETs directly addresses industry's quest for longer-term solutions to power-aware technology scaling. The new reconfigurable BDD logic architecture using these devices will enable integration and meaningful use of a broader range of emerging nanodevices toward realizing energy-efficient ULSI systems.

 

Productivity Thrust


ParalleX Execution Model

The scalability and performance of complex parallel mission applications are hindered by overheads, starvation, latency, and contention. Dr. Thomas Sterling at Louisiana State University (LSU) led the ACS-sponsored ParalleX execution model effort, which aims to eliminate or mask the effects of these performance-degrading phenomena through a synergistic combination of multithreading, message-driven computation based on a variant of active messages, lightweight synchronization, and meaningful encapsulation of first-class computation objects in a globally accessible name space.

The project's three principal areas are (1) the High Performance ParalleX (HPX) runtime system, the ParalleX Interface (PXI) specification document, and the PXlib thin-layer interface library, described collectively as the software system; (2) a Field Programmable Gate Array (FPGA) based port of portions of the system stack to the hardware domain (the hardware architecture); and (3) power consumption evaluation. Most objectives in these areas have not only been met, some have been exceeded. The HPX implementation library provides an optimized, flexible, modular, and global-barrier-free runtime system for parallel applications. Related activity involves the development of the low-level C language programmers' interface to the runtime layer (PXI), its specification document, and a lightweight implementation called PXlib. For evaluation and testing, a number of applications have been developed, ranging from adaptive mesh refinement (threshold of singularity creation in black hole astrophysics), through dynamic graph processing (a chess playing program), to traditional dense numerical problems (Linpack).
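
The flavor of the execution model can be conveyed in a few lines of Python: work is expressed as parcels (active messages) sent to named objects in a global name space, and lightweight synchronization objects trigger continuations when their inputs arrive, so no global barrier is ever needed. This is a conceptual sketch only; it is not the PXI interface, whose actual C API is defined in the project's specification document.

    # Conceptual sketch of message-driven execution: parcels, a global name space,
    # and a future-like synchronization object. Not the PXI/HPX API.
    from collections import deque

    objects = {}          # global name space: name -> object
    parcels = deque()     # pending active messages

    class Future:
        """Lightweight synchronization: runs queued continuations when a value arrives."""
        def __init__(self):
            self.value, self.waiters = None, []
        def then(self, action):
            if self.value is None:
                self.waiters.append(action)
            else:
                parcels.append((action, (self.value,)))
        def set(self, value):
            self.value = value
            for action in self.waiters:
                parcels.append((action, (value,)))

    def send(action, *args):
        parcels.append((action, args))

    def run():
        while parcels:
            action, args = parcels.popleft()
            action(*args)

    # Example: two "remote" squares computed message-style, summed when both arrive.
    objects["sum_ready"] = Future()
    results = []
    def square(x):
        results.append(x * x)
        if len(results) == 2:
            objects["sum_ready"].set(sum(results))
    objects["sum_ready"].then(lambda total: print("total:", total))
    send(square, 3)
    send(square, 4)
    run()   # prints: total: 25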

ParalleX is steadily gaining importance and recognition as one of the feasible execution models for driving future exascale platforms. This status has been achieved through a number of invited talks and presentations delivered by the PI and project members to the domestic and international high performance computing community, in academia as well as in industry.
 

R-Stream Based Scheduler

Reservoir used its proprietary R-Stream® technology to rapidly build an instruction scheduler that automates the tedious tasks that would otherwise face accelerated-supercomputer developers. The scheduler is based on a mix of advanced heuristics and mathematical operations research libraries that optimize the projection of the computation into space and time.  As a consequence, the level of abstraction at which developers can use the machine is raised; developers can focus on the “what” of their algorithms, and much less on the “where” and “when.”
In conjunction, these optimizations make it possible to write portable, easier-to-maintain programs and composable libraries.  In benchmark testing, Reservoir demonstrated the ability to bring up a kernel on the machine in one day, a task that would previously have taken developers months and a very significant improvement in productivity.  Furthermore, they find that the resulting automatically programmed data choreographies can beat the performance of hand code by up to 1.5x, because the scheduler can manage complexity and spot opportunities for optimization that a developer would miss.  The 1.5x improvement in performance translates directly into a 1.5x improvement in energy efficiency in large supercomputing installations, with significant potential operational cost savings.  Reservoir has also demonstrated the ability to generate program schedules for applications where no baseline exists because developers cannot keep up.
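
The idea of "projecting computation into space and time" can be illustrated with a classic wavefront schedule: for a loop nest in which each point (i, j) depends on (i-1, j) and (i, j-1), the affine schedule time = i + j, processor = j makes every point on a wavefront independent and therefore parallel. The Python check below is a generic polyhedral-style example of such a space-time mapping; it is not R-Stream output and does not reflect Reservoir's heuristics.

    # Generic space-time mapping example: wavefront schedule for a dependent 2-D loop nest.
    # time = i + j, processor = j; all points with the same time step are independent.
    N = 6

    def schedule(i, j):
        return (i + j, j)     # (time step, processor id)

    # Check that every dependence (i-1,j)->(i,j) and (i,j-1)->(i,j) is respected:
    # the producer's time step is strictly earlier than the consumer's.
    for i in range(N):
        for j in range(N):
            t, _ = schedule(i, j)
            for (pi, pj) in ((i - 1, j), (i, j - 1)):
                if 0 <= pi and 0 <= pj:
                    assert schedule(pi, pj)[0] < t

    # Group iterations by time step to see the parallelism exposed per wavefront.
    wavefronts = {}
    for i in range(N):
        for j in range(N):
            wavefronts.setdefault(i + j, []).append((i, j))
    print({t: len(pts) for t, pts in sorted(wavefronts.items())})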

Another outcome of Reservoir's breakthrough is that the breadth of mission for this tool will be vastly expanded.  With the tool, the number of developers able to use these machines is expected to increase by approximately 5x.


Scalable Processing in Dense Memory for SAT Solvers

Dr. Peter Kogge of Emu Solutions (www.emusolns.com) is leading an ACS-sponsored project to demonstrate a highly scalable, revolutionary new architecture and programming model that can greatly increase the concurrency that can be found in massive non-numeric problems of real economic benefit. In particular, the SPIDERS (Scalable Processing In Dense memory to Enhance Random computation in SAT solvers) project is attacking perhaps the most fundamental non-numeric problem, Boolean satisfiability (SAT), using as a baseline a world-class parallel SAT solver, Alef, from Reservoir Labs, Inc.

Mainstream computing is processing focused, but mission-critical applications are data intensive: they require handling firehoses of new data and performing “in memory” analysis of voluminous, complex, heterogeneous, unstructured data with random access and little locality. Emu Solutions is changing the processing paradigm by embedding lightweight computing (patented Gossamer cores) in a sea of memory, which enables moving the program rather than the data. This greatly reduces latency and bandwidth demands, since intermediate results need not be returned to a conventional core, and it very efficiently initiates parallelism where the data is, not in some remote processor.

The SPIDERS project has demonstrated a single-node multi-threaded structure supporting the key computational core of Alef (BCP – Boolean Constraint Propagation) and has projected how this combination would port to a full multi-node SPIDERS system. A prototyping platform based on collections of FPGA cards from Pico Computing, Inc. was chosen, and a Gossamer core was developed and demonstrated on individual cards. The Alef SAT solver was instrumented so that a trace file can be generated from any problem and then fed to a driver that initiates SPIDERS threaded code. A detailed performance analysis projects significant levels of concurrency for a SPIDERS system, with order-of-magnitude speedups over conventional systems. The Gossamer core design also includes the ability to initiate functions on special-purpose coprocessors (SPDs) implemented on the FPGA alongside the core. Using new options from Pico Computing, SPIDERS will enable construction of a system with dozens of Gossamer cores, well over 100 GB of shared memory, and a software application interface that is an extension of state-of-the-art shared memory packages.
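
BCP itself is simple to state: whenever all but one literal of a clause are false under the current partial assignment, the remaining literal is forced, and forcing it may create further unit clauses or a conflict. The Python sketch below implements this textbook form of unit propagation; it is illustrative only and says nothing about Alef's data structures or the Gossamer threading of the SPIDERS port.

    # Textbook Boolean Constraint Propagation (unit propagation) over CNF clauses.
    # Literals are signed integers: 3 means x3, -3 means NOT x3. Illustrative only.
    def bcp(clauses, assignment):
        """Extend assignment {var: bool} by propagation. Returns (assignment, conflict?)."""
        assignment = dict(assignment)
        changed = True
        while changed:
            changed = False
            for clause in clauses:
                unassigned, satisfied = [], False
                for lit in clause:
                    var, want = abs(lit), lit > 0
                    if var not in assignment:
                        unassigned.append(lit)
                    elif assignment[var] == want:
                        satisfied = True
                        break
                if satisfied:
                    continue
                if not unassigned:
                    return assignment, True          # conflict: clause fully falsified
                if len(unassigned) == 1:             # unit clause: the literal is forced
                    lit = unassigned[0]
                    assignment[abs(lit)] = lit > 0
                    changed = True
        return assignment, False

    # (x1 or x2) and (not x1 or x3) and (not x2): deciding x1=True forces x3=True.
    cnf = [[1, 2], [-1, 3], [-2]]
    print(bcp(cnf, {1: True}))   # ({1: True, 3: True, 2: False}, False)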

Emu is developing a fully functional system using the Pico Computing M503 chassis, capable of demonstrating a trace-driven BCP solver running in the Emu Solutions lab. With additional effort beyond the current contract, Emu expects to upgrade this work into a much more robust M503-based platform that could be used to develop and place into production full-scale threadlet-driven applications beyond SAT. The additional work would permit deployment of a dual-socket, quad-core, Pico Computing-based host in a 4U chassis that can access 130+ GB of memory in the M503s, with the potential for thousands of threadlets and dozens of SPDs, and robust software that allows a rich cross-section of parallel applications to take advantage of the system's capabilities.


Yggdrasil Language Translation

Under ACS sponsorship, Dr. Loring Craymer of the University Research Foundation made significant progress on his ANTLR-based Yggdrasil language translation package and the mapping grammar approach to formal language translation. Mapping grammars specify transformations via annotation and the assignment/instantiation of attributes that map an input language to an output language; provided that the mapping involves only terminal symbols, it is possible to derive a grammar for the output language. The major step forward has been the implementation of dependency graph algorithms to identify the possible grammar fragments at the points where Yggdrasil attributes are instantiated. This completes the initial implementation of a grammar generator; some generated grammars still require manual editing, and refactoring may be necessary to remove duplicate token phrases, but the grammar generator significantly reduces the time needed to develop a transformation pass.
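
The idea of a mapping grammar can be conveyed with a toy example: an input phrase is matched against an input rule, attributes are bound at the terminal symbols, and the output phrase is instantiated from those attributes, so the shape of the output grammar is implicit in the mapping. The Python fragment below is purely conceptual; it does not use Yggdrasil's notation, ANTLR, or the project's grammar generator.

    # Conceptual mapping-grammar example: input rule "let NAME = NUMBER ;"
    # maps to output rule "NAME := NUMBER ;" via attributes bound at the terminals.
    import re

    INPUT_RULE = re.compile(r"let\s+(?P<name>\w+)\s*=\s*(?P<value>\d+)\s*;")

    def translate(stmt: str) -> str:
        m = INPUT_RULE.fullmatch(stmt.strip())
        if m is None:
            raise ValueError("statement does not match the input grammar")
        attrs = m.groupdict()                            # attribute assignment at terminals
        return "{name} := {value};".format(**attrs)      # instantiation in the output language

    assert translate("let x = 5;") == "x := 5;"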

To provide a tutorial example and to encourage retargeting to languages other than Java, a translator was developed to generate code for the StringTemplate template language. This translator recognizes almost all of the StringTemplate syntax, although it was necessary to introduce an alternate syntax for anonymous template constructs to avoid multi-stage lexing, and it generates code for most of the syntactic constructs. An initial public release is pending.


Resilience Thrust


Executive Summary

High End Computer (HEC) resilience is a cross-cutting issue that affects all aspects of a system, from hardware to software and from circuit design to programming language.  Systems are becoming increasingly less reliable, with more frequent hard failures (stops) and soft failures (corrupted data, sometimes silently).  As such, users need approaches for getting efficient use out of unreliable systems.
There are myriad approaches to reliability, and they appear at all levels of the stack.  It is our opinion that the best approach to resilience is to have many solutions available that can be selected given the constraints each system presents.  For systems that are mainly write-intensive, non-volatile memory placed into the storage hierarchy and the use of frequent, rapid checkpoints appear to be a viable approach.  For failure-prone hardware with less well-defined operating parameters, a resilient programming approach is likely the more fruitful path.

In 2010 the ACS program funded three resilience projects.  Here, we detail only one: a university effort in resilience modeling that explores many key aspects of resilience from a theoretical basis.  This research looks into device refresh (rejuvenation), incremental checkpointing, checkpoint replication, and efficient scheduling on systems where spare nodes are available.

The Challenge

High-end computers (HEC) and, in particular, extreme-scale supercomputers fail.  Failures are becoming the norm, not the exception, particularly on the largest machines in the US.  Failures appear in hardware and software, through external means (temperature, radiation, power integrity, etc.) and internal ones ("when we do this, it breaks").  Sadly, producing a machine that “just works” is a pipe dream, and we must have approaches for getting productive use out of machines living in a failure-rich environment.
Even simply isolating a problem to a piece of hardware or software can be unbelievably challenging.  For instance, on one of Lawrence Livermore National Laboratory's largest machines, a team of HEC administrators and application scientists took several months to track down an elusive bug.  This particular bug affected exactly two processors out of thousands and produced a wrong result only when specific, very rare conditions presented themselves.  In this case, there happened to be an application that could reproduce the error consistently.  Clear questions arise:

  • How much money / resources were lost locating this bug?
  • How confident can we be that only these two processors are “bad”?
  • What recourse do we have anyway other than removing the components and hoping a replacement part works acceptably?

Fundamentally, resilience is a cross-cutting topic: it affects application programmers, storage systems, power, chip design, networks, and more.  The approaches for dealing with it are varied, and it is important to have a host of solutions available so the right one can be picked for the system being designed.
Below, we outline some of the challenges facing the resilience community.  While there are many more, we highlight here the ones of particular interest to the mission of the ACS program.

CHALLENGE: Consumer Products at Scale

Since the mid-1990s (in particular, with the “Beowulf revolution”), high end computing systems have for the most part morphed into clusters of commodity, off-the-shelf components.  While many special purpose machines are certainly still being built, they come at several major costs:

  • There is no economy of scale, so the cost (in dollars) is elevated.
  • Since there are few or no other users, there is no benefit from a community making the platform robust.

In many ways, this is why the commodity cluster field is so large and remains one of the very few commercially viable sectors of high end computing.  Only a handful of vendors are able to support custom designs.
However, with clusters of commodity components come consumer expectations for quality.  The resilience community's greatest challenge stems largely from the use of components that were never manufactured for extreme computing conditions.  Not only does scale play a part, but so do the demands of unusual applications and the near-constant utilization of the components.  There is a very direct and easy-to-identify correlation between the number of components in a system and the reliability of the system: the larger the component count, the lower the reliability.

CHALLENGE: Lower Power Means Lower Reliability

The device and circuit community has long known that one of the best ways to increase the integrity of a signal and reduce the chance of error is to increase the supply voltage.  This is in direct competition with recent trends toward reduced power consumption, particularly in extremely large supercomputers.  While this relates to only one particular type of resilience, it is one of the challenges.

CHALLENGE: Silent Data Corruption

While the satellite community has known about this problem for decades, the supercomputing community is only recently starting to experience it.  Silent data corruption (SDC) is simply a soft error in which the system produces an incorrect result (1 + 1 = 3, as a trivial example) without identifying it as a fault.  Because the error is silent, the application has little way of detecting it other than through its own algorithmic means, and that is not always possible.
SDC can come from many sources.  Often it is produced by radiation-induced transient upsets (for example, from cosmic rays) that “flip bits” in hardware.  Historically this has been dealt with through the use of error-correcting codes (ECC), and we have seen a trend of moving ECC from main memory into caches and, lately, into registers.
However, the device community is reporting (SELSE 2010 – Silicon Errors in Logic, System Effects) an increase in silent errors due to guard band violations in logic caused by continued miniaturization.  SDC is hard for application developers to detect, protect against, and deal with, and the problem is only getting worse.
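
ECC works by storing a few extra check bits whose parity pattern (the syndrome) pinpoints a flipped bit so it can be corrected on read. The Python sketch below implements the classic Hamming(7,4) single-error-correcting code as a minimal illustration of the mechanism; production memories use wider SECDED and chipkill codes, not this toy.

    # Minimal Hamming(7,4) single-error-correcting code (1-indexed bit positions).
    # Illustrates how a syndrome locates a flipped bit; not a production ECC.
    def encode(d):                      # d = [d1, d2, d3, d4]
        p1 = d[0] ^ d[1] ^ d[3]         # covers positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]         # covers positions 2,3,6,7
        p3 = d[1] ^ d[2] ^ d[3]         # covers positions 4,5,6,7
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]

    def decode(c):                      # returns (data, corrected_position or 0)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = s1 + 2 * s2 + 4 * s3      # syndrome = 1-indexed position of the error
        if pos:
            c = list(c)
            c[pos - 1] ^= 1             # correct the single flipped bit
        return [c[2], c[4], c[5], c[6]], pos

    data = [1, 0, 1, 1]
    word = encode(data)
    word[5] ^= 1                        # simulate a single upset (a "flipped bit")
    recovered, where = decode(word)
    assert recovered == data and where == 6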

CHALLENGE: Just Checkpoint to NVM

One of the primary ways of dealing with hard failures on HEC systems is to save application state to some persistent storage, typically a parallel file system.  This approach is typically called checkpoint/restart.  As systems grow in memory size and node count, this becomes an ever more difficult task due to the large amount of data that must be centralized.
With the increasing performance and decreasing cost of non-volatile memory (NVM) technology (see the challenge above on consumer products), NVM is becoming very attractive as part of the storage hierarchy.  It appears at this time that the Department of Energy's plans for exascale supercomputing will use NVM at the rack level as a way of taking quick checkpoints and then “bleeding” the data off slowly to a classical parallel file system.
This approach is very appealing for certain types of applications.  In particular, if an application writes often and rarely reads data (such as many numerical simulations) this approach is very powerful.  If, instead, the application is data-intensive and processes large amounts of data the approach is considerably more limiting.
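
The trade-off behind quick checkpoints can be quantified with a standard first-order estimate (Young's approximation): the checkpoint interval that minimizes lost time is roughly the square root of 2 x checkpoint cost x MTBF, so cutting the checkpoint write time with rack-level NVM shortens the optimal interval and shrinks the window of lost work. The figures in the Python sketch below are illustrative assumptions, not DOE projections.

    # Young's approximation for the optimal checkpoint interval: T_opt ~ sqrt(2 * C * MTBF),
    # where C is the time to write one checkpoint. Example numbers are assumptions.
    from math import sqrt

    def optimal_interval(checkpoint_cost_s, system_mtbf_s):
        return sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

    mtbf = 8 * 3600                      # assume an 8-hour system MTBF
    for label, cost in (("parallel file system", 1800.0), ("rack-level NVM", 60.0)):
        t = optimal_interval(cost, mtbf)
        print(f"{label}: checkpoint cost {cost/60:.0f} min -> "
              f"optimal interval ~{t/60:.0f} min")
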
The particular challenge is that DOE is one of the largest consumers of HEC cycles in the US; if it largely goes this route, other resilience approaches may come to be seen as unnecessary with respect to hard failures, particularly for machines that are predominantly write-intensive.  This will likely shrink the reliability research community, and those with data-intensive applications will find fewer tools available to combat unreliable systems.

CHALLENGE: Failure Data is Nearly Non-Existent

One of the most interesting and frustrating challenges in resilience is that little data exists on how real systems fail.  There are many reasons for this, including the fact that most extreme-scale HEC systems are used for classified or proprietary purposes, so dissemination of that information is rarely possible.  Secondly, and perhaps more frustratingly, the vendors involved in HEC systems are rarely willing to share failure data because they perceive it can be used against them.
The end result is that smaller scale researchers (such as at universities) have little failure data on real systems to use for modeling, simulation, and testing of novel approaches.

CHALLENGE: Resilient Programming

Currently, the onus falls on application programmers to make their applications resistant to failure.  The challenge, though, is that the application programmer has little information at their disposal for addressing unreliable systems.  It is not clear whether the right approach is to create new programming languages/models, new middleware/runtime systems, or even to augment operating systems to be more resilient.
What does appear clear, however, is that application programmers do not yet have the tools necessary to address the problem.  SDC (see the challenge above) is being considered a problem for uncertainty quantification (UQ) in next-generation applications, and UQ experts seem to agree that they are not yet ready to address it.

CHALLENGE: Unknown State of the System

One of the more surprising challenges to those unfamiliar with HEC reliability is that the state of a large system is usually very difficult or impossible to determine.  Historically, the system state has been composed of a collection of states from individual components, but as systems have continued to scale it has become impractical to monitor the hardware state on most larger-than-average machines.
The implication of this is that it becomes exceedingly difficult to schedule resources, mitigate failing components, and predict when failure will occur.  We have seen that one of the best indicators of the health of a system is not what is available in the logs or RAS databases but the applications that run on the systems.


Modeling for Resilience Foundations

This project examines a collection of fundamental resilience issues, all from a modeling perspective, in an attempt to isolate where improvements in machine reliability can be gained.  The core vision of this work is based on the extension of Louisiana Tech University's (LTU) current, generic reliability, availability, and serviceability (RAS) framework concept, which coordinates individual solutions across their respective technology fields and offers a modular approach that allows live adaptation to system properties and application needs.  This work extends LTU's resiliency approach through the investigation of the following seven areas: (1) advanced failure prediction and preemption models and techniques, (2) enhanced reliability models for systems and applications, (3) efficiency analysis of HEC resilience approaches, (4) non-intrusive system health and application progress monitoring, (5) interoperability of models and mechanisms with respect to different parallel programming paradigms, (6) resiliency for heterogeneous HEC systems, and (7) scalability of HEC resilience mechanisms.

Work has been conducted on k-node reliability modeling with excess life and has resulted in several papers (see below).  Failure probability modeling has historically used exponential distributions or, more recently, the Weibull distribution.  This project instead proposes a time-to-failure (TTF) distribution for a system of k s-independent nodes whose individual nodes exhibit time-varying failure rates.  The researchers explore system reliability using the TTF model and validate it against observed time-to-failure data.  It should be noted, though, that the lack of extensive failure data from large-scale systems makes this work hard to validate.
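
For s-independent, identical nodes, system reliability is simply the product of the node reliabilities, so a k-node machine's reliability falls off as the node reliability raised to the power k; with a Weibull node model this captures the time-varying failure rates that the exponential model misses. The Python sketch below illustrates only that baseline calculation; it is not LTU's excess-life TTF model, and the parameters are assumed for illustration.

    # Baseline reliability of k s-independent, identical nodes under a Weibull model.
    # R_node(t) = exp(-(t/eta)**beta); R_system(t) = R_node(t)**k. Parameters are assumptions.
    from math import exp

    def node_reliability(t_hours, eta=200000.0, beta=0.8):
        """Weibull survival function; beta < 1 models a decreasing hazard (infant mortality)."""
        return exp(-((t_hours / eta) ** beta))

    def system_reliability(t_hours, k):
        return node_reliability(t_hours) ** k

    for k in (1000, 10000, 100000):
        print(f"{k:>7} nodes: P(no failure in 24 h) = {system_reliability(24.0, k):.3f}")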

The PI has created the Consortium for Resilience Studies (CRS) in HPC.  CRS-HPC is a collection of leaders in the resilience field from academia, government, and industry.  The goal of the CRS is to be the authority in the vocabulary, algorithms, and methods utilized in the resilience field.  The CRS will accept and promote a standardized resilience lexicon, made readily available to the community for use in research dissemination and collaboration.  Currently, the home page for this is http://resilience.latech.edu. 

The researchers have also been exploring the notion of incremental checkpointing and, specifically, how to model the scheduling of applications and checkpoints in this type of system.  While many user or corporate applications do not employ checkpointing today, if incremental checkpointing theory were sufficiently developed, the ability to checkpoint applications rapidly after small changes could make this technology attractive.
Another area being explored is the rejuvenation of components of HEC systems and, in particular for this work, how to model the scheduling of rejuvenation.  Rejuvenation is the act of refreshing hardware and software, often via a warm reset/reboot, in an attempt to return the system to a known, failure-free state.  While this approach is controversial in that it does not address the fundamental problems inherent in the hardware or software, it is very much an engineering approach: it has been shown extensively to reduce failures on systems whose hardware or software gets into a “bad state”.  One of the fundamental questions with rejuvenation, though, is when it should be done and how to predict that a component needs to be rejuvenated.  Aspects of the LTU work examine this from a modeling perspective, in particular by combining it with their TTF distribution and checkpoint modeling.

Resilient Programming Models

The Resilient Runtime System (RRtS) and its programming model have some particularly interesting features.

  1. RRtS manages all memory and network transactions.  As such, it is positioned to keep copies of all messages it transmits to other compute elements.  This, in turn, allows RRtS to have a message log of all messages.  These extra messages consume additional memory, obviously, but since RRtS manages memory it frees the copies whenever additional memory is needed.
  2. Message logging allows RRtS to perform rollback/playback of communication and, due to the workflow programming model, of computation as well.
  3. With rollback/playback it becomes possible to return the computation to a “known good state”, such as after data becomes corrupted in the system.  To accomplish this, RRtS uses the concept of provenance, imprinting each packetized unit of work in the system with a genetic signature that records where the unit of work came from and what the genetic signatures of its “parents” were (meaning the units of work that operated on, manipulated, or created this new data).
  4. With genetic signatures and rollback/playback, when a data value has been corrupted it is possible to roll back all units of work that influenced the creation of the suspect data to a known good state.
RRtS is very much a research experimentation tool and is used to perform experiments on concepts that might become important to ACS in the future.
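
The genetic-signature idea can be illustrated in a few lines: every unit of work records the signatures of its parents, so when one value is found to be corrupt, the set of units that must be rolled back is the transitive closure of its descendants in that lineage graph. The Python sketch below shows only this bookkeeping; it is not the RRtS implementation, and the signature and API names are invented for illustration.

    # Toy provenance bookkeeping: each unit of work records its parents' signatures,
    # so the blast radius of a corrupted unit is the set of its descendants.
    parents = {}     # signature -> tuple of parent signatures

    def record(signature, *parent_signatures):
        parents[signature] = parent_signatures

    def tainted_by(bad_signature):
        """All units whose lineage includes the corrupted unit (candidates for rollback)."""
        tainted = {bad_signature}
        changed = True
        while changed:
            changed = False
            for sig, ps in parents.items():
                if sig not in tainted and any(p in tainted for p in ps):
                    tainted.add(sig)
                    changed = True
        return tainted

    # a and b are inputs; c = f(a, b); d = g(c); e = h(b).
    record("a"); record("b")
    record("c", "a", "b"); record("d", "c"); record("e", "b")
    assert tainted_by("a") == {"a", "c", "d"}    # e is untouched and need not roll back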

AMOEBA

AMOEBA is an end-to-end research effort with IBM that is exploring next-generation multicore architectures, algorithms that benefit from these architectures, and the enabling technologies that must be developed to overcome the physical barriers to realizing these architectures. 

2010 saw the substantial completion of the tasks associated with Phases 1-3, and the award of Phase 4 (A4).

A4 is a research program to explore and develop technologies for:

  • Designing chip-scale hybrid integration platforms to achieve high performance for high performance computing and application-specific workloads.
  • Developing the required low power electrical and optical I/O, thermal & power management technologies.
  • Developing the necessary enabling infrastructure, including test, design tools, methodology and design recommendations.
  • Providing meaningful technology demonstration vehicles which advance learning and understanding of integration issues.
  • Continuing collaboration on algorithms for high-speed streaming keyword scanning and regular expression matching (a textbook sketch of multi-keyword streaming scanning follows this list).
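
As context for the last item above, multi-pattern streaming keyword scanning is classically done with an Aho-Corasick automaton: the stream is consumed one symbol at a time, with no backtracking, and every keyword occurrence is reported as it completes. The Python sketch below is the textbook construction, offered only to illustrate the class of algorithm; it is not the algorithm developed in the IBM collaboration.

    # Textbook Aho-Corasick automaton for streaming multi-keyword scanning.
    # Illustrative only; not the project's algorithm or data layout.
    from collections import deque

    def build(keywords):
        goto, fail, out = [{}], [0], [set()]
        for word in keywords:                       # build the keyword trie
            state = 0
            for ch in word:
                if ch not in goto[state]:
                    goto.append({}); fail.append(0); out.append(set())
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].add(word)
        queue = deque(goto[0].values())             # breadth-first failure links
        while queue:
            s = queue.popleft()
            for ch, t in goto[s].items():
                queue.append(t)
                f = fail[s]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(ch, 0)
                out[t] |= out[fail[t]]
        return goto, fail, out

    def scan(stream, machine):
        goto, fail, out = machine
        state = 0
        for pos, ch in enumerate(stream):           # one transition per input symbol
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            for word in out[state]:
                yield pos - len(word) + 1, word

    m = build(["he", "she", "his", "hers"])
    print(sorted(scan("ushers", m)))   # [(1, 'she'), (2, 'he'), (2, 'hers')]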

Machine Learning

This research is intended to create high performance software and hardware to enable pattern recognition and correlation tasks. The end goal for the software is to deliver additional intelligence where traditional analysis tools may be inadequate for the task. Currently, this work has focused on implementing a novel learning algorithm developed in-house. Executing this algorithm in hardware and software yields learning and evaluation speeds that are much faster than any current implementation of this technology.

Also of note in this area is work using memristors to create processing nodes. Working with Sandia National Laboratories (SNL), the CEC developed software to interface with SPICE simulation software. The CLA algorithm was used to train a node composed of only four memristors and a differential operational amplifier (memron). This node learned all of the linearly separable logic functions. A new circuit was then designed using three memrons, two on the first layer and one on the output, and the CLA algorithm was able to train this small network to learn the XOR function. Because XOR is not linearly separable, it is the classic test of a multi-layer network. With these three memrons, all 16 Boolean functions of two inputs were learned (2 inputs give 4 input combinations, and 2^4 = 16 possible functions). The next step in this work is to extrapolate this design to the chip level using current specifications for memristors and operational amplifiers. The result of this work should show orders of magnitude more computational capability on pattern matching problems than current CPUs, with orders of magnitude less power and full utilization of the memristors' analog memory capability.
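
The structure of the three-memron XOR circuit, two first-layer units feeding one output unit, can be mirrored in software with three threshold units: a single unit cannot compute XOR because it is not linearly separable, but this two-layer arrangement can. The Python sketch below fixes the weights by hand purely to illustrate the structure; it does not model memristor conductances or reproduce the CLA training.

    # Two-layer threshold network with the same structure as the three-memron circuit:
    # two first-layer units and one output unit. Weights are hand-chosen for illustration;
    # in the actual work the CLA algorithm trains memristor conductances instead.
    def unit(inputs, weights, bias):
        return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

    def xor_net(a, b):
        h_or  = unit((a, b), (1, 1), -0.5)          # fires if a OR b
        h_and = unit((a, b), (1, 1), -1.5)          # fires if a AND b
        return unit((h_or, h_and), (1, -1), -0.5)   # OR and not AND  ->  XOR

    for a in (0, 1):
        for b in (0, 1):
            assert xor_net(a, b) == (a ^ b)
    print("two-layer threshold network reproduces XOR")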