Chip-to-Chip Input/Output (I/O) Thrust
The development of future general purpose high performance computing systems faces many challenges. These include enormous power demands as well as satisfying the bandwidth and latency needs of communication-centric architectures. Achieving exascale performance and beyond will require getting the most data reliably from one point in the system to another in the fewest clock cycles and at the lowest energy. These issues are critical at all scales of the system and are especially so in the Chip I/O Thrust. The demand for increased chip I/O bandwidth at lower latency becomes even more challenging in light of the power budget for these functions.
The goal of chip input/output (I/O) research is to develop chip-to-chip communication techniques that reduce latency, increase bandwidth, reduce energy, and provide a high degree of signal transmission reliability for future high performance computing (HPC) platforms. As evidenced by Moore’s law, the last decade has seen an explosion of on-chip integration and communication. However, board-level technology has not experienced the same level of growth. By 2015, microprocessor manufacturers expect local on-chip interconnects to sustain operating speeds more than 4 times those of today, with chips having over 160 processor cores. To accommodate these advances, chip-to-chip I/O technology must improve 6-fold over today’s capabilities. Power usage, along with the power-performance tradeoff, is a critical technical challenge for future HPC systems. This is especially true for chip-to-chip I/O: major HPC suppliers estimate that upwards of 40% of system power is consumed moving bits from one chip to another over the chip-to-chip I/O link. To develop effective solutions, Advanced Computing Systems (ACS) research targets technologies to enable a complete I/O link, including the transceiver and the medium (electrical or optical) between chips, as well as the signaling and coding methods employed on this link.
Quilt Packaging (QP) is a promising R&D project under way at the University of Notre Dame. It takes a unique approach to improving electrical interconnect between dies: pins (nodules) are formed at the edge of each chip, enabling direct chip-to-chip electrical connections. This drastically shortens the interconnect path between chips, enabling high-speed, low-power signal transmission. Because the technology results in less I/O parasitic resistance, capacitance, and inductance, signal bandwidth to and from the chip is increased. In addition, the smaller load allows chip output buffers to be made smaller, further reducing power consumption.
Fig. QP-1: Quilt Packaging technology concept
Work performed in this project includes manufacturability, reliability quantification, and I/O performance characterization and optimization. In addition, work continues to explore optimal pin architectures for high-speed operation. Nodules 20 µm wide were manufactured and shown to operate at 110 GHz with an insertion loss of only 0.1 dB, a record in chip-to-chip I/O performance. With improvements in the nodule formation process, intricate, high-precision interlocking nodules have been developed. Using these structures, researchers at Notre Dame have improved their die assembly process and demonstrated a 4-die quilt. Quilt Packaging remains a unique, promising, high-value technology for short-distance electrical chip-to-chip links, capable of enabling bandwidths up to 2 TB/s per chip while minimizing power consumption via low-RLC interconnect and smaller output driver circuits.
Fig. QP-2: High-precision interlocking nodules (left) and the first 4-die quilt (right)
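To put the reported 0.1 dB insertion loss in perspective, the following minimal sketch converts a loss in decibels into the fraction of signal power (and voltage amplitude) that survives the link. The function names are illustrative and not part of the project; only the 0.1 dB figure comes from the text above.

```python
def db_to_power_fraction(loss_db: float) -> float:
    """Fraction of input power delivered after an insertion loss given in dB."""
    return 10 ** (-loss_db / 10.0)


def db_to_voltage_fraction(loss_db: float) -> float:
    """Fraction of input voltage amplitude delivered after a loss given in dB."""
    return 10 ** (-loss_db / 20.0)


if __name__ == "__main__":
    loss_db = 0.1  # insertion loss reported for the 20 um nodules at 110 GHz
    power = db_to_power_fraction(loss_db)
    voltage = db_to_voltage_fraction(loss_db)
    print(f"Power delivered:   {power * 100:.2f} %")
    print(f"Voltage delivered: {voltage * 100:.2f} %")
```

A 0.1 dB loss corresponds to roughly 97.7% of the power reaching the receiving die, which is why a smaller output driver can suffice.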
Quilt Packaged Photonic Interconnect (QPIC)
Given the Quilt Packaging success demonstrated at Notre Dame, ACS and a large, distinguished team of researchers from Sandia National Labs (SNL) are working to transition the Quilt Packaging technology towards commercialization. This two-year program transfers the Quilt Packaging technology from Notre Dame to Sandia, develops the process to technology readiness level (TRL) 4.5, and demonstrates its use by integrating electronic and photonic integrated circuits (ICs) for chip-to-chip interconnect. Once the technology is refined and working at Sandia, SNL will be able to provide Quilt Packaging manufacturing on a limited-quantity basis for future high performance computing needs. While the primary goal is to accelerate the transition of these technologies towards commercialization, a secondary goal is to demonstrate how these technologies can combine to create new capabilities. The demonstration will showcase a viable path for manufacturing Quilt Packaging technology while mating it to an optical interconnect capability that provides lower latency, lower power, and higher bandwidth than electrical interconnect fabrics.
Fig. QPIC-1: Quilt packaged photonic interconnect demonstrator
Fig. QPIC-2: Award-winning microphotonic circuit QP nodules developed at Sandia
Estimates of the processing power demand on future HPC systems vary, but in alignment with the approximations of other researchers in the field, the long-term goal of this work is to enable a 10 TFLOPS HPC socket for 2017, implemented at the 17 nm technology node. At a data requirement of 1 byte/FLOP, the required aggregate (bidirectional) bandwidth interconnecting system nodes and memory will be 20 TB/s. With 5 GHz cores, 10 TFLOPS will require 200 W. To keep the socket power to 250 W, the supporting photonic network should consume just 50 W; 20 TB/s of on-chip bandwidth and 20 TB/s of off-chip bandwidth should each dissipate 25 W. Spreading 25 W over 20 TB/s (1.6 × 10^14 bits/s) sets a link energy goal of 156 fJ/bit.
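The budget arithmetic in the preceding paragraph can be reproduced with a short calculation. This is a sketch using only the figures quoted above; the variable and function names are illustrative, and the even 25 W / 25 W split between on-chip and off-chip traffic is taken directly from the text.

```python
def link_energy_fj_per_bit(power_w: float, bandwidth_tbytes_per_s: float) -> float:
    """Energy per bit (in femtojoules) for a link moving the given bandwidth."""
    bits_per_s = bandwidth_tbytes_per_s * 1e12 * 8  # TB/s -> bits/s
    return power_w / bits_per_s * 1e15              # J/bit -> fJ/bit


# Figures quoted in the text
socket_tflops = 10.0      # target socket performance
bytes_per_flop = 1.0      # data requirement
compute_power_w = 200.0   # power for the cores
socket_budget_w = 250.0   # total socket power target

one_way_tb_per_s = socket_tflops * bytes_per_flop  # 10 TB/s in each direction
aggregate_tb_per_s = 2 * one_way_tb_per_s          # 20 TB/s bidirectional
network_power_w = socket_budget_w - compute_power_w  # 50 W for the photonic network
off_chip_power_w = network_power_w / 2             # 25 W each, on-chip and off-chip

energy = link_energy_fj_per_bit(off_chip_power_w, aggregate_tb_per_s)

if __name__ == "__main__":
    print(f"Aggregate bandwidth: {aggregate_tb_per_s:.0f} TB/s")
    print(f"Network power:       {network_power_w:.0f} W")
    print(f"Link energy goal:    {energy:.2f} fJ/bit")
```

The result is 156.25 fJ/bit, which the text rounds to 156 fJ/bit.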
The Nanophotonics research work at HP investigates microring resonator optical modulation devices. By applying a voltage across a PN junction to inject charge carriers into an intrinsic silicon microring (p+ region inside the ring and n+ region surrounding it), the refractive index is changed such that a given device is brought into or out of resonance with the wavelength of a specific communication channel. Such a microring resonator is shown in Fig. NP-1. Compared with other optical modulation approaches, such as Mach-Zehnder interferometers or modulation by optical absorption, microrings have several advantages. They are silicon based and CMOS process compatible, allowing for their integration with CMOS logic on the same die. Generally, microrings are physically smaller than other modulation devices, allowing for denser integration and lower capacitance; lower capacitance implies the potential for faster and lower-power operation. Additionally, microrings can be used not only as modulators, but also as wavelength filters (including higher-order filters) and for multiplexing and de-multiplexing of channels on different wavelengths. The drawbacks of this approach are the fabrication challenge of producing identical rings across the design and the need for thermal tuning (with an associated control system) to compensate both for manufacturing variation and for heating from nearby components.
Fig. NP-1: Silicon photonic microring modulator; anode and cathode contacts are shown in the inset.
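The modulation mechanism described above can be illustrated with the standard first-order relations for a ring resonator: a change in effective index shifts the resonance by roughly Δλ ≈ λ·Δn_eff/n_g, and adjacent resonances are separated by a free spectral range of λ²/(n_g·L), where L is the ring circumference. The sketch below is illustrative only. The ring radius, group index, index change, and linewidth are assumed values chosen for the example, not parameters of the HP devices; the 1310 nm wavelength is the operating region mentioned in this section.

```python
import math


def resonance_shift_nm(wavelength_nm: float, delta_n_eff: float, n_group: float) -> float:
    """First-order resonance shift caused by an effective-index change."""
    return wavelength_nm * delta_n_eff / n_group


def free_spectral_range_nm(wavelength_nm: float, n_group: float, radius_um: float) -> float:
    """Spacing between adjacent resonances of a ring with the given radius."""
    circumference_nm = 2 * math.pi * radius_um * 1e3
    return wavelength_nm ** 2 / (n_group * circumference_nm)


def is_on_resonance(channel_nm: float, resonance_nm: float, linewidth_nm: float) -> bool:
    """True if the channel lies within half a linewidth of the ring resonance."""
    return abs(channel_nm - resonance_nm) <= linewidth_nm / 2


# Assumed (hypothetical) device parameters for illustration
CHANNEL_NM = 1310.0   # channel wavelength, ring initially aligned to it
N_GROUP = 4.2         # assumed group index of the silicon waveguide
RADIUS_UM = 5.0       # assumed ring radius
LINEWIDTH_NM = 0.1    # assumed resonance linewidth
DELTA_N = -1e-3       # assumed index change; carrier injection lowers the index

shift = resonance_shift_nm(CHANNEL_NM, DELTA_N, N_GROUP)
fsr = free_spectral_range_nm(CHANNEL_NM, N_GROUP, RADIUS_UM)

if __name__ == "__main__":
    print(f"Resonance shift: {shift:.3f} nm (negative = toward shorter wavelength)")
    print(f"Free spectral range: {fsr:.2f} nm")
    print("No bias, on resonance:  ", is_on_resonance(CHANNEL_NM, CHANNEL_NM, LINEWIDTH_NM))
    print("Injected, on resonance: ", is_on_resonance(CHANNEL_NM, CHANNEL_NM + shift, LINEWIDTH_NM))
```

Under these assumed numbers the injected carriers move the resonance by about 0.3 nm, several linewidths away from the channel, which is the on/off contrast a modulator relies on. The same sensitivity is why small fabrication or temperature variations require the thermal tuning noted above.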
Under this work, a single photonic wavelength division multiplexed (WDM) interconnect link operating near 1310 nm will be developed. The goal is to create a 160 Gb/s link, of 16 channels at 10 Gb/s, including CMOS RF driver circuits. HP is also developing silicon-compatible photodetectors as part of their photonic link under this research.
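As a rough cross-check of the link target, the sketch below computes the aggregate WDM rate and the power such a link would draw if it met the 156 fJ/bit energy goal stated earlier in this section. Applying that goal to this particular link is an assumption made for illustration; the names are hypothetical.

```python
def wdm_aggregate_gbps(channels: int, rate_gbps_per_channel: float) -> float:
    """Aggregate data rate of a WDM link with identical channels."""
    return channels * rate_gbps_per_channel


def link_power_mw(rate_gbps: float, energy_fj_per_bit: float) -> float:
    """Power drawn by a link at the given data rate and energy per bit."""
    return rate_gbps * 1e9 * energy_fj_per_bit * 1e-15 * 1e3  # W -> mW


aggregate = wdm_aggregate_gbps(channels=16, rate_gbps_per_channel=10.0)
power = link_power_mw(aggregate, energy_fj_per_bit=156.0)  # assumed energy target

if __name__ == "__main__":
    print(f"Aggregate rate: {aggregate:.0f} Gb/s")
    print(f"Link power at 156 fJ/bit: {power:.2f} mW")
```

At that energy target, a 160 Gb/s link would dissipate about 25 mW, including whatever share is allotted to the CMOS driver circuits.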
To investigate the benefits of three-dimensional integrated circuit (3DIC) technology for high performance computing applications, ACS is funding research at Georgia Tech. The goal of the 3D MAny-core Processor with Stacked memory (3DMAPS) project is to research 3-D architectures for high performance computing and to quantify the benefits of using 3-D packaging in conjunction with optimized architectural considerations. A team of graduate students led by professors Sung Kyu Lim, Hsien-Hsin Sean Lee, and Gabriel Loh is investigating and quantifying the advantages of 3D packaging in optimizing bandwidth and power consumption between microprocessor and memory. The plan for this research is to design and build a custom three-chip subsystem: a custom many-core processor, an instruction memory chip, and a data memory chip. These chips will be stacked and connected in a 3DIC fashion with through-silicon vias (TSVs) and tested with documented benchmark tests. Although there has been much university research in developing 3D physical structures, this research is the first from academia to produce a fully functional 3D many-core processor.
Georgia Tech researchers have completed a two-tier design that is being fabricated on a DARPA-sponsored run at Tezzaron. Georgia Tech’s 3D computer aided design (CAD) flow was completed and tested, and was used in the design of a 64-core processor (8x8, 500 MHz cores) with 8 KB of SRAM per core. The initial design implements a local memory access model in which each core has its own dedicated space. I/O pads are placed around the periphery in this design, whereas future designs will use full area-array flip-chip I/Os. Approximately 30,000 TSVs were required for the entire design. The aggregate bandwidth achieved was 61.3 GB/s at 277 MHz. The design was limited to Chartered Semiconductor’s 130 nm technology. Fig. 3DMAPS-1 depicts the two-tier design. The two dice are bonded face-to-face to provide electrical interconnections from processor to SRAM.
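The reported aggregate bandwidth can be related to the per-core memory traffic with a simple calculation. The sketch below uses the 64 cores, 277 MHz, and 61.3 GB/s figures from the paragraph above. The 4-byte access width used for the peak estimate is an assumption for illustration and is not stated in the text.

```python
def bytes_per_core_per_cycle(aggregate_gb_per_s: float, cores: int, clock_mhz: float) -> float:
    """Average bytes each core moves per clock cycle at the given aggregate rate."""
    return aggregate_gb_per_s * 1e9 / (cores * clock_mhz * 1e6)


def peak_bandwidth_gb_per_s(cores: int, clock_mhz: float, bytes_per_access: float) -> float:
    """Aggregate bandwidth if every core completes one access per cycle."""
    return cores * clock_mhz * 1e6 * bytes_per_access / 1e9


CORES = 64
CLOCK_MHZ = 277.0
MEASURED_GB_PER_S = 61.3
ASSUMED_ACCESS_BYTES = 4  # hypothetical access width, not given in the text

per_core = bytes_per_core_per_cycle(MEASURED_GB_PER_S, CORES, CLOCK_MHZ)
peak = peak_bandwidth_gb_per_s(CORES, CLOCK_MHZ, ASSUMED_ACCESS_BYTES)

if __name__ == "__main__":
    print(f"Bytes per core per cycle: {per_core:.2f}")
    print(f"Peak at 4 B/access: {peak:.1f} GB/s "
          f"({MEASURED_GB_PER_S / peak * 100:.0f}% achieved)")
```

The reported figure works out to about 3.5 bytes per core per cycle, consistent with each core reaching its own dedicated memory over the face-to-face bond on most cycles.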
A three-tier Version 2 design is currently underway to incorporate the Tezzaron-provided DRAM layers, for an upcoming planned MOSIS/Tezzaron wafer run. Using the DRAM layers only in the second run was prudent; the release of the data necessary to design the DRAM controller and the DRAM I/O information was delayed, and would have put the design schedule for the first design at risk if the DRAM had been relied upon. 3DMAPS version 2 will include:
- 3DMAPS with an SRAM layer that also includes a DRAM controller
- DRAM peripheral logic
Fig. 3DMAPS-1: Rendering of the 3DMAPS version 1 design, two-tier processor/SRAM stack
Fig. 3DMAPS-2: Rendering of the 3DMAPS version 2 design
Georgia Tech researchers published one technical paper on this work: “Design and Analysis of 3D-MAPS: A Many-Core 3D Processor with Stacked Memory,” IEEE Custom Integrated Circuits Conference, 2010.
ACS researchers are performing hands-on research in collaboration with the JHUAPL for the development of chip-to-chip conformal polymer optical waveguides. This project will develop techniques for manufacturing polymer waveguides that can be applied to high-density electrical substrates (such as silicon carrier) and provide high performance optical links in addition to the electrical interconnect. Specifically, techniques will be explored for routing vertical interconnect paths, incorporating mirrored surfaces to couple light between vertical paths and horizontal ones and for making 90 degree in-plane turns while minimizing losses. The concept is generalized in Fig. APL-1.
Fig. APL-1: Chip-to-chip photonic interconnect using VCSELs and photodetectors.
Electrical and optical interconnects can be created on the same substrate.
The optical link is to be designed such that functional or memory chips are placed adjacent to or bonded on top of the photonic devices; the structure in the figure is to be inverted. The laser source is currently a directly modulated VCSEL operating nominally at 845 nm, and a photodiode is used for the optical detector. The functional or memory chips will communicate with the photonic devices via short electrical interconnects embedded in the substrate or carrier. These interconnects will be very short, at lengths for which electrical interconnect has high performance and is power efficient. In this type of configuration, close-proximity chip-to-chip I/O will be performed electrically, while cross-carrier chip-to-chip I/O will be performed optically. This approach will be necessary as the size of a passive carrier grows while bandwidth requirements increase and link power budgets become more stringent.
Chip I/O researchers are moving forward with research in the modeling and simulation of chip-to-chip I/O links. Completed studies evaluated several of the latest techniques and explored tradeoffs for improving signal integrity in high-speed links, including model development and verification of the results.
Three studies have been completed. The first study was a signal integrity analysis of critical links on a functional circuit fabricated on a silicon carrier module. The second study was an analysis of a DDR3 channel between a processor/FPGA and memory. The third study was a comparison of optical versus electrical signaling. The results of the functional circuit and DDR3 modeling and simulation matched expected results. The optical versus electrical modeling uncovered some topics which require further investigation, including the lack of optical channel models and optical component models. Work continues with commercial tool developers in order to further enhance this modeling work.