Technology scaling has yielded a wealth of transistor resources and largely commensurate improvements in chip performance. These benefits, however, have come with an ever-increasing price tag, due to rising design, engineering, validation, and application-specific integrated circuit (ASIC) initiation costs. The result has been a steady decline in ASIC “starts.” The cycle feeds on itself: fewer starts mean fewer customers to amortize the high fixed cost of fabrication facilities, leading to further increases in start costs and further declines in starts.
When designing a product, engineers must choose between two less-than-ideal options. Either they face the high fixed costs of ASIC production and hope to amortize them over a large volume of parts, or they choose a field programmable gate array (FPGA), which has low fixed costs but a high unit cost. The trade-offs are not just financial. ASICs are 3.4-4.6× (roughly 3-5×) faster than FPGAs and consume about 14× less power, and certain applications, such as cell phones, simply require an ASIC for these technical advantages. FPGAs, however, offer in-field reprogrammability, which is useful for accommodating changing standards. These competing demands drive the need for a manufacturing technology that combines the key advantages of FPGAs (low fixed costs and quick design turnaround, which lead to lower engineering cost) with the key advantages of ASICs (low unit cost, high performance, and low power).
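The financial side of this choice can be illustrated with a simple break-even calculation. The sketch below assumes a linear cost model (total cost = fixed cost + unit cost × volume); the class name, function name, and all dollar figures are hypothetical and chosen only to make the arithmetic easy to follow, not taken from any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Linear cost model: a one-time fixed cost plus a per-part cost."""
    fixed_cost: float  # non-recurring cost, in dollars
    unit_cost: float   # cost per part, in dollars

    def total(self, volume: int) -> float:
        return self.fixed_cost + self.unit_cost * volume


def break_even_volume(asic: CostModel, fpga: CostModel) -> float:
    """Volume at which ASIC and FPGA total costs are equal.

    Above this volume the ASIC is cheaper overall; below it the FPGA is.
    """
    if fpga.unit_cost <= asic.unit_cost:
        raise ValueError("FPGA unit cost must exceed ASIC unit cost")
    return (asic.fixed_cost - fpga.fixed_cost) / (fpga.unit_cost - asic.unit_cost)


# Hypothetical figures, for illustration only.
asic = CostModel(fixed_cost=2_000_000.0, unit_cost=5.0)
fpga = CostModel(fixed_cost=50_000.0, unit_cost=55.0)

volume = break_even_volume(asic, fpga)
print(f"Break-even volume: {volume:.0f} parts")
```

Under these assumed numbers, a product expected to ship fewer parts than the break-even volume would be cheaper to build on an FPGA, which is the pressure behind the declining number of ASIC starts described above.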
A key aspect of developing such technology will be the need to develop a network that can serve to interconnect components. On-chip networks play a critical role in the performance of computing systems, from high-speed network routers, to embedded devices, to chip multiprocessors (CMPs). As more functionality is progressively integrated on a single die, the communication infrastructure that binds the components will play a central role in overall chip performance. Researchers have developed a number of innovations in various aspects of network-on-chip (NoC) design, including novel topologies, routing algorithms, and switches optimized for latency, fault tolerance, and power consumption.
There are a handful of technologies that target the gaps between the performance, cost, power, and convenience of ASICs and FPGAs, as follows.
System on a Package (SoP): SoP is a technology in which multiple silicon dies are packaged together in the same chip package. Known commercially available SoPs are essentially multi-chip modules (discussed below), although research devices have been produced that use flip-chip bonding of multiple dies. SoPs offer a way to lower package costs, but they still rely upon users to design and pay for fabrication of the constituent ASICs that are bonded together. It would clearly be preferable to instead develop a market for pre-fabricated ASIC components that interconnect in a standard way, and for which the interconnecting scheme for the ASIC components is not fixed once the SoP chip has been fabricated.
Multi-chip modules (MCMs): An MCM consists of multiple silicon integrated circuits that share a single package. MCMs have been in commercial use for over 30 years, with packages as large as 10 cm on a side in use. MCMs amortize the packaging area overhead across multiple components. The closer together a pair of communicating chips is, the faster the chips can transmit signals to each other, and sharing a single package brings them closer together.
Systems on a Chip (SoC): SoCs are built by purchasing functional blocks and integrating them into a single design. Ultimately, however, an SoC consists of a single custom silicon die, whereas it would be preferable to assemble blocks that already exist as physical silicon components.
Structured ASICs: Structured ASICs, also sometimes called platform ASICs, are multi-layer circuits in which the circuitry in the bottom layers is fixed and only the top few layers (typically 2 to 3) are custom. The bottom layers form an array of logic units (e.g., lookup tables and flip-flops). These units are connected as dictated by the designer via custom wiring implemented in the top layers. Implementing a circuit in this way reduces the non-recurring costs of an ASIC because only the top layers are custom, so fewer layers must be designed and verified and fewer masks must be built. Furthermore, the circuit is largely composed of fixed logic, so if an application maps well onto the array, it will perform better than an FPGA implementation and consume less area, thereby reducing the unit cost. The structured ASIC market is expected to reach $1.3 B by 2010, siphoning off 3.5% of the anticipated $31.4 B ASIC market. Structured ASICs are currently commercially available at the 180 and 250 nm nodes through companies such as AMI Semiconductor, ChipX, eASIC, Faraday, Fujitsu, and NEC.
Coarse-Grained Reconfigurable Devices: In the gap between structured ASICs and FPGAs is a new class of coarse-grained reconfigurable devices. These chips consist of relatively large reconfigurable “objects” that are connected through a configurable, FPGA-style interconnect. One startup, MathStar, Inc., recently introduced its second-generation Field-Programmable Object Array (FPOA) family, called Arrix, which supports 400 individually configured 16-bit objects connected via a 1 GHz programmable interconnect. A second startup, CSwitch, has announced an architecture consisting of configurable control, compute, and switch nodes, connected via a 20-bit wide, 2 GHz interconnect fabric.
In some devices, these objects resemble processors. In some cases, they are targeted towards a specific class of applications, as with the picoArray™ from picoChip, which targets wireless signal processing, while in other cases, such as devices from QuickSilver, Ambric, and Cradle Technologies, the compute nodes are more general. In other devices, the entire device operates as a single reconfigurable processor.
Certain applications, such as HDTV decoding, map well onto these devices. Applications that map well generally contain significant amounts of traditional data parallelism, and operate on word-size chunks of data. Applications that do not map well are those that require specialized bit-level operations, and those with specific circuit requirements (e.g., analog-to-digital converters). It would be desirable to employ pre-made components that are readily interconnected to produce coarse-grained, configurable devices. Similar to the comparison to structured ASICs above, such an approach would provide the opportunity to mix and match both part types and fabrication technologies to produce a wider variety of coarse-grained chip devices.
FPGAs with Hard IP Cores: For years, FPGA manufacturers have provided complex fixed-logic cores inside their FPGA fabrics. For example, the Virtex-II Pro provides fixed multipliers, SRAM blocks, and entire PowerPC cores. More recent products from Xilinx and Altera have become even more specialized, with specific FPGAs targeted at different market segments (e.g., the Xilinx “FX” series targeted at embedded processing, and the “SX” series aimed at signal processing). The advantage of having these cores is that, if a design requires them, they incur little of the area, delay, and power overhead that programmable fabric carries relative to an ASIC. The disadvantage is that the core selection is set by the FPGA manufacturers, with product offerings that are necessarily limited. In contrast to these domain-specific FPGAs, it would be desirable to be able to integrate a variety of complex logic functions cheaply into the same chip.
When developing a network that can bind pre-defined components together, it will be desirable to create one that can be readily configured after fabrication. No single fixed-function network provides optimal performance across a range of traffic patterns. For example, a common network benchmark, uniformly random all-to-all communication, performs best with a network supporting direct non-local links, such as a fat tree, while a streaming-style benchmark is well suited to a simple mesh or ring network with efficient nearest-neighbor communication support. Furthermore, even when optimal designs share a topology, the networks are provisioned very differently in terms of buffering and packet sizes. When an application is fixed, say in an embedded device such as an ASIC, the communication pattern is also likely to be fixed, and the network can be tailored to a defined level of traffic without penalty. However, when the application and traffic pattern vary, as would likely be true in programmable devices such as CMPs, overall performance suffers if there is no option to modify the network configuration.
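The dependence of the preferred topology on the traffic pattern can be illustrated with a small hop-count comparison. The sketch below is an assumption-laden simplification: it uses average hop count as a crude proxy for latency, compares only a 16-node bidirectional ring against a 4 × 4 mesh without wraparound, assumes row-major node placement on the mesh, and ignores buffering, contention, and link bandwidth. It does not model a fat tree, and all function names are hypothetical.

```python
from itertools import product


def ring_hops(a: int, b: int, n: int) -> int:
    """Shortest hop count between nodes a and b on a bidirectional ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def mesh_hops(a: int, b: int, k: int) -> int:
    """Manhattan hop count on a k x k mesh (no wraparound), row-major node IDs."""
    a_row, a_col = divmod(a, k)
    b_row, b_col = divmod(b, k)
    return abs(a_row - b_row) + abs(a_col - b_col)


def average_hops(pairs, hops) -> float:
    """Mean hop count over a list of (source, destination) pairs."""
    return sum(hops(a, b) for a, b in pairs) / len(pairs)


K = 4        # mesh side length
N = K * K    # total node count, shared by both topologies

# Uniform all-to-all: every ordered pair of distinct nodes communicates.
uniform = [(a, b) for a, b in product(range(N), repeat=2) if a != b]
# Streaming: each node sends only to the next node in the pipeline.
streaming = [(i, (i + 1) % N) for i in range(N)]

results = {
    ("ring", "uniform"): average_hops(uniform, lambda a, b: ring_hops(a, b, N)),
    ("mesh", "uniform"): average_hops(uniform, lambda a, b: mesh_hops(a, b, K)),
    ("ring", "streaming"): average_hops(streaming, lambda a, b: ring_hops(a, b, N)),
    ("mesh", "streaming"): average_hops(streaming, lambda a, b: mesh_hops(a, b, K)),
}

for (topology, traffic), avg in sorted(results.items()):
    print(f"{topology:>4} / {traffic:<9}: {avg:.3f} average hops")
```

Under these assumptions the ring has the lower average hop count for streaming traffic while the mesh has the lower average for uniform traffic, so neither topology wins on both patterns. A different node placement on the mesh (for example, a serpentine ordering) would reduce its streaming cost, which is itself a form of per-application tailoring.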
To address this problem, it would be desirable to create a new type of on-chip network comprising a collection of building blocks that can be configured to function as an arbitrary network. The new type of network should support customization of network topology, link bandwidth, and buffering, all of which should be determinable post-fabrication, but prior to application runtime, thereby affording the opportunity to customize the network design to each application. Prior work indicates that there is significant opportunity to improve the performance of a NoC by tailoring the network to a particular application. To date, however, it has not been possible to realize these benefits with a single hardware configuration, since conventional networks on a chip cannot be configured in this manner.
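One way to picture the configuration step described above is as a descriptor that records topology, per-link bandwidth, and per-link buffering, and that is frozen before the application runs. The sketch below is purely illustrative: the class names, field names, and units are hypothetical and are not drawn from any actual device or from the design discussed here.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class LinkConfig:
    bandwidth_bits: int  # link width in bits per cycle (hypothetical unit)
    buffer_depth: int    # buffer slots at the receiving end of the link


@dataclass
class NetworkConfig:
    """Hypothetical post-fabrication description of an on-chip network."""
    num_nodes: int
    links: Dict[Tuple[int, int], LinkConfig] = field(default_factory=dict)
    locked: bool = False

    def connect(self, src: int, dst: int, bandwidth_bits: int, buffer_depth: int) -> None:
        """Add a directed link; allowed only before the configuration is locked."""
        if self.locked:
            raise RuntimeError("configuration is fixed once the application starts")
        for node in (src, dst):
            if not 0 <= node < self.num_nodes:
                raise ValueError(f"node {node} is out of range")
        if src == dst:
            raise ValueError("a link must join two distinct nodes")
        self.links[(src, dst)] = LinkConfig(bandwidth_bits, buffer_depth)

    def lock(self) -> None:
        """Freeze the topology, modelling the point just before application runtime."""
        self.locked = True


# Example: configure a 4-node unidirectional ring for a streaming application.
cfg = NetworkConfig(num_nodes=4)
for i in range(4):
    cfg.connect(i, (i + 1) % 4, bandwidth_bits=32, buffer_depth=4)
cfg.lock()
print(sorted(cfg.links))
```

The same building blocks could be described with a different set of links, widths, and buffer depths for another application, which is the flexibility that a fixed-function network cannot offer.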