For several decades, scaling in Complementary Metal-Oxide-Semiconductor (CMOS) technology has been driven by a desire for higher-performance, lower-cost integrated circuits. Scaling of CMOS devices has decreased the capacitive loads of digital gates, thereby reducing gate latency, and has also reduced the silicon area of those gates, which in turn has enabled the integration of more logic in the same chip area. The lower latency, together with the parallelization made possible by the increased amount of logic in the same chip area, has contributed to increased microprocessor performance in terms of instructions executed per second.
Two physical facts present significant challenges to continued performance improvement via simple device scaling. The first is dissipated power. As clock frequency increases, more power is consumed, up to the point where serious thermal and reliability issues can occur. Increased leakage, as a result of scaling, further cuts into the power budget, thus limiting the increase of frequency even more. The second physical fact is the effect of scaling on the properties of the metal wires that serve as device and unit interconnects. Previously regarded as negligible, the wire contribution to latency and power consumption has become significant with scaling, because wire resistance has increased and wire capacitance has become more significant relative to that of Metal-Oxide-Semiconductor (MOS) devices. While local wires still scale well, since their lengths get shorter as devices are scaled, global wires used for communication across the chip are a potential impediment to improving microprocessor performance.
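The dependence of dissipated power on clock frequency can be illustrated with the commonly used first-order switching-power model, P = a * C * V^2 * f. The following is a minimal sketch under that assumption only; the function name and all numeric values are hypothetical and chosen for illustration, and leakage power is deliberately not modeled.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order switching power in watts: P = activity * C * Vdd^2 * f.

    activity      -- fraction of the capacitance switched per cycle (assumed)
    capacitance_f -- total switched capacitance in farads (assumed)
    vdd_v         -- supply voltage in volts (assumed)
    freq_hz       -- clock frequency in hertz
    """
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical operating points: same chip, clock frequency doubled.
base = dynamic_power(activity=0.1, capacitance_f=1e-9, vdd_v=1.0, freq_hz=2e9)
doubled = dynamic_power(activity=0.1, capacitance_f=1e-9, vdd_v=1.0, freq_hz=4e9)

# With everything else held fixed, switching power grows linearly with frequency.
ratio = doubled / base
```

Under this model, doubling the clock frequency at a fixed supply voltage doubles the switching power, which is why frequency cannot be raised indefinitely within a fixed power budget.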
One current approach for providing improved microprocessor performance with tolerable power consumption is the chip multi-processor (CMP). Under this approach, scaling is used to reduce the area of the processor cores while keeping the clock frequency relatively constant. Integrating more cores in the same chip area provides a performance benefit through parallelization. Communication between on-chip cores and shared on-chip caches can be carried out using a traditional shared bus. However, one issue with this approach is that as the number of cores and cache banks increases further, the shared bus does not scale easily, resulting in more communication latency, higher power consumption, and increased chip area.
An alternative approach that appears to scale more easily is referred to as Network-on-Chip (NoC). In this approach, an on-chip router is associated with every on-chip core and cache bank. In a simple mesh topology, each router can communicate with its four orthogonal neighbors and with its adjacent core or cache. Multi-bit buses based on repeated wires are still assumed to connect the NoC routers. As a result, although close neighbors can communicate with high bandwidth and low latency in an NoC, communication between distant points across the chip incurs longer latencies. The bottleneck becomes even more severe when overall chip communication is considered, because heavier traffic increases router congestion and makes actual latencies even longer. Thus, there is a need for a high-bandwidth, power-efficient technique for on-chip communication.
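The growth of latency with distance in a mesh NoC can be illustrated by counting router-to-router hops. The following is a minimal sketch, assuming a square k-by-k mesh, minimal routing along the two mesh dimensions, uniformly distributed traffic between distinct routers, and no congestion; the function names are hypothetical and hop count is used only as a rough proxy for latency.

```python
from itertools import product


def mesh_hops(src, dst):
    """Minimal hop count between two routers at (row, col) positions.

    Each router links only to its four orthogonal neighbors, so the
    minimal path length is the Manhattan distance.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops(k):
    """Average hop count over all ordered pairs of distinct routers
    in a k-by-k mesh (uniform traffic assumed)."""
    nodes = list(product(range(k), repeat=2))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(mesh_hops(a, b) for a, b in pairs) / len(pairs)


# Neighbors are one hop apart; opposite corners of a 4x4 mesh are six hops apart.
neighbor = mesh_hops((0, 0), (0, 1))
corner_to_corner = mesh_hops((0, 0), (3, 3))

# The average distance grows with the mesh dimension.
avg_4 = average_hops(4)
avg_8 = average_hops(8)
```

Under these assumptions the average hop count grows in proportion to the mesh dimension k, so doubling the number of routers per side roughly doubles the average distance a message travels, before any congestion effects are considered.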