1. The Field of the Invention
The present invention pertains to computer architecture. More particularly, the present invention pertains to a heterogeneous interconnect design having wires with varying latency, bandwidth and energy characteristics.
2. The Relevant Technology
One of the biggest challenges for computer architects is the design of billion-transistor architectures that yield high parallelism, high clock speeds, low design complexity, and low power. In such architectures, communication over global wires has a significant impact on overall processor performance and power consumption. VLSI techniques allow a variety of potential wire implementations, but VLSI wire properties have never been exposed to microarchitecture design.
VLSI techniques enable a variety of different wire implementations. For example, by tuning the wire width and spacing, one may design wires with varying latency and bandwidth properties. Similarly, by tuning repeater size and spacing, one may design wires with varying latency and energy properties. Further, as interconnect technology develops, transmission lines may become feasible, enabling very low latency for very low-bandwidth communication. Data transfers on the on-chip network also have different requirements: some transfers benefit from a low-latency network, others benefit from a high-bandwidth network, and still others are latency insensitive.
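By way of illustration only, the pairing of transfer requirements with wire implementations described above may be sketched as follows. The wire class names, the unitless figures attached to them, and the function `pick_wire` are hypothetical placeholders chosen for this sketch; they are not measurements and do not describe any particular design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireClass:
    name: str
    relative_latency: float    # lower is faster (assumed, unitless)
    relative_bandwidth: float  # higher is more bandwidth (assumed, unitless)
    relative_energy: float     # lower is less energy per bit (assumed, unitless)


# Hypothetical wire classes; every figure below is a placeholder.
WIRES = (
    WireClass("low_latency", 0.5, 0.25, 1.0),
    WireClass("high_bandwidth", 1.0, 1.0, 1.0),
    WireClass("low_energy", 2.0, 0.75, 0.5),
)


def pick_wire(need: str) -> WireClass:
    """Return the wire class best suited to a transfer's stated need.

    need is one of 'latency', 'bandwidth', or 'insensitive'.
    """
    if need == "latency":
        return min(WIRES, key=lambda w: w.relative_latency)
    if need == "bandwidth":
        return max(WIRES, key=lambda w: w.relative_bandwidth)
    if need == "insensitive":
        # A latency-insensitive transfer can take the cheapest wires.
        return min(WIRES, key=lambda w: w.relative_energy)
    raise ValueError(f"unknown need: {need!r}")
```

In this sketch a latency-insensitive transfer is sent over the lowest-energy wires; that policy is one plausible choice assumed here, not one stated in the text above.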
A partitioned architecture is but one approach to achieving the above-mentioned design goals. Partitioned architectures consist of many small and fast computational units connected by a communication fabric. A computational unit is commonly referred to as a cluster and typically comprises a limited number of ALUs, local register storage and a buffer for instruction issue. Since a cluster has limited resources and functionality, it enables fast clocks, low power and low design effort. Abundant transistor budgets allow the incorporation of many clusters on a chip. The instructions of a single program are distributed across the clusters, thereby enabling high parallelism. Since it is impossible to localize all dependent instructions to a single cluster, data is frequently communicated between clusters over the inter-cluster communication fabric. Depending on the workload, different types of partitioned architectures can exploit instruction-level, data-level, and thread-level parallelism (ILP, DLP, and TLP).
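The reason dependent instructions generate inter-cluster traffic may be sketched as follows. The round-robin steering policy, the representation of dependences as (producer, consumer) index pairs, and both function names are assumptions made for this sketch; they do not represent the steering heuristic of any particular partitioned architecture.

```python
def steer_round_robin(num_instructions, num_clusters):
    """Assign instruction i to cluster i mod num_clusters (assumed policy)."""
    return [i % num_clusters for i in range(num_instructions)]


def count_inter_cluster_transfers(deps, assignment):
    """Count dependences whose producer and consumer sit in different clusters.

    deps is a list of (producer_index, consumer_index) pairs.
    """
    return sum(1 for p, c in deps if assignment[p] != assignment[c])


# Hypothetical five-instruction program with four register dependences.
deps = [(0, 1), (1, 2), (0, 3), (2, 4)]

four_clusters = steer_round_robin(5, 4)   # clusters [0, 1, 2, 3, 0]
one_cluster = steer_round_robin(5, 1)     # clusters [0, 0, 0, 0, 0]

transfers_four = count_inter_cluster_transfers(deps, four_clusters)
transfers_one = count_inter_cluster_transfers(deps, one_cluster)
```

With this naive steering every dependence crosses a cluster boundary in the four-cluster case, whereas none does when all instructions share one cluster. Smarter steering lowers the count but, for the reason given above, cannot eliminate it.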
As computer architecture moves to smaller process technologies, logic delays scale down with transistor widths. Wire delays, however, do not scale down at the same rate. To alleviate the high performance penalty of long wire delays in future technologies, most design efforts have concentrated on reducing the number of communications through intelligent assignment of instructions and data to clusters. Even so, for a dynamically scheduled 4-cluster system, performance degrades by approximately 12% when the inter-cluster latency is doubled. Thus, irrespective of the implementation, partitioned architectures still experience a large number of global data transfers, and performance can be severely degraded if the interconnects are not optimized for low delay.
Since global communications happen on long wires with high capacitances, they are responsible for a significant fraction of on-chip power dissipation. Interconnect power is a major problem not only in today's industrial designs, but also in high-performance research prototypes. Computer architecture is clearly moving to an era where movement of data on a chip can have greater impact on performance and energy than computations involving the data; that is, microprocessors are becoming increasingly communication-bound.