In the mid-1960s, semiconductor manufacturers observed that the density of circuits, such as transistors, fabricated on integrated circuits was doubling about every 18 months. This trend has continued and is now termed “Moore's Law.” Transistor density is viewed as a rough measure of computer processing power, which, in turn, corresponds to data processing speed. Although Moore's Law began as an observation, over time it has become widely accepted by the semiconductor industry as a fundamental driving force behind increasing computer processing power. As a result, semiconductor manufacturers have developed technologies for reducing the size of chip components to microscale and even nanoscale dimensions. Computer system architectures (for example, memory module systems, single-core processor devices, and multi-core processor devices) are encountering limitations while trying to keep up with Moore's Law.
The multi-core system example illustrates some of the problems encountered. In recent years, the semiconductor industry has developed processors comprising two or more sub-processors, called “cores.” For example, a dual-core processor contains two cores, and a quad-core processor contains four cores. Typically, the cores are integrated, share the same interconnects to the rest of the system, and can operate independently. Although semiconductor manufacturers could increase the transistor density of a single core, they have not moved in this direction because of inefficient power consumption. The alternative is to increase the number of cores packaged on a single die. A die is a single piece of semiconductor material on which an integrated circuit (“chip”) is fabricated. However, on-chip and off-chip communication has emerged as a critical issue for sustaining performance growth in the demanding, data-intensive applications for which these multi-core chips are needed. Computational bandwidth scales linearly with the number of cores, but the rate at which data can be communicated across a multi-core chip using top-level metal wires is increasing much more slowly. In addition, the rate at which data can be communicated off-chip through pins located along the chip edge is also growing more slowly than computational bandwidth, and the energy cost of on-chip and off-chip communication significantly limits the achievable bandwidth. As a result, computer architecture is now at a crossroads, and physicists and engineers are seeking alternatives to metal wires for on-chip and off-chip communication.
Computer system components, such as the cores on a chip, communicate with each other over a common interconnect and share resources. One way to avoid conflicts or collisions is an arbitration mechanism, by which the components determine which of them gets access to a shared resource at any given time.
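The background does not prescribe an arbitration policy. As one illustration only, the sketch below assumes a simple round-robin policy, a common choice for giving requestors fair access to a shared resource. The class name `RoundRobinArbiter` and its methods are hypothetical, and a hardware arbiter would examine all request lines in parallel rather than in a software loop.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Grants a shared resource to at most one requestor per cycle.

    Hypothetical illustration: priority rotates so that the requestor just
    served is searched last on the next cycle, which keeps any persistent
    requestor from being starved.
    """

    def __init__(self, num_requestors: int) -> None:
        if num_requestors <= 0:
            raise ValueError("need at least one requestor")
        self.n = num_requestors
        self.next_priority = 0  # index searched first on the next cycle

    def arbitrate(self, requests: Sequence[bool]) -> Optional[int]:
        """Return the index of the granted requestor, or None if nobody asked."""
        if len(requests) != self.n:
            raise ValueError("expected one request flag per requestor")
        for offset in range(self.n):
            candidate = (self.next_priority + offset) % self.n
            if requests[candidate]:
                # The winner becomes lowest priority for the next cycle.
                self.next_priority = (candidate + 1) % self.n
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(4)
    # Requestors 0, 2 and 3 keep asking; grants rotate among them.
    print([arbiter.arbitrate([True, False, True, True]) for _ in range(4)])
    # prints [0, 2, 3, 0]
```

Round-robin is used here only because it is short and easy to verify; other policies (fixed priority, age-based) trade fairness against simplicity differently.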
Arbitration for shared resources is critical to the performance of many systems, yet arbitrating efficiently among many requestors for a resource is often very slow relative to processor clock cycles. Furthermore, at high processor clock frequencies, even a moderately complex electrical implementation of arbitration can consume a great deal of power.
Controlling an N-input, N-output crossbar so that each output port is assigned a unique sender is a standard problem in computer networking. The usual hardware solutions are designed for systems with virtual output queues (VOQs), in which each sender has one VOQ per receiver. The best possible solution, a maximum matching in the bipartite graph of senders and receivers, can be computed by an offline sequential algorithm in O(N^2.5) time using the Hopcroft-Karp algorithm, but this would be far too slow for use as a crossbar arbitration scheme. Instead, electronically controlled network switch fabrics use an online, parallel, iterative scheme. In each round of a multi-round iterative process, senders request the right to send to receivers, an arbiter sends grants back in response to some of these requests, and some of the grants are then accepted. A maximal matching is achieved in O(log₂ N) rounds. The time required is typically measured in the tens of microseconds.
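The request/grant/accept rounds described above can be sketched as follows. The background does not name a specific iterative scheme; this sketch assumes a parallel iterative matching (PIM)-style variant in which each receiver grants, and each granted sender accepts, uniformly at random. The function names `pim_match` and `is_maximal` and the boolean VOQ-occupancy matrix are hypothetical. The loop simulates sequentially what hardware would do in parallel, and it produces a maximal matching (no further pair can be added), not the maximum matching that Hopcroft-Karp computes offline.

```python
import random
from typing import Dict, List, Optional, Set


def pim_match(
    voq_nonempty: List[List[bool]],
    max_rounds: Optional[int] = None,
    seed: int = 0,
) -> Dict[int, int]:
    """Match senders to receivers with request/grant/accept rounds.

    voq_nonempty[s][r] is True when sender s has data queued for receiver r.
    Returns a dict mapping each matched sender to its receiver.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    num_senders = len(voq_nonempty)
    num_receivers = len(voq_nonempty[0]) if num_senders else 0
    match: Dict[int, int] = {}
    taken: Set[int] = set()  # receivers already matched
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        # Request: every unmatched sender requests every unmatched
        # receiver for which it holds queued data.
        requesters: Dict[int, List[int]] = {
            r: [s for s in range(num_senders)
                if s not in match and voq_nonempty[s][r]]
            for r in range(num_receivers) if r not in taken
        }
        # Grant: each receiver grants one of its requesters at random.
        grants: Dict[int, List[int]] = {}
        for r, senders in requesters.items():
            if senders:
                grants.setdefault(rng.choice(senders), []).append(r)
        if not grants:
            break  # no unmatched pair has data: the matching is maximal
        # Accept: each granted sender accepts one grant at random. A receiver
        # grants at most one sender, so accepted pairs never conflict.
        for s, receivers in grants.items():
            r = rng.choice(receivers)
            match[s] = r
            taken.add(r)
    return match


def is_maximal(voq_nonempty: List[List[bool]], match: Dict[int, int]) -> bool:
    """True if no unmatched sender has data for an unmatched receiver."""
    taken = set(match.values())
    return not any(
        voq_nonempty[s][r]
        for s in range(len(voq_nonempty)) if s not in match
        for r in range(len(voq_nonempty[s])) if r not in taken
    )


if __name__ == "__main__":
    voq = [[True, True, False], [True, False, False], [False, False, True]]
    result = pim_match(voq)
    print(result, is_maximal(voq, result))
```

Every round that does not stop the loop adds at least one sender/receiver pair, so with `max_rounds=None` the loop always terminates with a maximal matching. Passing a small `max_rounds` mimics hardware that runs a fixed number of iterations and may stop short of maximality.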
What is desired is an arbitration scheme that consumes little power and performs its task at a speed commensurate with the system in which it operates, so that it does not become a bottleneck. Low complexity is also a desirable feature for an arbitration system.