1. Field of the Invention
The present invention relates to multiprocessor computer systems and, more particularly, to techniques for increasing the bandwidth of intra-system data transmission in multiprocessor computer systems.
2. Related Art
Early computers each had a single processor, referred to as the central processing unit (CPU), that executed all software on the computer and otherwise controlled the computer's operation. The speed at which a computer could operate was therefore limited by the speed of the fastest individual processors available at any particular time.
Subsequently, computers were developed that could incorporate multiple processors. A single multiprocessor computer (MP) can use its multiple processors to perform operations in parallel, thereby achieving aggregate processing speeds greater than that of a single processor. Multiprocessor computers therefore can overcome, to a certain extent, the processing speed limitations imposed by the current state of the processor art. A single high-end server may include, for example, 1, 2, 4, 8, or 16 interconnected processors operating in parallel. Advances in multiprocessing technology will likely continue to increase the number of processors that may be interconnected within a single MP.
Although there are a variety of multiprocessor computer architectures, the symmetric multiprocessing (SMP) architecture is one of the most widely used architectures. Referring to FIG. 1, a computer system 100 having an SMP architecture is shown in block diagram form. The computer system 100 includes a plurality of cell boards 102a-d interconnected using a crossbar switch 116. Each of the cell boards 102a-d includes a plurality of CPUs, a system bus, and main memory.
For ease of illustration and explanation, the cell board 102a is shown in more detail in FIG. 1 and will now be described in more detail. The other cell boards 102b-d, however, may include components and a structure similar to that of cell board 102a. The cell board 102a includes a plurality of CPUs 104a-n, where n is a number such as 2, 4, 8, or 16. The CPUs 104a-n include on-board caches 106a-n, respectively. The cell board 102a also includes a system bus 108, main memory 112a, and memory controller 110a. The CPUs 104a-n are coupled directly to the system bus 108, while main memory 112a is coupled to the system bus 108 through memory controller 110a. CPUs 104a-n may communicate with each other over the system bus 108 and may access the memory 112a over the system bus 108 through the memory controller 110a, as is well-known to those of ordinary skill in the art.
Although each of the cell boards 102a-d includes its own local system memory (such as memory 112a), the memories in the cell boards 102a-d may be addressed by the CPUs in the cell boards 102a-d using a single combined virtual address space. The crossbar switch 116 provides a mechanism for communication among the cell boards 102a-d to perform such shared memory access and other inter-cell board communication. In general, a crossbar switch is a device that has a number of input/output ports to which devices may be connected. A pair of devices connected to a pair of input/output ports of the crossbar switch 116 may communicate with each other over a path formed within the switch 116 connecting the pair of input/output ports. The paths set up between devices can be fixed for some duration or changed when desired. Multiple paths may be active simultaneously within the crossbar switch, thereby allowing multiple pairs of devices to communicate with each other through the crossbar switch simultaneously and without interfering with each other.
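The path semantics just described can be illustrated with a minimal sketch. The class and method names below are hypothetical and serve only to model the behavior described above: a path is set up between a pair of ports, persists until torn down, and multiple disjoint paths may carry traffic simultaneously without interfering with one another.

```python
class CrossbarSwitch:
    """Toy model of a crossbar switch with a number of I/O ports.

    This is an illustrative sketch, not an implementation of any
    particular switch; real crossbar switches realize these paths
    in hardware.
    """

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.paths = {}  # maps each port to its peer on an active path

    def connect(self, port_a, port_b):
        """Set up a path between two free ports; the path remains
        active until explicitly torn down."""
        if port_a in self.paths or port_b in self.paths:
            raise RuntimeError("port already in use by an active path")
        self.paths[port_a] = port_b
        self.paths[port_b] = port_a

    def disconnect(self, port):
        """Tear down the path involving the given port."""
        peer = self.paths.pop(port)
        del self.paths[peer]

    def transfer(self, src_port, payload):
        """Deliver a payload over the active path from src_port to its
        peer; paths on other port pairs are unaffected."""
        return (self.paths[src_port], payload)


switch = CrossbarSwitch(num_ports=4)
switch.connect(0, 2)  # device on port 0 communicates with port 2
switch.connect(1, 3)  # simultaneously, port 1 communicates with port 3
```

Note that traffic on the path between ports 0 and 2 has no effect on the path between ports 1 and 3, which is the property that distinguishes a crossbar from a shared bus.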
Crossbar switches may be contrasted with buses, in which there typically is a single communications channel shared by all devices. A significant advantage of crossbar switches over buses is that an increase in traffic between any two devices does not affect the traffic between other pairs of devices. Furthermore, crossbar-based architectures typically offer greater scalability than bus-based architectures.
The crossbar switch 116 more generally is part of the “system fabric” or “switching fabric,” terms which refer to those components of the computer system 100 that enable the cell boards 102a-d to communicate with each other. If, for example, there were multiple crossbar switches in the system 100, the system fabric would include all such crossbar switches.
Cell board 102a also includes a fabric agent chip 114a that is coupled to the crossbar switch 116 and which acts as an interface between the cell board 102a and the other cell boards 102b-d in the system 100. The other cell boards 102b-d similarly include their own fabric agent chips 114b-d, respectively. Fabric agent chips 114a-d may be considered to be part of the system fabric.
As described above, the local memories in the cell boards 102a-d may be accessed using a single virtual address space. In an SMP such as the system 100 shown in FIG. 1, the fabric agent chips 114a-d in cell boards 102a-d enable this global shared memory address space. For example, consider a case in which CPU 104a issues a memory access request to memory controller 110a that addresses a memory location (or range of memory locations) in the global virtual address space. If the memory controller 110a cannot satisfy the memory access request from the local memory 112a, the memory controller 110a forwards the request to the fabric agent chip 114a. The fabric agent chip 114a translates the global memory address in the request into a new memory address that specifies the location of the requested memory, and transmits a new memory access request using the new address to the crossbar switch 116. The crossbar switch 116 forwards the memory access request to the fabric agent chip in the appropriate cell board.
The requested memory access is performed using the local memory of the receiving cell board, if possible, and the results are transmitted back over the crossbar switch 116 to the fabric agent chip 114a and back through the memory controller 110a to the CPU 104a. If the memory access request cannot be satisfied using the local memory of the receiving cell board, the memory access request may be satisfied using an I/O subsystem 118 coupled to the crossbar switch 116. The I/O subsystem 118 may, for example, include one or more hard disk drives that store infrequently accessed portions of memory according to a virtual memory scheme.
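The request-satisfaction chain described above can be sketched in a few lines. The address map, cell memory size, and function names here are illustrative assumptions, not details of the actual system: a global address is translated to a (cell board, local address) pair, a local hit is served by the memory controller, a remote hit is served over the fabric, and a miss falls through to the I/O subsystem's virtual memory store.

```python
# Illustrative assumption: each cell board contributes a fixed-size,
# contiguous slice of the global address space.
CELL_MEMORY_SIZE = 0x1000


def owning_cell(global_addr):
    """Translate a global address into a (cell_id, local_addr) pair,
    as the fabric agent chip is described as doing."""
    return global_addr // CELL_MEMORY_SIZE, global_addr % CELL_MEMORY_SIZE


def read(global_addr, local_cell_id, cell_memories, io_subsystem):
    """Satisfy a read: try local memory first, then the owning remote
    cell board over the fabric, then the I/O subsystem as a last resort.

    cell_memories maps cell_id -> {local_addr: value}; io_subsystem
    maps global_addr -> value (e.g., pages stored on disk).
    """
    cell_id, local_addr = owning_cell(global_addr)
    if cell_id == local_cell_id:
        # Memory controller satisfies the request from local memory.
        return cell_memories[local_cell_id].get(local_addr)
    remote = cell_memories.get(cell_id)
    if remote is not None and local_addr in remote:
        # Request forwarded through the fabric agent chips and
        # crossbar switch to the owning cell board.
        return remote[local_addr]
    # Fall back to the I/O subsystem's virtual memory store.
    return io_subsystem.get(global_addr)
```

In this sketch the caller is unaware of which of the three tiers served the request, mirroring the transparency goal discussed below.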
The CPUs in cell boards 102a-d may thereby access the main memory in any of the other cell boards 102a-d over the crossbar switch 116 using the fabric agent chips 114a-d in the cell boards 102a-d. One goal of such a system is to make the implementation of memory access transparent to the CPUs 104a-n, in the sense that the CPUs 104a-n may transmit memory access requests and receive responses in the same way regardless of whether such requests are satisfied from onboard memory, offboard memory, or the I/O subsystem 118.
It can be seen from FIG. 1 that all data transferred to and from the cell board 102a must pass through the fabric agent chip 114a. The bandwidth to and from the cell board 102a therefore is limited by the maximum bandwidth of the fabric agent chip 114a. Similarly, the crossbar switch 116 has a maximum bandwidth due to its particular architecture and implementation.
The bandwidth limits imposed by the fabric agent chip 114a, crossbar switch 116, and the links between the cell boards 102a-d and the crossbar switch 116 (and between the other crossbar switches, if any) do not pose any problems so long as the total bandwidth required by the CPUs 104a-n in cell board 102a does not exceed the maximum bandwidth of the fabric agent chip 114a and crossbar switch 116. As processing speed continues to double roughly every 18 months, and as the number of CPUs in each of the cell boards 102a-d increases, however, the CPUs 104a-n may perform memory access operations which require higher bandwidth than can be provided by the fabric agent chip 114a and crossbar switch 116.
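The constraint described above reduces to a simple inequality: the aggregate off-board bandwidth demanded by a cell board's CPUs must not exceed the maximum bandwidth of its fabric agent chip (and of the crossbar links behind it). The figures below are made-up examples for illustration, not specifications of any real part.

```python
def fabric_is_sufficient(num_cpus, per_cpu_demand_gbs, agent_max_gbs):
    """Return True if the fabric agent chip can carry the aggregate
    off-board bandwidth demanded by the cell board's CPUs."""
    return num_cpus * per_cpu_demand_gbs <= agent_max_gbs


# 4 CPUs each demanding 2 GB/s fit within a 16 GB/s fabric agent chip,
assert fabric_is_sufficient(4, 2.0, 16.0)
# but doubling both the CPU count and per-CPU demand exceeds it,
# illustrating how CPU scaling outpaces a fixed fabric.
assert not fabric_is_sufficient(8, 4.0, 16.0)
```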
One approach that has been taken to address this problem is to re-engineer the crossbar switch 116 when the CPUs 104a-n are replaced with faster CPUs, or when the number of CPUs in each cell board is increased. Re-engineering the crossbar switch 116, however, is tedious, time-consuming, and costly. In particular, speeding up the links in the crossbar switch 116 may be hindered by noise or other engineering obstacles, and increasing the number of links in the crossbar switch 116 increases the size of the crossbar switch 116, causes the crossbar switch 116 to consume more power, and increases the cost of manufacturing the crossbar switch 116.
What is needed, therefore, are techniques for increasing the bandwidth of intra-system data transmission in multiprocessor computer systems.