As it is known in the art, symmetric multi-processing computers allow for high performance application processing. Typical symmetric multi-processing computer systems include a number of processors coupled together by a bus. One characteristic of a symmetric multi-processing system is that memory space is shared among all of the processors. One or more operating systems are stored in memory and control the distribution of processes or threads among the various processors.
By allowing many different processors to execute different processes or threads simultaneously, the execution speed of a given application may be greatly increased. In theory the performance of a system could be improved by simply increasing the number of processors in the multi-processing system. In reality, the continued addition of processors past a certain saturation point serves merely to increase communication bottlenecks and thereby limit the overall performance of the system.
For example, referring now to FIG. 1A, a typical prior art multi-processor system 2 including eight processors coupled together via a common interconnect bus is shown. During operation, each of the processors 3a-3h communicate with the other processors and with a shared memory 4 via a shared interconnect bus 5. The symmetric multi-processing arrangement of FIG. 1A has been adequate for multi-processors built to date. However, with the advent of faster microprocessors, a common shared interconnect is not capable of sufficiently exercising the full performance potential of the coupled microprocessors. Because the only communication link between the processors and memory is the shared bus, the bus may rapidly become saturated with requests from the processors, thereby increasing delays as each processor attempts to gain access to the system bus. Therefore, although the processors may be able to operate at enhanced speeds, the limiting factor in terms of performance is the available bandwidth of the system bus.
Communication bandwidth is a key factor in the performance of SMP systems. Since bandwidth may not be uniform between pairs or subsets of nodes in the SMP system, the industry uses a "bisection bandwidth" measurement for determining the communication bandwidth of an SMP system. Bisection bandwidth is determined in the following manner. All possible ways of partitioning the system into two portions of equal compute power (equal number of processors) are ascertained. For each partition, the sustainable bandwidth between the two partitions is determined. The minimum of all of the sustainable bandwidths is the bisection bandwidth of the interconnect. The minimum bandwidth between the two partitions indicates the communication bandwidth sustainable by the multiprocessor system in the presence of worst-case communication patterns. Thus, a large bisection bandwidth is desirable.
Several interconnection architectures or "topologies" have been used in the prior art to overcome bus saturation problems. These topologies include meshes, touri, hypercubes and enhanced hypercubes.
As an example, a mesh interconnect is shown as system 7 in FIG. 1B. The major advantage of the mesh network is its simplicity and ease of wiring. Each node is connected to a small number of other neighboring nodes. However, the mesh interconnect has three significant drawbacks. First, messages must on average traverse a large number of nodes to get to their destination, and as a result the communication latency is high. Second, the bisection bandwidth does not scale as well for a mesh topology as it does for other topologies. Finally, because each of the messages may traverse different paths within the mesh, there are no natural ordering points within an SMP system, and therefore the cache coherence protocols required to implement the mesh topology are often quite complex.
The torus, hypercube, and enhanced hypercube topologies are all topologies wherein the nodes are interconnected in various complex arrangements, for example in a torus arrangement or a cube arrangement. The torus, hypercube and enhanced hypercube interconnects are more complex than the mesh interconnect, but offer better latency and bandwidth than the mesh interconnect. However, like the mesh interconnect, the torus, hypercube and enhanced hypercube topologies do not provide natural ordering points, and thus a complex cache coherence protocol must be implemented for each of those systems.
In shared-memory multiprocessor systems, processors typically employ private caches to store data determined likely to be accessed in the future. Since processors may read data from their private cache and may update data in the private cache without writing it back to memory, a mechanism is needed to ensure that the private chaches of each of the processors are kept consistent, or coherent. The mechanism that is used to ensure coherency of data in the SMP system is referred to as the cache coherence protocol.
Besides the topology, bandwidth, and latency of the physical interconnect the efficiency of the cache coherence protocol is a key factor in system performance. Cache coherency protocols may introduce latencies, bottlenecks, inefficiencies or complexity in several ways.
The latency of load and store operations is often directly affected by the protocol of the design. For example, in some protocols, a store operation is not considered complete until all invalidate messages have made it to their target processors and acknowledgment messages have made it all the way back to the original processor. The latency of stores here is much higher than a protocol wherein the original processor does not have to wait for the Invalidates to make it to their destination. Further, the acknowledgments consume a significant fraction of the system bandwidth.
Bottlenecks are often introduced due to high occupancy of controllers. "Occupancy" is a term of art; it indicates the amount of time a controller is unavailable after it receives a request. In some protocols, when a controller receives a request corresponding to a memory location, it becomes unavailable for other requests to the same memory location until certain acknowledgments corresponding to the former command arrive at the directory. If the controller receives conflicting requests at a higher than average rate, it becomes a bottleneck.
The design of the cache coherence protocol also affects hardware complexity. For instance, some protocols introduce deadlock and fairness problems, which are then addressed with additional mechanisms. This results in added hardware complexity.
It is desirable to provide a symmetric multiprocessing system that minimizes the latency of operations, provides large communication bandwidth, provides low controller occupancy, and can scale to a large number of processors.