The present invention relates generally to high-performance parallel multi-processor computer systems and more particularly to a distributed directory cache coherence architecture where the coherence directories are not maintained at the location of the memory units.
Many high-performance parallel multi-processor computer systems are built as a number of nodes interconnected by a general interconnection network (e.g., a crossbar or hypercube), where each node contains a subset of the processors and memory in the system. While the memory in the system is distributed, several of these systems (called NUMA systems for Non-Uniform Memory Architecture) support a shared memory abstraction where all the memory in the system appears as a large memory common to all processors in the system. To support high performance, these systems typically allow processors to maintain copies of memory data in their local caches. Since multiple processors can cache the same data, these systems must incorporate a cache coherence mechanism to keep the copies coherent. These cache-coherent systems are referred to as ccNUMA systems and examples are DASH and FLASH from Stanford University, ORIGIN from Silicon Graphics, STING from Sequent Computers, and NUMAL from Data General.
Coherence is maintained in ccNUMA systems using a directory-based coherence protocol. With coherence implemented in hardware, special hardware coherence controllers maintain the coherence directory and execute the coherence protocol. To support better performance, the coherence protocol is usually distributed among the nodes. With current solutions, a coherence controller is associated with each memory unit and manages the coherence of data mapped to that memory unit. Each line of memory (typically a portion of memory tens of bytes in size) is assigned a "home node", which manages the sharing of that memory line and guarantees its coherence.
The home node maintains a directory, which identifies the nodes that possess a copy of the memory line. When a node requires a copy of the memory line, it requests the memory line from the home node. The home node supplies the data from its memory if its memory has the latest data. If another node has the latest copy of the data, the home node directs this node to forward the data to the requesting node. The home node employs a coherence protocol to ensure that when a node writes a new value to the memory line, all other nodes see this latest value. Coherence controllers implement this coherence functionality.
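The home-node behavior described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the class `HomeDirectory`, its methods, and the node identifiers are hypothetical names chosen for illustration, and the model assumes a simple invalidate-on-write policy with no message timing.

```python
class HomeDirectory:
    """Toy directory kept by a home node for its memory lines (illustrative only)."""

    def __init__(self):
        self.sharers = {}  # line address -> set of node ids holding a shared copy
        self.owner = {}    # line address -> node id holding the latest (modified) copy

    def read(self, line, requester):
        """Return the supplier of the data: 'memory' or the id of the owning node."""
        owner = self.owner.get(line)
        if owner == requester:
            return requester  # requester already holds the latest copy
        holders = self.sharers.setdefault(line, set())
        if owner is not None:
            # Home directs the owner to forward the data; owner keeps a shared copy.
            del self.owner[line]
            holders.add(owner)
        holders.add(requester)
        return "memory" if owner is None else owner

    def write(self, line, requester):
        """Make requester the exclusive owner; return the nodes to invalidate."""
        stale = self.sharers.get(line, set()) - {requester}
        prev = self.owner.get(line)
        if prev is not None and prev != requester:
            stale = stale | {prev}
        self.sharers[line] = set()
        self.owner[line] = requester
        return sorted(stale)


directory = HomeDirectory()
print(directory.read(0x40, "A"))   # supplied from home memory
print(directory.read(0x40, "B"))   # supplied from home memory
print(directory.write(0x40, "C"))  # copies at A and B must be invalidated
print(directory.read(0x40, "A"))   # home directs C to forward the latest data
```

In this sketch the directory only records which nodes hold copies; the actual data movement and invalidation messages are left to the caller, which mirrors the split between the directory state and the coherence controller that acts on it.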
While existing ccNUMA systems differ in the organization of the node and the system topology, they are identical in two key aspects of their coherence architecture. First, they implement a coherence controller for each memory unit, which maintains coherence of all memory lines in that memory unit. Second, the functionality of the coherence controller is integrated with the functionality of the memory controller of the associated memory unit. However, a solution based on the collocation of a coherence controller with each memory unit is not well matched with several trends in multi-processor computer system architecture. Since these coherence architectures require a coherence controller for each memory unit, the cost of the coherence mechanism is high in system architectures with high ratios of memory units to processor units. For example, the FLASH system requires as many coherence controllers as there are processors. While the cost of the coherence mechanism is lower when the system architecture has lower ratios of memory units to processors, these systems may not support the low-latency, high-bandwidth access to memory required for high-performance ccNUMA systems. One trend is to meet the ever-increasing memory bandwidth requirements of processors by using node designs with higher ratios of memory units to processor units. With as many coherence controllers as memory units, the large number of coherence controllers increases the cost of the system.
Integrating the coherence controller functionality with the memory controller functionality (as in these coherence architectures) may also not be a suitable approach with next generation processors where the memory or the memory controller is integrated with the processor on the same chip. In future processor architectures the memory (or the memory controller) will be integrated on the same chip as the processor to bridge the latency and bandwidth gap between the processor and memory. When memory is on the same chip as the processor, it may not be feasible to collocate the coherence control with the memory on the same chip. Such an approach would also disallow the tuning of the coherence protocol to meet requirements of specific ccNUMA system designs.
A coherence architecture where coherence directories and control are located in nodes at the site of memory may also result in longer access to remote data when the nodes are situated at the endpoints of the network. When a node requires access to data that is in a cache or memory in another node's processor, a message must first traverse the network from the requesting node to the node maintaining the directory. Then, the node maintaining the directory must send another message to the node with the data. Finally, the data must flow from the node with the data to the node requesting the data. This shows that it may not be desirable to collocate coherence controllers with memory units because coherence messages (between coherence controllers) must travel between endpoints of the network and thereby increase the latency of remote memory accesses.
A solution has long been sought which would use fewer coherence controllers, be viable for systems based on processors with integrated memory, and reduce the latency of coherence transactions.
The present invention provides a network of communication switches interconnecting the nodes in a cache-coherent multi-processor computer architecture. The nodes connect to communication switches through communication links to form the network. Coherence directories are located at the communication switches, and the coherence controls are integrated into the communication switches. The coherence directory at a communication switch maintains coherence information for all memory lines that are "homed" in the nodes that are directly connected to that communication switch.
The present invention provides fewer coherence controllers, is a viable approach for systems based on processors with integrated memory, and also reduces the latency of several coherence transactions.
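The latency argument can be made concrete with a small hop-count sketch. Everything in it is an assumption for illustration: the six-element topology in `LINKS`, the names `hops` and `three_leg_hops`, and the use of link count as a stand-in for latency. It compares a three-leg transaction (requester to directory, directory to data holder, data holder to requester) when the directory sits at the home node against the case where it sits at the switch directly connected to that home node.

```python
from collections import deque

# Hypothetical topology: three nodes (R, H, O), each attached to one switch,
# with the three switches fully connected to each other.
LINKS = [("R", "S1"), ("H", "S2"), ("O", "S3"),
         ("S1", "S2"), ("S2", "S3"), ("S1", "S3")]


def hops(links, src, dst):
    """Breadth-first search for the minimum number of links between src and dst."""
    adjacent = {}
    for a, b in links:
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        here, dist = queue.popleft()
        if here == dst:
            return dist
        for nxt in adjacent.get(here, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path from {src} to {dst}")


def three_leg_hops(links, requester, directory, holder):
    """Total links crossed: request to directory, forward to holder, data to requester."""
    return (hops(links, requester, directory)
            + hops(links, directory, holder)
            + hops(links, holder, requester))


at_home_node = three_leg_hops(LINKS, "R", "H", "O")  # directory at endpoint node H
at_switch = three_leg_hops(LINKS, "R", "S2", "O")    # directory at H's switch S2
print(at_home_node, at_switch)
```

For this toy topology the transaction crosses 9 links with the directory at the home node and 7 with the directory at the switch, because the request and the forward no longer traverse the link between the switch and the home node. Real savings would depend on the actual topology and on per-hop and controller latencies, which this sketch does not model.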
The above and additional advantages of the present invention will become apparent to those skilled in the art from a reading of the following detailed description when taken in conjunction with the accompanying drawings.