1. Technical Field
The present invention relates in general to data processing systems and, in particular, to non-uniform memory access (NUMA) and other multiprocessor data processing systems having improved queuing, communication and/or storage efficiency.
2. Description of the Related Art
It is well-known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multi-processor (MP) computer systems can be designed with a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One common MP computer topology is a symmetric multi-processor (SMP) configuration in which each of multiple processors shares a common pool of resources, such as a system memory and input/output (I/O) subsystem, which are typically coupled to a shared system interconnect. Such computer systems are said to be symmetric because all processors in an SMP computer system ideally have the same access latency with respect to data stored in the shared system memory.
Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. In other words, while performance of a typical SMP computer system can generally be expected to improve with scale (i.e., with the addition of more processors), inherent bus, memory, and input/output (I/O) bandwidth limitations prevent significant advantage from being obtained by scaling an SMP beyond an implementation-dependent size at which the utilization of these shared resources is optimized. Thus, the SMP topology itself suffers to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases. SMP computer systems are also not easily expandable. For example, a user typically cannot purchase an SMP computer system having two or four processors, and later, when processing demands increase, expand the system to eight or sixteen processors.
As a result, an MP computer system topology known as non-uniform memory access (NUMA) has emerged to address the limitations to the scalability and expandability of SMP computer systems. As illustrated in FIG. 1, a conventional NUMA computer system 8 includes a number of nodes 10 connected by a switch 12. Each node 10, which can be implemented as an SMP system, includes a local interconnect 11 to which a number of processing units 14 are coupled. Processing units 14 each contain a central processing unit (CPU) 16 and associated cache hierarchy 18. At the lowest level of the volatile memory hierarchy, nodes 10 further contain a system memory 22, which may be centralized within each node 10 or distributed among processing units 14 as shown. CPUs 16 access memory 22 through a memory controller 20.
Each node 10 further includes a respective node controller 24, which maintains data coherency and facilitates the communication of requests and responses between nodes 10 via switch 12. Each node controller 24 has an associated local memory directory (LMD) 26 that identifies the data from local system memory 22 that are cached in other nodes 10, a remote memory cache (RMC) 28 that temporarily caches data retrieved from remote system memories, and a remote memory directory (RMD) 30 providing a directory of the contents of RMC 28.
The present invention recognizes that, while the conventional NUMA architecture illustrated in FIG. 1 can provide improved scalability and expandability over conventional SMP architectures, the conventional NUMA architecture is subject to a number of drawbacks. First, communication between nodes is subject to much higher latency (e.g., five to ten times higher latency) than communication over local interconnects 11, meaning that any reduction in inter-node communication will tend to improve performance. Consequently, it is desirable to implement a large remote memory cache 28 to limit the number of data access requests that must be communicated between nodes 10. However, the conventional implementation of RMC 28 in static random access memory (SRAM) is expensive and limits the size of RMC 28 for practical implementations. As a result, each node is capable of caching only a limited amount of data from other nodes, thus necessitating frequent high latency inter-node data requests.
A second drawback of conventional NUMA computer systems related to inter-node communication latency is the delay in servicing requests caused by unnecessary inter-node coherency communication. For example, prior art NUMA computer systems such as that illustrated in FIG. 1 typically allow remote nodes to silently deallocate unmodified cache lines. In other words, caches in the remote nodes can deallocate shared or invalid cache lines retrieved from another node without notifying the home node's local memory directory at the node from which the cache line was "checked out." Thus, the home node's local memory directory maintains only an imprecise indication of which remote nodes hold cache lines from the associated system memory. As a result, when a store request is received at a node, the node must broadcast a Flush (i.e., invalidate) operation to all other nodes indicated in the home node's local memory directory as holding the target cache line regardless of whether or not the other nodes still cache a copy of the target cache line. In some operating scenarios, unnecessary flush operations can delay servicing store requests, which adversely impacts system performance.
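The effect of silent deallocation on the home node's directory can be illustrated with a minimal sketch. The class and function names below (HomeDirectory, RemoteNode, count_unnecessary_flushes) are hypothetical and chosen only for illustration; they model the imprecise directory described above and are not drawn from any actual implementation.

```python
class HomeDirectory:
    """Toy model of a home node's local memory directory (hypothetical)."""

    def __init__(self):
        # Maps a cache line address to the set of remote node ids that
        # checked the line out. Silent deallocation never updates this map.
        self.sharers = {}

    def record_checkout(self, line, node_id):
        self.sharers.setdefault(line, set()).add(node_id)

    def flush_targets(self, line):
        # Every node the directory believes may still hold the line.
        return set(self.sharers.get(line, set()))


class RemoteNode:
    """Toy model of a remote node's cache (hypothetical)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.cached = set()

    def load(self, line, directory):
        self.cached.add(line)
        directory.record_checkout(line, self.node_id)

    def silent_deallocate(self, line):
        # Unmodified line is dropped without notifying the home node.
        self.cached.discard(line)


def count_unnecessary_flushes(line, directory, nodes):
    # A Flush is unnecessary if it targets a node that no longer caches the line.
    targets = directory.flush_targets(line)
    return sum(1 for n in nodes if n.node_id in targets and line not in n.cached)


directory = HomeDirectory()
nodes = [RemoteNode(i) for i in range(1, 4)]
for n in nodes:
    n.load(0x100, directory)

# Two of the three remote nodes silently drop the line.
nodes[0].silent_deallocate(0x100)
nodes[1].silent_deallocate(0x100)

unnecessary = count_unnecessary_flushes(0x100, directory, nodes)
```

In this sketch, a store to line 0x100 at the home node would still send Flush operations to all three remote nodes, although only one of them holds a copy.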
Third, conventional NUMA computer systems, such as NUMA computer system 8, tend to implement deep queues within the various node controllers, memory controllers, and cache controllers distributed throughout the system to allow for the long latencies to which inter-node communication is subject. Although the implementation of each individual queue is inexpensive, the deep queues implemented throughout conventional NUMA computer systems represent a significant component of overall system cost. The present invention therefore recognizes that it would be advantageous to reduce the pendency of operations in the queues of NUMA computer systems and otherwise improve queue utilization so that queue depth, and thus system cost, can be reduced.
In view of the foregoing and additional drawbacks to conventional NUMA computer systems, the present invention recognizes that it would be useful and desirable to provide a NUMA architecture having improved queuing, storage and/or communication efficiency.
The present invention overcomes the foregoing and additional shortcomings in the prior art by providing a non-uniform memory access (NUMA) computer system and associated method of operation having distributed global coherency management.
In accordance with a preferred embodiment of the present invention, a NUMA computer system includes a home node and one or more remote nodes coupled by a node interconnect. The home node includes a local interconnect, a node controller coupled between the local interconnect and the node interconnect, a home system memory, and a memory controller coupled to the local interconnect and the home system memory. In response to receipt of a data request from the remote node, the memory controller transmits requested data from the home system memory to the remote node and, in a separate transfer, conveys responsibility for global coherency management for the requested data from the home node to the remote node. By decoupling responsibility for global coherency management from delivery of the requested data in this manner, the memory controller queue allocated to the data request can be deallocated earlier, thus improving performance.
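The decoupling described above can be sketched in simplified form. The following is an illustrative model only: the MemoryController class, its method names, the message labels, and the assumption that the queue entry is released as soon as both transfers have been issued (without waiting for any acknowledgement) are all hypothetical and are not taken from the embodiment itself.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryController:
    """Toy home-node memory controller with a bounded request queue (hypothetical)."""

    queue_depth: int
    busy: set = field(default_factory=set)   # request ids holding a queue entry
    log: list = field(default_factory=list)  # messages issued, in order

    def allocate(self, request_id):
        if len(self.busy) >= self.queue_depth:
            raise RuntimeError("no free queue entry")
        self.busy.add(request_id)

    def service_decoupled(self, request_id, remote_node):
        self.allocate(request_id)
        # Transfer 1: deliver the requested data to the remote node.
        self.log.append(("data", request_id, remote_node))
        # Transfer 2: separately hand global coherency responsibility
        # for the requested data to the remote node.
        self.log.append(("coherency_handoff", request_id, remote_node))
        # Assumption: the entry is released once both transfers are issued,
        # rather than being held for an inter-node round trip.
        self.busy.discard(request_id)


mc = MemoryController(queue_depth=1)
mc.service_decoupled("req-A", remote_node=2)
mc.service_decoupled("req-B", remote_node=3)  # succeeds: the single entry was freed
```

With a queue depth of one, the second request can be accepted only because the entry used by the first request has already been released, which is the queue-utilization benefit the paragraph above attributes to the decoupling.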
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.