1. Technical Field
The present invention relates in general to a method and system for data processing and, in particular, to data processing within a non-uniform memory access (NUMA) data processing system. Still more particularly, the present invention relates to a NUMA data processing system and method of communication in a NUMA data processing system in which transactions that receive a Retry response at a target processing node are held at the target processing node prior to being returned to the requesting processing node.
2. Description of the Related Art
It is well-known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multi-processor (MP) computer systems can be designed with a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One of the most common MP computer topologies is a symmetric multi-processor (SMP) configuration in which multiple processors share common resources, such as a system memory and input/output (I/O) subsystem, which are typically coupled to a shared system interconnect. Such computer systems are said to be symmetric because all processors in an SMP computer system ideally have the same access latency with respect to data stored in the shared system memory.
Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. In other words, while performance of a typical SMP computer system can generally be expected to ID improve with scale (i.e., with the addition of more processors), inherent bus, memory, and input/output (I/O) bandwidth limitations prevent significant advantage from being obtained by scaling a SMP beyond a implementation-dependent size at which the utilization of these shared resources is optimized. Thus, the SMP topology itself suffers to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases. SMP computer systems also do not scale well from the standpoint of manufacturing efficiency. For example, although some components can be optimized for use in both uniprocessor and small-scale SMP computer systems, such components are often inefficient for use in large-scale SMPs. Conversely, components designed for use in large-scale SMPs are impractical for use in smaller systems from a cost standpoint.
As a result, an MP computer system topology known as non-uniform memory access (NUMA) has emerged as an alternative design that addresses many of the limitations of SMP computer systems at the expense of some additional complexity. A typical NUMA computer system includes a number of interconnected nodes that each include one or more processors and a local xe2x80x9csystemxe2x80x9d memory. Such computer systems are said to have a non-uniform memory access because each processor has lower access latency with respect to data stored in the system memory at its local node than with respect to data stored in the system memory at a remote node. NUMA systems can be further classified as either non-coherent or cache coherent, depending upon whether or not data coherency is maintained between caches in different nodes. The complexity of cache coherent NUMA (CC-NUMA) systems is attributable in large measure to the additional communication required for hardware to maintain data coherency not only between the various levels of cache memory and system memory within each node but also between cache and system memories in different nodes. NUMA computer systems do, however, address the scalability limitations of conventional SMP computer systems since each node within a NUMA computer system can be implemented as a smaller SMP system. Thus, the shared components within each node can be optimized for use by only a few processors, while the overall system benefits from the availability of larger scale parallelism while maintaining relatively low latency.
A principal performance concern with CC-NUMA computer systems is the latency associated with communication transactions transmitted via the interconnect coupling the nodes. Because of the relatively high latency associated with request transactions transmitted on the nodal interconnect versus transactions on the local interconnects, it is useful and desirable to reduce unnecessary communication over the nodal interconnect in order to improve overall system performance.
In accordance with the present invention, a non-uniform memory access (NUMA) computer system includes at least a local processing node and a remote processing node that are each coupled to a node interconnect. The local processing node includes at least a processor and a local system memory, and the remote processing node includes at least a processor having an associated cache memory, a local system memory, and a node controller that are each coupled to a local interconnect. In response to receipt by the node controller of a request transaction transmitted from the local processing node via the node interconnect, the node controller at the remote processing node issues the request transaction on the local interconnect. If the request transaction receives a retry response at the remote processing node, the node controller does not immediately return the retry response to the local processing node. Instead, the node controller reissues the request transaction on the local interconnect at least once, thus giving the request transaction another opportunity to complete successfully. In one embodiment, the request transaction is reissued on the local interconnect of the remote processing node until a response other than retry is received or until a retry limit is reached.
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.