Given the continually increasing reliance on computers in contemporary society, computer technology has had to advance on many fronts to keep up with increased demand. One particular subject of significant research and development efforts is parallelism, i.e., the performance of multiple tasks in parallel.
A number of computer software and hardware technologies have been developed to facilitate increased parallel processing. From a hardware standpoint, computers increasingly rely on multiple processors to provide increased workload capacity. Furthermore, some processors have been developed that support the ability to execute multiple threads in parallel, effectively providing many of the same performance gains attainable through the use of multiple processors.
A significant bottleneck that can occur in a multi-processor computer, however, is associated with the transfer of data to and from each processor, often referred to as communication cost. Many computers rely on a main memory that serves as the principal working storage for the computer. Retrieving data from a main memory, and storing data back into a main memory, however, typically proceeds at a significantly slower rate than the rate at which data is transferred internally within a processor. For this reason, intermediate buffers known as caches are often utilized to temporarily store data from a main memory when that data is being used by a processor. These caches are typically smaller in size, but significantly faster, than the main memory. Caches take advantage of the temporal and spatial locality of data, and as a result, can significantly reduce the number of comparatively slower main memory accesses occurring in a computer and decrease the overall communication cost experienced by the computer.
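By way of illustration only, the effect of temporal and spatial locality on the number of main memory accesses may be sketched with a toy direct-mapped cache. The function name, the cache geometry (four lines of four addresses each) and the three access patterns are assumptions introduced for this sketch and do not describe any particular design.

```python
def count_main_memory_accesses(addresses, num_lines=4, line_size=4):
    """Simulate a tiny direct-mapped cache and count accesses that miss.

    Every miss stands for one comparatively slow main memory access.
    """
    cache = [None] * num_lines            # each slot remembers which block it holds
    misses = 0
    for addr in addresses:
        block = addr // line_size         # neighbouring addresses share a block
        index = block % num_lines         # direct-mapped: one possible slot per block
        if cache[index] != block:         # miss: the block is fetched from main memory
            cache[index] = block
            misses += 1
    return misses


# Hypothetical access patterns, 16 accesses each.
sequential = list(range(16))              # spatial locality: 4 blocks touched in order
repeated = [0, 1, 2, 3] * 4               # temporal locality: one block reused
strided = [i * 16 for i in range(16)]     # no locality: every access evicts the slot

print(count_main_memory_accesses(sequential))  # 4
print(count_main_memory_accesses(repeated))    # 1
print(count_main_memory_accesses(strided))     # 16
```

In this sketch the two patterns exhibiting locality reach main memory 4 times and 1 time out of 16 accesses, whereas the strided pattern reaches main memory on every access.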
In many computers, all of the processors share the same main memory, an architecture that is often referred to as Symmetric Multiprocessing (SMP). One limitation of such computers, however, occurs as a result of the typical requirement that all communications between the processors and the main memory occur over a common bus or interconnect. As the number of processors in a computer increases, the communication traffic to the main memory becomes a bottleneck on system performance, irrespective of the use of intermediate caches.
To address this potential bottleneck, a number of computer designs rely on another shared memory architecture referred to as Non-Uniform Memory Access (NUMA), whereby multiple main memories are essentially distributed across a computer and physically grouped with sets of processors and caches into physical subsystems or modules, also referred to herein as “nodes”. The processors, caches and memory in each node of a NUMA computer are typically mounted to the same circuit board or card to provide relatively high speed interaction between all of the components that are “local” to a node. Often, a “chipset”, including one or more integrated circuit chips, is used to manage data communications between the processors and the various components in the memory architecture. The nodes are also coupled to one another over a network such as a system bus or a collection of point-to-point interconnects, thereby permitting processors in one node to access data stored in another node, thus effectively extending the overall capacity of the computer. Memory access, however, is referred to as “non-uniform” since the access time for data stored in a local memory (i.e., a memory resident in the same node as a processor) is often significantly shorter than for data stored in a remote memory (i.e., a memory resident in another node).
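The practical consequence of non-uniform access may be sketched as a weighted average of local and remote access times. The function name and the figures used below (100 ns local, 300 ns remote) are hypothetical values chosen only to make the arithmetic concrete; they are not measurements of any system.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Expected access time when local_fraction of accesses hit local memory."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, for illustration only.
mostly_local = average_access_time(0.75, 100.0, 300.0)
mostly_remote = average_access_time(0.25, 100.0, 300.0)

print(mostly_local)   # 150.0
print(mostly_remote)  # 250.0
```

Under these assumed figures, shifting the workload from 75% local to 25% local accesses raises the average access time from 150 ns to 250 ns.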
Irrespective of the type of architecture used, however, the latency of memory accesses is often a significant factor in the overall performance of a computer. As a result, significant efforts have been directed to obtaining the smallest memory latency possible for any given memory request.
In a computer where processors are coupled to a memory system via an intermediate chipset, read or load requests typically must be forwarded to the chipset via a processor bus that interconnects the requesting processor to the chipset, which then determines where the requested data currently resides (e.g., in main memory, in a shared cache, in the local cache of another processor, or, in the case of a NUMA system, in a memory or cache in a different node). The determination is often made by performing a lookup of a coherency directory, which may be centralized, or in some designs, distributed to multiple points in the architecture. In addition, an update to the coherency directory may also be made based upon the fact that the requested data will be resident in the requesting processor after completion of the request.
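The lookup and update described above may be sketched, in greatly simplified form, as a table keyed by cache line address. The class name, method names and component labels are assumptions made for this sketch. In particular, the sketch tracks only a single holder per line, whereas an actual coherency directory would typically also track sharing states.

```python
MAIN_MEMORY = "main_memory"


class CoherencyDirectory:
    """Toy directory: maps a cache line address to the component holding it."""

    def __init__(self):
        self._holder = {}

    def lookup(self, line):
        # Lines never requested before are assumed to reside in main memory.
        return self._holder.get(line, MAIN_MEMORY)

    def record_read(self, line, requester):
        """Return where the data currently resides, then note the new holder."""
        source = self.lookup(line)
        self._holder[line] = requester    # data will be resident in the requester
        return source


directory = CoherencyDirectory()
print(directory.record_read(0x40, "cpu0"))  # main_memory
print(directory.record_read(0x40, "cpu1"))  # cpu0
print(directory.lookup(0x40))               # cpu1
```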
Based upon the location of the requested data, the chipset will then initiate the retrieval of the requested data, and once the data is returned, the data is typically stored in a buffer in the chipset. Thereafter, a communications interface in the chipset, e.g., the processor bus interface that couples to the requesting processor over the processor bus, will use the return data by retrieving the data from the chipset buffer and driving the return data to the requesting processor over the processor bus. The latency of the request is typically measured from the time that the request is forwarded across a processor bus by a requesting processor, until the return data is driven back across the processor bus to the requesting processor.
One operation that can affect the latency of a memory request in conventional designs is associated with updating the coherency directory. Specifically, in many designs, the data returned from a memory or other source, and temporarily stored in a chipset buffer, is not forwarded to the requesting processor by the processor bus interface until after the coherency directory is updated to reflect the new status of the relevant data. This is typically due to the need to verify that the memory request will not need to be canceled prior to returning the data to the requesting processor. In many such designs, therefore, the data being returned waits in the chipset buffer until a confirmation is received from the coherency directory indicating that the data is ready to be forwarded to the requesting processor.
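The additional latency attributable to waiting for the coherency directory may be sketched by comparing two event times, expressed in cycles from the issuance of the request. The function names and cycle counts are hypothetical and serve only to illustrate the dependency described above.

```python
def forward_cycle(data_arrival, directory_confirmation):
    """Cycle at which data may leave the chipset buffer in a conventional design.

    The data is released only after it has arrived AND the directory
    has confirmed the update, i.e., at the later of the two events.
    """
    return max(data_arrival, directory_confirmation)


def buffer_wait(data_arrival, directory_confirmation):
    """Cycles the return data sits idle in the buffer awaiting confirmation."""
    return forward_cycle(data_arrival, directory_confirmation) - data_arrival


# Hypothetical timings: data is back at cycle 10, confirmation arrives at cycle 14.
print(buffer_wait(10, 14))  # 4
# If confirmation arrives before the data, no waiting is added.
print(buffer_wait(10, 8))   # 0
```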
In a multinode system such as a NUMA-based system, a similar issue arises with respect to communicating data requested by another node over the communication link between the nodes. Some conventional designs, for example, utilize scalability port interfaces in a chipset to provide high speed point-to-point interconnections between pairs of nodes. From the perspective of the chipset in a node, a memory request received over a scalability port is handled much like a memory request from a local processor, with the primary difference being that the communications protocol used on the scalability port is often packet-based, and requires that data be formatted into specific packets of information prior to being sent to another node via the scalability port. In all other respects, i.e., performing a lookup of a coherency directory to identify the source of the requested data, updating the coherency directory, retrieving the requested data from the source, storing the return data in a buffer, and waiting for confirmation from the coherency directory, there is little difference between memory requests originated by local processors and those originated by remote nodes.
By requiring the data requested by a processor or another node in a multinode system to wait in the buffer, several cycles of additional latency may be introduced. Furthermore, given the pipelined nature of most memory systems, this requirement typically calls for larger buffers to enable the data for multiple requests to be retained in the chipset while awaiting confirmation from the coherency directory. Larger buffers often lead to increased cost and complexity for a given design, and as such, it is typically desirable to minimize the amount of buffering required in a chipset whenever possible.
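The relationship between hold time and buffer size in a pipelined memory system may be sketched by counting how many entries are resident at once. The function name and the arrival pattern (one return per cycle for eight cycles) are assumptions made for illustration, and hold_cycles is assumed to be at least 1.

```python
def peak_buffer_occupancy(arrival_cycles, hold_cycles):
    """Largest number of entries held at the same time.

    Each returned data item occupies one buffer entry from its arrival
    cycle until hold_cycles later (hold_cycles >= 1 is assumed).
    """
    events = []
    for t in arrival_cycles:
        events.append((t, 1))                  # entry allocated
        events.append((t + hold_cycles, -1))   # entry freed
    # Tuples sort by time, then by delta, so a free (-1) in a given cycle
    # is processed before an allocation (+1) in that same cycle.
    events.sort()
    occupancy = 0
    peak = 0
    for _, delta in events:
        occupancy += delta
        peak = max(peak, occupancy)
    return peak


arrivals = range(8)                            # one return per cycle, hypothetical
print(peak_buffer_occupancy(arrivals, 1))      # 1
print(peak_buffer_occupancy(arrivals, 5))      # 5
```

With one return per cycle, each additional cycle of waiting in this sketch costs one additional buffer entry.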
NUMA-based systems may also be subject to additional latencies associated with processing responses from other nodes whenever data requested by a processor in one node will be sourced by another node via the scalability port. In particular, in many designs a coherency directory on a node will be able to determine that requested data will be sourced by another node, although which particular node will source the data is typically not known. As a result, many such systems utilize a broadcast protocol to forward the request to all other nodes in the system. Then, once each node receives the request, the node determines whether that node should return the requested data. If so, the node returns the data in a response, along with an indication of the state of the data, e.g., whether the node has a shared or exclusive copy of the data. If not, the node still sends a non-data response to confirm that the node received the request, which may also indicate that the node does not have a valid copy of the data. The node that broadcasts the request typically waits to receive responses from all of the nodes before updating the coherency directory and allowing the return data to be forwarded to the requesting processor on the node.
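The broadcast exchange described above may be sketched as follows. The function name, the node labels, the response tuples and the state strings are hypothetical, and the sketch omits timing and the case in which no node holds the data.

```python
def broadcast_read(line, requester, node_caches):
    """Send a read request to every other node and gather all responses.

    node_caches maps a node name to a dict of line -> (state, value).
    Every node answers: with data if it holds the line, otherwise with
    a non-data response acknowledging the request.
    """
    responses = {}
    for node, cache in node_caches.items():
        if node == requester:
            continue
        if line in cache:
            state, value = cache[line]
            responses[node] = ("data", state, value)
        else:
            responses[node] = ("non-data", "invalid", None)
    # The requester proceeds only once all responses have been gathered.
    data = [r for r in responses.values() if r[0] == "data"]
    value = data[0][2] if data else None
    return value, responses


# Hypothetical four-node system in which node B holds the line exclusively.
node_caches = {"A": {}, "B": {0x100: ("exclusive", 42)}, "C": {}, "D": {}}
value, responses = broadcast_read(0x100, "A", node_caches)
print(value)            # 42
print(len(responses))   # 3
```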
In some designs, a directory protocol may be used in lieu of a broadcast protocol. With a directory protocol, a request is sent to a central directory in the system, which looks up the current node for the requested data and sends a request to that node. The node that receives the request then forwards the requested data back to the original requesting node, and notifies the central directory to indicate a transfer in ownership of the data to the requesting node (if appropriate).
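For comparison, the message flow of a directory protocol may be sketched as follows. The function name, the message labels and the data structures are assumptions made for this sketch; an actual directory would carry considerably more state.

```python
def directory_read(line, requester, directory, node_memories, transfer_ownership=True):
    """Resolve a read through a central directory and log each message.

    directory maps line -> owning node; node_memories maps node -> {line: value}.
    Returns the value and a list of (sender, receiver, kind) messages.
    """
    messages = [(requester, "directory", "request")]
    owner = directory[line]                       # directory looks up current node
    messages.append(("directory", owner, "forwarded request"))
    value = node_memories[owner][line]
    messages.append((owner, requester, "data"))   # owner replies to the requester
    if transfer_ownership:
        del node_memories[owner][line]
        node_memories[requester][line] = value
        directory[line] = requester
        messages.append((owner, "directory", "ownership notice"))
    return value, messages


# Hypothetical three-node system in which node C owns the line.
directory = {0x200: "C"}
node_memories = {"A": {}, "B": {}, "C": {0x200: 7}}
value, messages = directory_read(0x200, "A", directory, node_memories)
print(value)               # 7
print(directory[0x200])    # A
print(len(messages))       # 4
```

In this sketch node B is never contacted, whereas under a broadcast protocol every node other than the requester would receive the request and send a response.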
While directory protocols often scale better, broadcast protocols are often preferred for performance reasons, particularly in smaller systems. One drawback of many broadcast protocols, however, results from the need to wait for all responses to a request before allowing a processor on a node to use return data received from another node in the system. In particular, in some circumstances, the requested data may be returned in a response from one node before the responses from other nodes have been received. As a result, even once the requested data is received from another node, several cycles may elapse before all responses are received from the other nodes and the data is forwarded to the requesting processor. Consequently, the return data, which has already been received by the node, may need to be stored in a buffer and held for several cycles.
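The number of cycles for which already-received data is held under such a broadcast protocol may be sketched from the arrival times of the individual responses. The function name, node labels and cycle counts are hypothetical.

```python
def cycles_data_is_held(response_cycles, data_node):
    """Cycles between arrival of the data response and arrival of the last response.

    response_cycles maps each responding node to the cycle its response arrives;
    data_node is the node whose response carries the requested data.
    """
    data_ready = response_cycles[data_node]
    all_responses_in = max(response_cycles.values())
    return all_responses_in - data_ready


# Hypothetical arrival times: node B returns the data first, at cycle 12.
response_cycles = {"B": 12, "C": 20, "D": 17}
print(cycles_data_is_held(response_cycles, "B"))  # 8
```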
Therefore, a significant need continues to exist for a manner of minimizing the latency of memory requests in a shared memory data processing system.