The present invention relates generally to memory access and, more particularly, to a system and a method for manipulating requests for shared data in a multi-node computer network system.
Conventional cache coherent non-uniform memory access ("CC NUMA") is known. In a multi-node system using non-uniform memory access, if a central processing unit ("CPU") accesses memory at its own node, i.e., a local node, the time to access data is fast. By contrast, in a non-uniform memory access at a node other than the central processing unit's own node, i.e., a remote node, the time to access the data is slow.
A conventional protocol referred to as Modified/Exclusive, Shared, Invalid ("MESI") evolved to help increase data access speed. In this protocol, a memory controller stores and keeps track of information about data in a multi-node system. It determines on which node the data presently resides.
The Remote Access Cache (RAC) caches the data for remote requests in order to speed access to remote data by a subsequent memory request from a node in the same group.
In a conventional CC NUMA system, when a first processor issues a "read-to-share" or a "read-to-own" request to remote memory, it must first access the Remote Access Cache (RAC) and only then access the directory for the remote memory agent. A problem with this approach is that it serializes the RAC access and the remote directory access. Because the two operations cannot overlap, remote memory latency increases.
Further, the data in the RAC could be present in the Modified (M) or Exclusive (E) state. If the RAC has the line in E state, then it has "read-write" permission for its copy of the line, but it has not yet written to the line. If the RAC has the line in M state, then it has "read-write" permission for its cached copy of the line and it has already modified the line. When the most recent data is in the RAC and the state of the cache line is M, the RAC supplies the cache line in response to a "read-to-share" or "read-to-own" request. A remote "read-to-share" request which hits a line in the M state in the RAC must downgrade the line state from M to S by writing the line back to memory and must return a shared copy to the requestor. A remote "read-to-own" request must send an ownership transfer notification to the directory to indicate the new owner of the line. Ownership transfer notification is required because the directory must always track which cache is the exclusive owner of a cache line in the ME state at the directory. However, ownership transfer complicates the protocol.
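The extra work that a conventional RAC performs on a hit can be illustrated with a minimal sketch. The names `ConventionalRacLine` and `conventional_rac_hit` and the action strings are hypothetical and serve only this illustration; the invalidation of the old copy on a "read-to-own" hit is an assumption, not something stated above.

```python
from dataclasses import dataclass


@dataclass
class ConventionalRacLine:
    state: str   # "M", "E", "S", or "I"
    data: bytes


def conventional_rac_hit(line, request, requester):
    """Return the protocol actions that a hit on `line` triggers."""
    actions = []
    if request == "read-to-share" and line.state == "M":
        # Downgrade M -> S: the dirty line is written back to memory first.
        actions.append("writeback-to-memory")
        line.state = "S"
    elif request == "read-to-own" and line.state in ("M", "E"):
        # The directory must be told who the new exclusive owner is.
        actions.append("notify-directory-new-owner:" + requester)
        line.state = "I"  # assumption: the previous holder gives up its copy
    actions.append("supply-line")
    return actions
```

Both branches add a message on top of supplying the line, which is the protocol complexity described above.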
If the remote "read-to-share" access misses in the RAC, a line which has been modified may first need to be evicted from the RAC in order to create space for the new line to be installed in the RAC. The possibility of cache line eviction requires that the RAC must be read on every "read-to-share" or "read-to-own" access.
In addition, because a cache line in the Exclusive or Modified state can be present in exactly one RAC in the system, performance does not scale well with a large number of RACs. As the number of RACs in the system increases, the odds of hitting Exclusive or Modified data in the RAC decline.
Therefore, there is a need for a memory access system and method that reduces the latency of remote memory accesses. Such a new system should provide congestion relief by bypassing the RAC when it is busy. In addition, such a new system should simplify the protocol by eliminating eviction of Modified data from the RAC and should eliminate ownership transfer notification to the directory whenever a writeback or a HIT to Modified or Exclusive data occurs. Most importantly, such a system should avoid serializing the RAC access and the memory access, thereby reducing memory latency.
A preferred embodiment of the present invention includes a computer network system for accessing data that includes a plurality of groups, each group including a plurality of nodes that couple through an interconnect system, each node including one or more central processing units (or processors) with each processor having a processor cache. Each node further includes a memory agent, a main memory, and a directory coupled to the processors and processor caches.
The system also includes a directory coupled to a Request Outstanding Buffer (ROB) to record the progress of a memory transaction in the system. A cache line is the smallest unit of data that can be stored in cache and tracked by the directory. Data is supplied through the cache line. The information stored in the directory refers to which node(s) has a particular cache line as well as the status of data in those cache lines. The status of data in the cache line at the directory may be, for example, Modified/Exclusive ("ME"), Shared ("S"), or Invalid ("I"). Modified/Exclusive state indicates that the line has been read by a caching memory agent for read-write access. Shared state indicates that the line has been read by a caching memory agent for read-only access. Invalid state indicates that the line is not cached in any cache in the system. If the directory state is Modified/Exclusive (ME), the owning node is also recorded in the directory entry. If the directory state is Shared (S), a list of sharing nodes is recorded in the directory entry.
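By way of illustration only, a directory entry with the three states described above might be modeled as follows. The class and method names (`DirState`, `DirectoryEntry`, `grant_shared`, `grant_exclusive`) are hypothetical, nodes are identified by integers as an assumption, and the recall of a line from its ME owner and the invalidation of sharers are not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class DirState(Enum):
    ME = "Modified/Exclusive"   # read by one caching agent for read-write access
    S = "Shared"                # read by caching agents for read-only access
    I = "Invalid"               # not cached anywhere in the system


@dataclass
class DirectoryEntry:
    state: DirState = DirState.I
    owner: Optional[int] = None                     # recorded only in ME state
    sharers: Set[int] = field(default_factory=set)  # recorded only in S state

    def grant_shared(self, node: int) -> None:
        if self.state is DirState.ME:
            # Recalling the line from its owner is outside this sketch.
            raise ValueError("line is owned exclusively; recall not modeled")
        self.state = DirState.S
        self.sharers.add(node)

    def grant_exclusive(self, node: int) -> None:
        # Assumes any sharers have already been invalidated.
        self.state = DirState.ME
        self.owner = node
        self.sharers.clear()
```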
The system further comprises the ROB coupled to a memory agent to record the progress of data requests. The ROB may be connected to remote nodes through the global interconnect system. Entries in the ROB include the following fields: REQUEST, STATE, and TRANSACTION ID.
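A minimal sketch of a ROB entry with the three named fields is given below. The class names, the sequential allocation of transaction IDs, and the STATE values "PENDING" and "RAC_HIT" are assumptions made for illustration; the text above does not specify them.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class RobEntry:
    request: str          # REQUEST field, e.g. "read-to-share"
    state: str            # STATE field; the values used here are hypothetical
    transaction_id: int   # TRANSACTION ID field


class RequestOutstandingBuffer:
    def __init__(self):
        self._entries = {}
        self._ids = count()   # assumption: IDs are handed out sequentially

    def allocate(self, request):
        """Record a new outstanding request and return its transaction ID."""
        tid = next(self._ids)
        self._entries[tid] = RobEntry(request, "PENDING", tid)
        return tid

    def update(self, tid, state):
        """Record progress of the transaction, e.g. a RAC HIT or MISS."""
        self._entries[tid].state = state

    def retire(self, tid):
        """Remove and return the entry once the request is satisfied."""
        return self._entries.pop(tid)
```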
The system further includes a remote access cache (RAC) to cache remote memory references. The RAC caches only clean remote data in S state and does not cache remote data in the ME state. Entries in the RAC include the following fields: ADDRESS TAG, STATE, and DATA.
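The rule that the RAC holds only clean Shared data can be sketched as follows. The names `RacEntry` and `RemoteAccessCache` are hypothetical, and keying the cache by the full address tag in a dictionary is a simplification that ignores sets, ways, and capacity.

```python
from dataclasses import dataclass


@dataclass
class RacEntry:
    address_tag: int   # ADDRESS TAG field
    state: str         # STATE field; only "S" is ever stored
    data: bytes        # DATA field


class RemoteAccessCache:
    def __init__(self):
        self._lines = {}

    def install(self, address_tag, data, dir_state):
        """Install a remote line only if the directory reports it as Shared."""
        if dir_state != "S":
            return False   # lines in ME state are never cached in the RAC
        self._lines[address_tag] = RacEntry(address_tag, "S", data)
        return True

    def lookup(self, address_tag):
        """Return the entry on a HIT, or None on a MISS."""
        return self._lines.get(address_tag)
```

Because every entry is a clean copy of memory, an entry in this sketch can be dropped or overwritten without a writeback.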
A preferred method for accessing data comprises: requesting "read-to-share" data from a memory line in a remote node; issuing two requests simultaneously, one to the RAC and one to the directory for the remote memory node; returning a MISS to the ROB if the cache line is not cached in the RAC; returning data to the requesting processor in group A from the remote memory node and installing the cache line in the RAC. Alternatively, if the cache line is cached in the RAC, returning a RAC HIT to the ROB. The fact that there is a "HIT" in the cache indicates that the state of the line in the directory is Shared, and not Modified/Exclusive or Invalid. Then, modifying the STATE field in the ROB accordingly to indicate whether the cache line is cached in the RAC. Finally, returning data to the requesting node. The data received by the ROB from the remote node is discarded once the original request is satisfied with the memory line cached in the RAC.
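The steps of the method can be sketched in software as follows. This is only an illustration under stated assumptions: the RAC and the remote node are stood in for by plain dictionaries, the two simultaneous requests are modeled with a two-worker thread pool, the requested line is assumed to be Shared at the remote directory, and the function name and the state strings are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_to_share(addr, rac, remote_memory):
    """Issue the RAC lookup and the remote request at the same time.

    `rac` and `remote_memory` are dicts mapping addresses to line data.
    Returns (data, rob_state).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        rac_future = pool.submit(rac.get, addr)             # request to the RAC
        remote_future = pool.submit(remote_memory.get, addr)  # request to the remote node

        rac_data = rac_future.result()
        if rac_data is not None:
            # RAC HIT: the line is Shared, so the cached copy is valid.
            rob_state = "RAC_HIT"
            data = rac_data
            remote_future.result()   # the remote reply arrives and is discarded
        else:
            # RAC MISS: use the remote data and install it in the RAC.
            rob_state = "RAC_MISS"
            data = remote_future.result()
            rac[addr] = data
    return data, rob_state


rac = {}
remote = {0x100: b"line"}
print(read_to_share(0x100, rac, remote))   # first access misses in the RAC
print(read_to_share(0x100, rac, remote))   # second access hits in the RAC
```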
In the present invention, the "read-to-share" request from the first processor is issued to the RAC and is also simultaneously issued to a remote home node. Overlapping these two operations avoids serializing the RAC access and the memory access. This beneficially reduces memory latency for the case when the RAC access is a MISS. Thus, if the "read-to-share" access from a processor node hits in the cache, then data can be returned and used immediately by the processor without waiting for a response from the directory controller at the remote home node. The fact that there is a HIT in the RAC indicates that the state of the line in the directory is Shared, and not Modified/Exclusive or Invalid. Since the data in the RAC is only Shared, this obviates the need to wait for the result of the directory lookup.
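The latency benefit of overlapping the two operations can be made concrete with a simple model. The function names are hypothetical and the cycle counts below are invented solely for illustration; they are not measurements of any system.

```python
def serialized_latency(t_rac, t_remote, rac_hit):
    """Conventional scheme: the remote access starts only after the RAC lookup."""
    return t_rac if rac_hit else t_rac + t_remote


def overlapped_latency(t_rac, t_remote, rac_hit):
    """Overlapped scheme: both requests are issued at the same time."""
    return t_rac if rac_hit else max(t_rac, t_remote)


# Hypothetical latencies in cycles, chosen only for illustration.
T_RAC, T_REMOTE = 20, 200

print(serialized_latency(T_RAC, T_REMOTE, rac_hit=False))   # 220
print(overlapped_latency(T_RAC, T_REMOTE, rac_hit=False))   # 200
print(overlapped_latency(T_RAC, T_REMOTE, rac_hit=True))    # 20
```

In this model the two schemes differ only on a MISS, where the serialized scheme pays for the RAC lookup in addition to the remote access.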
The present invention also beneficially simplifies the protocol by eliminating evictions of data in the RAC before installing the new cache line. Further, the present invention allows congestion relief since the RAC can be bypassed whenever the RAC is busy. In this situation, the "read" request goes directly to the remote home node, bypassing the RAC. Data is then returned directly to the requestor and is not installed in the RAC. This is possible because if the cache line is present in the RAC, then it is in the Shared state. The data in the RAC is always a copy of the data in the memory. Therefore, the data can be returned from the memory when the RAC is bypassed.
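The bypass behavior can be sketched as follows, again with dictionaries standing in for the RAC and the remote memory. The function name, the `rac_busy` flag, and the returned status strings are hypothetical, and the lookup is written sequentially for clarity rather than overlapped.

```python
def read_with_bypass(addr, rac, remote_memory, rac_busy):
    """Return (data, path) for a read request to a remote line."""
    if rac_busy:
        # Congestion relief: go straight to the remote home node.
        # The returned data is not installed in the RAC.
        return remote_memory[addr], "BYPASS"
    if addr in rac:
        return rac[addr], "RAC_HIT"
    data = remote_memory[addr]
    rac[addr] = data   # clean Shared copy; nothing needs to be written back
    return data, "RAC_MISS"
```

Bypassing is safe in this sketch because a RAC entry, when present, always equals the copy in memory, so either source returns the same data.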
Next, the RAC mechanism of the present invention provides a greater degree of fault tolerance without incurring any performance overhead since a RAC access error can be simply treated as a RAC MISS. When the data from memory is installed, the error is corrected with no additional overhead.
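Treating a RAC access error as a MISS can be illustrated with the following sketch. The additive checksum is a hypothetical stand-in for whatever error detection a real RAC would use, and the function names are likewise assumptions.

```python
def checksum(data):
    """Hypothetical error-detection code: byte sum modulo 256."""
    return sum(data) % 256


def rac_lookup(rac, addr):
    """Return the cached data, or None on a MISS or a detected error."""
    entry = rac.get(addr)
    if entry is None:
        return None
    data, stored_check = entry
    if checksum(data) != stored_check:
        return None   # access error is simply treated as a MISS
    return data


def read_line(addr, rac, remote_memory):
    data = rac_lookup(rac, addr)
    if data is None:
        data = remote_memory[addr]
        # Installing the line from memory overwrites the faulty entry.
        rac[addr] = (data, checksum(data))
    return data
```

No separate recovery path is needed in this sketch, since the ordinary MISS handling replaces the corrupted entry with the clean copy from memory.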
Finally, the presence of the RAC does not increase memory access latency. That is, the latency to remote memory with a RAC MISS is the same as the latency to remote memory without a RAC. Therefore, the RAC can only improve performance, even if the miss rate of the RAC is high.
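This claim can be checked with a simple average-latency model. The sketch assumes that a RAC HIT costs `t_rac`, that a MISS costs exactly the remote latency `t_remote`, and that `t_rac` does not exceed `t_remote`; the function name and the numbers are hypothetical.

```python
def average_remote_latency(miss_rate, t_rac, t_remote):
    """Average latency when a MISS costs no more than having no RAC at all."""
    return (1.0 - miss_rate) * t_rac + miss_rate * t_remote


# Hypothetical latencies in cycles, chosen only for illustration.
T_RAC, T_REMOTE = 20, 200

for miss_rate in (0.0, 0.5, 0.9, 1.0):
    # The average never exceeds T_REMOTE, the latency without a RAC.
    print(miss_rate, average_remote_latency(miss_rate, T_RAC, T_REMOTE))
```

Even at a miss rate of 1.0 the model yields the same latency as a system without a RAC, and any lower miss rate yields less.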