1. Field of the Invention
The present invention is directed to data processing systems. More specifically, the present invention is directed to a method, apparatus, and computer program product that provides an additional cache coherency protocol state that predicts the location of a shared memory block.
2. Description of the Related Art
A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple nodes coupled together using a system interconnect that typically comprises one or more system address, data, and control buses. Commands can be transmitted from one node to another by being broadcast on the system interconnect.
Each node typically includes multiple processing units all coupled to the local node interconnect, which typically comprises one or more address, data, and control buses. Coupled to the node interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Because multiple processor cores may request write access to the same cache line of data and because modified cache lines are not immediately synchronized with system memory, the cache hierarchies of multiprocessor computer systems typically implement a cache coherency protocol to ensure at least a minimum level of coherence among the various processor cores' “views” of the contents of system memory. In particular, cache coherency requires, at a minimum, that after a processing unit accesses a copy of a memory block and subsequently accesses an updated copy of the memory block, the processing unit cannot again access the old copy of the memory block.
A cache coherency protocol typically defines a set of cache states stored in association with the cache lines of each cache hierarchy, as well as a set of coherency messages utilized to communicate the cache state information between cache hierarchies. In a typical implementation, the cache state information takes the form of the well-known MESI (Modified, Exclusive, Shared, Invalid) protocol, and the coherency messages indicate a protocol-defined coherency state transition in the cache hierarchy of the requester and/or the recipients of a memory access request.
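The four MESI states and the effect of a coherency message on a recipient's cache line can be sketched as follows. This is a minimal illustration of the textbook snooping form of MESI, not the protocol of any particular system; the function name `on_snoop` and its boolean parameter are hypothetical, and data write-back is omitted.

```python
from enum import Enum


class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_snoop(state: MESI, remote_is_write: bool) -> MESI:
    """Next state of a local cache line after observing another cache's request.

    Assumption: textbook MESI. A remote write invalidates every other copy;
    a remote read demotes a Modified or Exclusive copy to Shared.
    """
    if state is MESI.INVALID:
        return MESI.INVALID  # nothing cached, nothing to change
    if remote_is_write:
        return MESI.INVALID  # another cache is taking ownership
    return MESI.SHARED       # another cache now also holds a copy


print(on_snoop(MESI.EXCLUSIVE, remote_is_write=False).name)
print(on_snoop(MESI.SHARED, remote_is_write=True).name)
```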
A memory access request is a request to access data within the computer system. The memory access request can be a request to either read or write the particular data. The memory access request includes an address which identifies the particular data to be accessed.
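As a minimal sketch, such a request can be modeled as a record holding an access type and an address. The names `AccessType` and `MemoryAccessRequest` and the example address are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class AccessType(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class MemoryAccessRequest:
    access: AccessType  # whether the data is to be read or written
    address: int        # identifies the particular data to be accessed


# Hypothetical read request for the data at address 0x1F40.
req = MemoryAccessRequest(AccessType.READ, 0x1F40)
print(req)
```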
Several copies of the data may exist concurrently within a computer system, and these copies may differ slightly from one another. The cache coherence protocol is a process for, among other things, tracking which copy of the data is currently valid. Each copy of the data is referred to herein as either a memory block or a cache line. The cache coherence protocol dictates which cache coherence protocol state is associated with each cache line. Therefore, at any given time, each cache line is in one of the cache coherence protocol “states”.
As described above, processor cores in an SMP computer system are clustered into nodes. Each node typically includes multiple processor cores. Two-level cache coherence protocols exploit the clustering of processor cores to conserve global bandwidth by broadcasting read requests to the local node first, and only sending the requests to remote nodes if necessary. Thus, in the prior art, when a processor core needs to read a particular cache line, the processor core always broadcasts the read request first to the other processor cores that are included in the broadcasting processor core's node. This node is the local node with respect to the requesting processor core.
If the read request is not satisfied within the local node, the read request is then broadcast to all of the remote nodes so that the request can be satisfied within one of the remote nodes.
This two-step process reduces global traffic when a read request can be satisfied within the local node. When the read request can be satisfied within the local node, a global broadcast of the read request to the remote nodes is not necessary and is avoided. However, if none of the processor cores in the local node is able to satisfy the read request, the processor core then broadcasts the read request to the remaining nodes. These remaining nodes are the remote nodes.
FIG. 7 illustrates a high level flow chart that depicts broadcasting a read command to access particular data first to a local node and then broadcasting the command to remote nodes if a valid copy of the data is not found in the local node in accordance with the prior art. The process starts as depicted by block 700 and thereafter passes to block 702 which illustrates the particular processor core needing to read particular data. This is a read request.
Next, block 704 depicts the particular processor core first checking its own local cache to determine if the processor core is able to satisfy the request in its own cache. This is the cache that is included within the processor core that needs to access the particular data. Thereafter, block 706 illustrates a determination of whether or not the processor core was able to satisfy the read request within the processor core's own local cache. The read request is satisfied within the particular processor core's cache when a valid copy of the data is found within the processor core's cache. If a determination is made that the processor core was able to satisfy the read request within its local cache, the process passes to block 708 which depicts satisfying the read request within the processor core's cache. The process then passes back to block 702.
Referring again to block 706, if a determination is made that the processor core was not able to satisfy the read request within its local cache, the process passes to block 710 which illustrates the read request being broadcast to all processor cores in only the node that includes this requesting particular processor core. This node is the local node with respect to the requesting particular processor core. Thus, the request is always broadcast first to only the local node.
Next, block 712 illustrates a determination of whether or not the processor core was able to satisfy the read request within the processor core's own local node. The read request is satisfied within the local node when a valid copy of the data is found within a cache within one of the other processor cores that are included in this processor core's local node. If a determination is made that the processor core was able to satisfy the read request within its local node, the process passes to block 714 which depicts satisfying the read request within the processor core's node. The process then passes to block 702.
Referring again to block 712, if a determination is made that the processor core was not able to satisfy the read request within its local node, the process passes to block 716 which illustrates the read request being broadcast to the remote nodes. Next, block 718 depicts satisfying the read request within a processor core that is included within one of the remote nodes. Thereafter, the process passes to block 702.
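The flow of FIG. 7 can be sketched in code as follows. This is a simplified model under stated assumptions: the names `Core`, `Node`, and `read` are hypothetical, each cache is modeled as a dictionary that holds only valid copies, and the final "not found" result is an addition for completeness, since the flow chart assumes that a remote node satisfies the request.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    cache: dict = field(default_factory=dict)  # address -> valid copy of the data


@dataclass
class Node:
    cores: list


def read(requester, local_node, remote_nodes, address):
    """Return (data, where_satisfied, broadcasts_issued) for one read request."""
    # Blocks 704-708: check the requesting core's own cache first.
    if address in requester.cache:
        return requester.cache[address], "own cache", []

    # Blocks 710-714: broadcast only to the other cores in the local node.
    broadcasts = ["local"]
    for core in local_node.cores:
        if core is not requester and address in core.cache:
            return core.cache[address], "local node", broadcasts

    # Blocks 716-718: only now broadcast to the remote nodes.
    broadcasts.append("remote")
    for node in remote_nodes:
        for core in node.cores:
            if address in core.cache:
                return core.cache[address], "remote node", broadcasts

    # Assumption: not shown in FIG. 7; no cache holds a valid copy.
    return None, "not found", broadcasts


a, b, c = Core("a"), Core("b"), Core("c")
c.cache[0x40] = "payload"
local, remote = Node([a, b]), Node([c])
print(read(a, local, [remote], 0x40))
# prints ('payload', 'remote node', ['local', 'remote'])
```

In this example the data resides only in a remote node, so the local broadcast is issued and fails before the remote broadcast is sent.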
While the two-step read process described above reduces global request traffic when a request can be satisfied within the local node, requests for data that is not located in the local node are delayed because the local node is always checked first.
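This delay can be expressed with a simple latency model. The model and its symbols are assumptions introduced here for illustration and do not appear in the original description: T_local is the time to complete a local-node broadcast, T_global is the time to complete a broadcast to the remote nodes, and h is the fraction of read requests that the local node can satisfy. Bandwidth effects are ignored.

```latex
\[
T_{\text{two-step}} = T_{\text{local}} + (1 - h)\,T_{\text{global}},
\qquad
T_{\text{direct}} = T_{\text{global}}
\]
```

Under this model, every request whose data resides in a remote node takes T_local + T_global instead of T_global, a penalty of T_local per request. The average latency of the two-step process is lower than that of a direct global broadcast only when T_local < h T_global.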
Therefore, a need exists for a method, apparatus, and computer program product that provides an additional cache coherency protocol state that predicts the location of a shared memory block for reducing the number of unnecessarily broadcast local requests in order to conserve local communications bandwidth.