1. Field of the Invention
The present invention is directed to data processing systems. More specifically, the present invention is directed to a method, apparatus, and computer program product that provides an additional cache coherency protocol state that predicts the location of a modified memory block.
2. Description of the Related Art
A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple nodes coupled together using a system interconnect that typically comprises one or more system address, data, and control buses. Commands can be transmitted from one node to another by being broadcast on the system interconnect.
Each node typically includes multiple processing units all coupled to the local node interconnect, which typically comprises one or more address, data, and control buses. Coupled to the node interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Because multiple processor cores may request write access to the same cache line of data and because modified cache lines are not immediately synchronized with system memory, the cache hierarchies of multiprocessor computer systems typically implement a cache coherency protocol to ensure at least a minimum level of coherence among the various processor cores' “views” of the contents of system memory. In particular, cache coherency requires, at a minimum, that after a processing unit accesses a copy of a memory block and subsequently accesses an updated copy of the memory block, the processing unit cannot again access the old copy of the memory block.
A cache coherency protocol typically defines a set of cache states stored in association with the cache lines of each cache hierarchy, as well as a set of coherency messages utilized to communicate the cache state information between cache hierarchies. In a typical implementation, the cache state information takes the form of the well-known MESI (Modified, Exclusive, Shared, Invalid) protocol, and the coherency messages indicate a protocol-defined coherency state transition in the cache hierarchy of the requester and/or the recipients of a memory access request.
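By way of illustration only, the four MESI states named above can be sketched as follows. The sketch reflects the textbook meaning of the states rather than any particular implementation; the names `MESI`, `can_read_locally`, `can_write_silently`, and `state_after_remote_write` are hypothetical and chosen solely for this example.

```python
from enum import Enum


class MESI(Enum):
    """The four cache line states of the textbook MESI protocol."""
    MODIFIED = "M"   # only valid copy; differs from system memory
    EXCLUSIVE = "E"  # only cached copy; matches system memory
    SHARED = "S"     # other caches may also hold a valid copy
    INVALID = "I"    # this copy must not be used


def can_read_locally(state):
    # Any state other than Invalid holds a usable copy of the data.
    return state is not MESI.INVALID


def can_write_silently(state):
    # In M or E no other cache holds a valid copy, so nothing
    # needs to be invalidated elsewhere before writing.
    return state in (MESI.MODIFIED, MESI.EXCLUSIVE)


def state_after_remote_write(state):
    # A write by another cache invalidates this copy, whatever
    # state it was in before.
    return MESI.INVALID
```

In this sketch, a line held in the Shared state can be read without any coherency message but cannot be written until the other copies have been invalidated.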
A memory access request is a request to access data within the computer system. The memory access request can be a request to either read or write the particular data. The memory access request includes an address which identifies the particular data to be accessed.
Several copies of the data may exist concurrently within a computer system. These copies may hold slightly different versions of the data. The cache coherence protocol is a process for, among other things, tracking which copy of the data is currently valid. Each copy of the data is referred to herein as either a memory block or a cache line. The cache coherence protocol dictates which cache coherence protocol state is associated with each cache line. Therefore, at any given time, each cache line is in one of the cache coherence protocol “states”.
As described above, processor cores in an SMP computer system are clustered into nodes, each of which typically includes multiple processor cores. Two-level cache coherence protocols exploit this clustering to conserve global bandwidth by broadcasting read requests to the local node first and sending read requests to remote nodes only if necessary. Thus, in the prior art, when a processor core needs to read a particular cache line, the processor core always broadcasts the read request first to the other processor cores that are included in the broadcasting processor core's node. This node is the local node with respect to the requesting processor core.
If the read request is not satisfied within the local node, the memory access request is then broadcast to all of the remote nodes so that the request can be satisfied within one of the remote nodes.
This two-step process reduces global traffic when a read request can be satisfied within the local node. When the read request can be satisfied within the local node, a global broadcast of the read request to the remote nodes is not necessary and is avoided. However, if none of the processor cores in the local node is able to satisfy the read request, the processor core then broadcasts the memory access request to the remaining nodes. These remaining nodes are the remote nodes.
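The two-step read process described above may be sketched, purely as an illustration, as follows. The sketch assumes that each cache is modeled as a set of addresses for which it holds a valid copy; the function name `read_two_step` and this data layout are hypothetical and do not appear in the prior art being described.

```python
def read_two_step(address, local_node, remote_nodes):
    """Return (where_satisfied, broadcasts_issued) for a read request.

    local_node   : list of caches (sets of valid addresses) belonging to
                   the other processor cores in the requester's node
    remote_nodes : list of nodes, each a list of such caches
    """
    broadcasts = ["local"]            # step 1: local node only
    if any(address in cache for cache in local_node):
        return "local", broadcasts    # no global traffic generated

    broadcasts.append("global")       # step 2: all remote nodes
    for node in remote_nodes:
        if any(address in cache for cache in node):
            return "remote", broadcasts

    # The text does not describe a read that no cache can satisfy;
    # this sketch simply reports it as unresolved.
    return None, broadcasts
```

A read that is satisfied locally returns a broadcast list containing only `"local"`, which is the bandwidth saving that the two-step process is intended to provide.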
For write requests, this two-step process cannot be used. When a processor performs a write, it must obtain an exclusive copy of the data. Thus, not only must the processor obtain the cache line to be written, but all other copies of this data in other cache lines must also be invalidated. This ensures that no other processors are writing to the same memory location at the same time, which is required in order to maintain coherence. To ensure that all copies of the data are invalidated, the write request must be broadcast to all nodes. Therefore, the prior art immediately broadcasts write requests to all nodes, including the local node and the remote nodes. Read-exclusive requests are similar to write requests in that the requesting processor wants an exclusive copy of the data, so read-exclusive requests are treated like a write. However, obtaining the cache line exclusively is usually just a performance enhancement and is not necessary for the coherence protocol to function properly.
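The invalidation requirement for writes may likewise be sketched as follows. The sketch is illustrative only and assumes that the system is modeled as a dictionary of nodes, each mapping a core name to the set of addresses validly cached by that core; the name `broadcast_write` and this layout are hypothetical.

```python
def broadcast_write(address, writer, nodes):
    """Invalidate every other copy of `address`, then give the writer the line.

    writer : (node_name, core_name) of the requesting processor core
    nodes  : dict mapping node_name -> dict mapping core_name -> set of
             validly cached addresses
    Returns the list of (node_name, core_name) copies that were invalidated.
    """
    invalidated = []
    # The request reaches every node, local and remote alike.
    for node_name, cores in nodes.items():
        for core_name, cache in cores.items():
            if (node_name, core_name) == writer:
                continue
            if address in cache:
                cache.discard(address)  # this copy is no longer valid
                invalidated.append((node_name, core_name))

    writer_node, writer_core = writer
    nodes[writer_node][writer_core].add(address)  # now the only copy
    return invalidated
```

Because a copy may reside in any node, the loop in this sketch cannot stop at the local node, which is why the two-step process is not available for writes.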
FIG. 7 illustrates a high level flow chart that depicts broadcasting memory access commands in accordance with the prior art. The process starts as depicted by block 700 and thereafter passes to block 702 which illustrates the particular processor core needing to access particular data. This is a memory access request.
Next, block 704 depicts the particular processor core first checking its own local cache to determine if the processor core is able to satisfy the request in its own cache. This is the cache that is included within the processor core that needs to access the particular data. Thereafter, block 706 illustrates a determination of whether or not the processor core was able to satisfy the memory access request within the processor core's own local cache. The memory access request is satisfied within the particular processor core's cache when a valid copy of the data is found within the processor core's cache.
If a determination is made that the processor core was not able to satisfy the memory access request within its local cache, the process passes to block 708 which depicts a determination of whether the command is a read command or is a write or read-exclusive command. If a determination is made that the command is a read command, the process passes to block 710 which illustrates the read request being broadcast to all processor cores in only the node that includes the requesting processor core. This node is the local node with respect to the requesting processor core. Thus, a read request is always broadcast first to only the local node.
Next, block 712 illustrates a determination of whether or not the processor core was able to satisfy the read request within the processor core's own local node. The read request is satisfied within the local node when a valid copy of the data is found within a cache within one of the other processor cores that are included in this processor core's local node. If a determination is made that the processor core was able to satisfy the read request within its local node, the process passes to block 714 which depicts satisfying the read request within the processor core's node. The process then passes to block 702.
Referring again to block 712, if a determination is made that the processor core was not able to satisfy the read request within its local node, the process passes to block 716 which illustrates the memory access request being broadcast to the remote nodes. Next, block 718 depicts satisfying the memory access request within a processor core that is included within one of the remote nodes. Thereafter, the process passes to block 702.
Referring again to block 708, if a determination is made that the command is either a write or a read-exclusive command, the process passes to block 716.
Referring again to block 706, if a determination is made that the processor core was able to satisfy the memory access request within its local cache, the process passes to block 720 which depicts a determination of whether or not the command is a write or read-exclusive command and the cached copy is already the only copy of the data. If a determination is made that the command is a write or read-exclusive command and the cached copy is already the only copy of the data, the process passes to block 726. Referring again to block 720, if a determination is made that this is not the case, the process passes to block 722 which illustrates broadcasting the request to all nodes.
Next, block 724 depicts waiting for a completed response. Thereafter, block 726 illustrates satisfying the request within the processor core's cache. The process then passes back to block 702.
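The flow of FIG. 7 as described above may be summarized by the following sketch, which returns the sequence of block numbers visited for a single memory access request. The sketch follows the wording of the description literally, including the test at block 720. The function name `fig7_path` and its parameters are hypothetical, and start block 700 and the return to block 702 are omitted for brevity.

```python
def fig7_path(command, hit_own_cache, hit_local_node=False, only_copy=False):
    """Return the FIG. 7 blocks visited for one memory access request.

    command        : "read", "write", or "read-exclusive"
    hit_own_cache  : outcome of the determination at block 706
    hit_local_node : outcome of the determination at block 712 (reads only)
    only_copy      : whether the cached copy is already the only copy
    """
    path = [702, 704, 706]

    if not hit_own_cache:
        path.append(708)                # read, or write/read-exclusive?
        if command == "read":
            path += [710, 712]          # broadcast to the local node only
            if hit_local_node:
                path.append(714)        # satisfied within the local node
                return path
        path += [716, 718]              # broadcast to the remote nodes
        return path

    path.append(720)
    if command in ("write", "read-exclusive") and only_copy:
        path.append(726)                # satisfied without any broadcast
        return path
    path += [722, 724, 726]             # broadcast to all nodes, then wait
    return path
```

In this sketch, only a read that misses the requesting processor core's own cache ever takes the local-only path through blocks 710 and 712; every write or read-exclusive command that misses proceeds directly to block 716.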
Therefore, a need exists for a method, apparatus, and computer program product that provides an additional cache coherency protocol state that indicates that all copies of the data reside in the local node for reducing the number of unnecessarily broadcast global requests in order to conserve global communications bandwidth.