1. Field of the Invention
This invention relates to the management of cache memory in data processing systems employing directory-based cache coherence techniques. More particularly, the invention is directed to the reduction of memory access latency associated with cache coherence directory misses on external caches.
2. Description of the Prior Art
By way of background, modern computer systems utilize high-speed cache memory coupled with an associated cache controller (collectively referred to as a “cache”) as a bridge between processors and the relatively slow main memory that holds processor data. As shown in FIG. 1, a cache typically resides in close proximity to the processor it serves, with one or more cache levels often being situated on the processor die itself. The function of a cache is to temporarily store selected subsets of main memory, particularly memory blocks that are frequently accessed. The cached information is thus available to quickly satisfy memory accesses without the latency associated with access requests to main memory.
In multiprocessor systems, such as that shown in FIG. 2, each processor typically has its own cache, and each cache may independently store a copy of the same memory block from a main memory shared by all processors via a common system bus. This situation requires that a cache coherence scheme be used in order to ensure that data consistency is maintained between the several processors. As is well known in the art, a “bus-snooping” protocol is commonly used for this purpose. Bus snooping is premised on the idea that the bus is the broadcast medium for all processor-initiated read and write requests to memory. Each processor's cache is a bus agent that can thus listen or “snoop” on the bus in order to apprise itself of bus-related actions taken by other caches with respect to shared memory blocks. When a processor wants to update a memory block and a memory request to write the block is placed on the bus by its cache, all other caches holding the same memory block will know to invalidate their copy. The cache associated with the block-writing processor will now have the only valid copy of the memory block in the system (until the block is written to main memory). When a processor requests a memory block and its cache places a read request on the bus, another cache holding a valid copy of the requested block can respond. If the main memory has the only valid copy, the request will be satisfied from this memory. If a processor cache has the only valid copy, the request must be satisfied by that cache.
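By way of illustration only, the snooping behavior described above can be sketched as a toy model. The class and method names (`SnoopingCache`, `Bus`, `snoop_write`) are hypothetical and do not appear in the figures. The sketch also assumes, for simplicity, that a cache supplying a modified block writes it back to main memory at the same time.

```python
class SnoopingCache:
    """Toy cache that acts as a bus agent (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.lines = {}     # address -> value, valid copies only
        self.dirty = set()  # addresses for which this cache holds the only valid copy

    def snoop_write(self, address):
        # Another cache placed a write request on the bus: invalidate our copy.
        self.lines.pop(address, None)
        self.dirty.discard(address)


class Bus:
    """Broadcast medium seen by every attached cache."""

    def __init__(self, memory):
        self.memory = memory  # address -> value
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def write(self, writer, address, value):
        # All other caches snoop the write and invalidate their copies.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)
        writer.lines[address] = value
        writer.dirty.add(address)  # main memory is stale until write-back

    def read(self, reader, address):
        if address in reader.lines:
            return reader.lines[address]
        # A cache holding the only valid copy must satisfy the request.
        for cache in self.caches:
            if cache is not reader and address in cache.dirty:
                value = cache.lines[address]
                self.memory[address] = value  # assumed write-back on supply
                cache.dirty.discard(address)
                reader.lines[address] = value
                return value
        # Otherwise main memory holds a valid copy.
        value = self.memory[address]
        reader.lines[address] = value
        return value
```

In this sketch a write by one cache removes the block from every other cache, and a later read by another cache is served by the writer rather than by the stale main memory.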
In larger-scale multiprocessor systems, such as that shown in FIG. 3, the main memory of the system is often distributed among plural processing nodes that are interconnected by a network. Each node typically comprises a small-scale multiprocessor system as described above relative to FIG. 2 (i.e., several processors accessing a main memory over a shared bus). The local memory of each node provides a portion of the overall system memory. A processor at any given node can access its own local memory as well as the memories of other nodes. For a given node, the memory at any other node in the system is typically referred to as remote memory or external memory.
A distributed directory-based cache coherence scheme is commonly used to maintain coherence between the caches of different nodes, all of which could theoretically hold a copy of the same memory block. Each node maintains a cache coherence directory to keep track of which processors in the system have cached memory blocks from that node's local memory. Each directory entry typically contains a tag corresponding to the address of a given memory block, identifying information for locating all processors that are caching the block, and a status field indicating whether the cached copies are valid. A node's directory information is used to evaluate read and write requests pertaining to the node's memory blocks and to send out coherency messages to all caches that maintain copies. When a processor in the system updates a shared memory block, the directory having jurisdiction over the memory block is consulted to determine which caches hold copies of the block. Before the write operation can proceed, invalidation messages are sent to the identified caches and invalidation acknowledgements must be returned to verify that all cached copies have been invalidated. In similar fashion, when a processor requests read access to a shared memory block, the directory having jurisdiction over the block is consulted to identify the location and status of all cached copies. Based on the information in the directory, the requested block can be provided to the requester from one of the caches holding a valid copy, or from the main memory of the node that stores the block.
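The directory entry layout and the write path described above can be sketched as follows. This is a minimal model under stated assumptions: the names `DirectoryEntry`, `CoherenceDirectory`, `record_read`, `handle_write`, and the `send_invalidate` callback are hypothetical, and sharers are tracked per node rather than per processor for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    tag: int                                    # address of the memory block
    sharers: set = field(default_factory=set)   # who is caching the block
    valid: bool = False                         # whether the cached copies are valid


class CoherenceDirectory:
    """Directory for the memory blocks of one node (hypothetical model)."""

    def __init__(self):
        self.entries = {}

    def lookup(self, address):
        entry = self.entries.get(address)
        if entry is None or not entry.valid:
            return None  # no valid cached copy is recorded
        return entry

    def record_read(self, address, node):
        entry = self.entries.setdefault(address, DirectoryEntry(tag=address))
        entry.sharers.add(node)
        entry.valid = True

    def handle_write(self, address, writer, send_invalidate):
        """Invalidate every other cached copy before the write proceeds.

        send_invalidate(node, address) is an assumed callback that returns
        True once the node has acknowledged the invalidation.
        """
        entry = self.lookup(address)
        targets = sorted(entry.sharers - {writer}) if entry else []
        acks = [send_invalidate(node, address) for node in targets]
        if not all(acks):
            raise RuntimeError("missing invalidation acknowledgement")
        # The writer now holds the only valid cached copy.
        self.entries[address] = DirectoryEntry(
            tag=address, sharers={writer}, valid=True
        )
        return targets
```

The write does not complete until every acknowledgement has been collected, which mirrors the ordering requirement stated in the paragraph above.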
Within each node, the job of managing the cache coherence directory and coordinating the exchange of coherency messages is performed by an intelligent processing agent known as a “coherence controller.” As shown in FIG. 3, each coherence controller is connected so that it can communicate coherency messages with its peers on other nodes by way of the system interconnection network. Each coherence controller also sits as a snooping agent on the local memory bus of its host node. The coherence controllers can thus keep track of all external and local caching of memory blocks under their jurisdiction.
Because cache coherence directories are sometimes large, and are usually stored in relatively low-speed memory, it is common for coherence controllers to implement a high-speed directory cache in order to temporarily store subsets of relevant directory entries. This can greatly reduce the latency associated with directory lookups. In order to populate the directory cache, a coherence controller will perform prefetches (speculative lookups) of directory entries prior to receiving actual lookup requests for particular memory blocks. Conventional algorithms based on principles of spatial locality can be used to select optimal prefetch candidates. For example, following a directory lookup of a particular memory block requested by some processor as part of a read or write operation, the caching algorithm may attempt to prefetch into the directory cache some number of additional directory entries corresponding to memory blocks whose addresses are proximate to that of the requested block.
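A spatial-locality prefetch of the kind just described might look like the sketch below. The block size of 64 bytes, the prefetch depth of two entries, and the names `prefetch_candidates` and `DirectoryCache` are assumptions chosen for illustration and are not taken from any particular coherence controller.

```python
def prefetch_candidates(address, block_size=64, count=2):
    """Return the addresses of the next `count` blocks after the requested one."""
    base = address - (address % block_size)  # align to the start of the block
    return [base + i * block_size for i in range(1, count + 1)]


class DirectoryCache:
    """Small fast cache in front of a slow directory (hypothetical model)."""

    def __init__(self, directory):
        self.directory = directory  # address -> entry, stands in for slow memory
        self.cached = {}            # address -> entry, the high-speed subset

    def lookup(self, address):
        if address in self.cached:
            entry = self.cached[address]
        else:
            entry = self.directory.get(address)
            if entry is not None:
                self.cached[address] = entry
        # Speculatively fetch entries for neighboring blocks.
        for candidate in prefetch_candidates(address):
            neighbor = self.directory.get(candidate)
            if neighbor is not None:
                self.cached[candidate] = neighbor
        return entry
```

A lookup for one block thus leaves the entries of adjacent blocks in the fast cache, provided such entries exist in the directory.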
One issue that arises when using a directory prefetching scheme is that a prefetch operation may result in a directory “miss” on the candidate directory entry. A directory miss signifies that a memory block associated with the prefetch attempt is not cached anywhere in the system outside of the local node, i.e., there is no copy present in any external cache. In that case, there will be either no directory entry for the memory block, or a directory entry will exist but will be marked invalid. Such a directory entry will not be placed in the directory cache, because the caching algorithm is designed to discard invalid entries to make room for new cache entries, and because adding an invalid entry could displace a valid entry, which would then consume time and system resources to restore.
Subsequently, when the memory block associated with the prefetch miss is actually requested for reading or writing by a local processor, the cache coherence directory will again be accessed and a directory miss will again occur. Note that because the requester is a local processor, the second directory lookup is entirely unwarranted. The fact that there is a directory miss condition signifies there are no external cached copies of the memory block and that a local copy of the memory block can be used without having to notify other nodes. Insofar as the coherence controller is a local bus snooping agent whose snoop response must be awaited before a memory block request from a local requester can be satisfied, the local node experiences undue processing delay. Had it been known that the directory lookup would miss, the requesting processor could have obtained the memory block locally many cycles earlier, without waiting on the coherence controller.
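The redundant second lookup can be made concrete with a small counting model. The sketch assumes a conventional controller that discards prefetch misses, as described above. The names `CountingDirectory`, `ConventionalController`, `prefetch`, and `snoop_local_request` are hypothetical.

```python
class CountingDirectory:
    """Slow directory that counts how often it is accessed (hypothetical model)."""

    def __init__(self, entries):
        self.entries = entries  # address -> entry for externally cached blocks
        self.slow_lookups = 0

    def lookup(self, address):
        self.slow_lookups += 1
        return self.entries.get(address)  # None signifies a directory miss


class ConventionalController:
    """Coherence controller that keeps no record of prefetch misses."""

    def __init__(self, directory):
        self.directory = directory
        self.directory_cache = {}

    def prefetch(self, address):
        entry = self.directory.lookup(address)
        if entry is not None:
            self.directory_cache[address] = entry
        # On a miss, the result is simply discarded.

    def snoop_local_request(self, address):
        if address in self.directory_cache:
            return self.directory_cache[address]
        # Nothing was remembered, so the slow directory is consulted again.
        return self.directory.lookup(address)
```

Prefetching a block that is not cached externally and then servicing a local request for the same block costs two slow directory accesses in this model, even though the outcome of the second access was already known after the first.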
It would be desirable to provide a solution to the foregoing problem whereby such latency can be avoided, particularly since local memory requests tend to predominate at any given node, whereas requests to remote or external memories on other nodes are comparatively infrequent.