1. Field of the Invention
The present invention relates generally to modular multiprocessor computing systems and, in particular, to coherent cache control in such systems, and even more particularly to controlling the purging of cache memories to enhance the system's performance.
2. Background Information
Modular multiprocessor computing systems typically have a number of multiprocessor nodes interconnected via a switching fabric. Each node includes multiple processors, memory arrays, input/output access hardware, and the interconnecting switching mechanisms. Cache memory, also referred to as "cache," is routinely found in these systems.
These modular multiprocessor systems share many resources, including memory. In such large systems, however, cache is usually associated with a particular processor and holds the data that that processor is likely to access in the near future. Input/output ports may also have such private caches. In these large systems the processor and/or the input/output ports may update the contents of their private caches without updating shared memory, and a cache coherence protocol is often utilized to maintain the consistency and accuracy of the data. In such systems the control of the cache becomes more complex, and more particularly the decisions of when to purge or not to purge cached data to memory can significantly affect the efficiency and/or speed of the entire processing system.
In some systems the organization of the cache is in blocks, lines or clusters of lines. The data stored in a cache is exclusive to one owner, where that owner has direct control of the data. In operation, ownership of a data segment may be claimed by a processor with an up-to-date copy of the data by executing a change-to-dirty (XTD or CTD) operation, which grants ownership of the data cache block to the processor. In this context a "dirty" state of data denotes the unique most up-to-date copy of the data in the system. Upon obtaining ownership, invalidations are sent to all other copies in the system. That owner may then change the contents of the data. This occurrence presents a performance issue for multiprocessor systems that is addressed by this invention. The issue may be stated as a question: from the point of view of improving the system's performance, when should the modified data be purged from the cache to shared memory, from which any processor can access that up-to-date data? In particular, latency (the time it takes to access data or information) is the system performance parameter being improved by the present invention, but other such performance parameters may be improved by advantageous use of the present invention.
If a processor needs to access data, it is most efficient to have the data stored in its local cache, but if a remote processor needs to access that same data, it would be more efficient to have the local processor purge the up-to-date data to shared memory, from which the remote processor can access it. If the data is not purged, the remote processor must find the cache in which the data resides before obtaining it.
Some shared memory multiprocessor systems send updates to memory as changes occur (commonly referred to as "write-through" cache). This slows the entire processing system because of the large number of messages sent.
Cache purging control has evolved to the point that the data remains in the owned cache until the owner needs the space for another data segment. At that time the cache contents are "victimized" to shared memory. This has the undesirable effect of leaving the data in the cache so that it takes a long time (latency) for a remote processor to read the data. This is called a read-dirty (rddirty) event or operation. This is inefficient because the read operation must go to the memory, only to find that the data segment is not there, then to another processor, and then return to the original requesting processor: a three hop process. If the data segment had been purged to shared memory, the read operation would entail a miss in a remote cache followed by a load directly from the shared memory: a two hop process. However, if the local owner processor subsequently needed to update the data segment, purging the data segment would force the processor to regain ownership (called a change-to-dirty) and would therefore slow down the system. If the owner processor had no need to subsequently update the data segment, then it is clear that replacing a three hop process with a two hop process would increase the speed and thus the efficiency of such a system. The present invention is directed to resolving this dilemma in an adaptive manner that preserves a balance between purging and not purging cache segments that enhances system performance.
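The trade-off described above can be illustrated with a toy cost model. This is only a sketch: the two and three hop figures come from the text, while the function names and the cost of a change-to-dirty (ctd_hops) are hypothetical assumptions, since the text does not state that cost.

```python
# Hop counts taken from the text above.
READ_DIRTY_HOPS = 3    # memory -> owning processor -> back to requester
MEMORY_READ_HOPS = 2   # remote cache miss -> load from shared memory


def remote_read_hops(purged: bool) -> int:
    """Hops needed for a remote read, depending on whether the block was purged."""
    return MEMORY_READ_HOPS if purged else READ_DIRTY_HOPS


def purge_pays_off(remote_reads: int, owner_rewrites: int, ctd_hops: int) -> bool:
    """Return True if purging saves more hops than it costs.

    ctd_hops is an assumed cost for each change-to-dirty the owner must
    perform to regain ownership after a purge; it is not given in the text.
    """
    saved = remote_reads * (READ_DIRTY_HOPS - MEMORY_READ_HOPS)
    added = owner_rewrites * ctd_hops
    return saved > added
```

Under these assumptions, a block that is read remotely many times and never rewritten by its owner favors purging, while a block the owner keeps rewriting does not.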
The limitations of the prior art are overcome, and the objects and advantages of the present invention are achieved, by a self-adapting system and method for determining and purging cache candidates.
The present invention tracks and counts events on local cache data by a local processor and by remote processors. In preferred embodiments the cache data may be characterized as the entire data segment, a line of bytes, a cluster of lines, or a block; any amount of data may be purged or not purged advantageously using the present invention. The present invention recognizes that the system performance is enhanced when the local processor accesses its local cache memory (by write or read hits), and that system performance is degraded if the local cache data was purged to memory, whereupon the local processor must regain ownership of the local cache data from that same memory using a change-to-dirty operation. Similarly, if the local cache data is purged, system performance is enhanced when a remote miss allows the remote processor to load the data directly from memory without having to access the local processor; but the system performance is degraded when the remote processor must execute a read-dirty from the local processor cache. The local cache data adaptively becomes a candidate for purging or is purged or not purged depending upon the history, tracking and comparing of these performance enhancing and degrading events as described in the present invention.
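The four kinds of events named above can be tracked with a small per-block history. The following is a minimal sketch, not the claimed mechanism: the names Event, EventHistory and purge_evidence are hypothetical, and the scoring rule (read-dirties and remote misses argue for purging, local hits and change-to-dirties argue against it) is one assumed way of comparing the events.

```python
from collections import Counter
from enum import Enum


class Event(Enum):
    LOCAL_HIT = "local_hit"              # local read or write hit
    CHANGE_TO_DIRTY = "change_to_dirty"  # owner regains ownership after a purge
    REMOTE_MISS = "remote_miss"          # remote processor loads from memory
    READ_DIRTY = "read_dirty"            # remote processor reads the owner's cache


# Classification taken from the paragraph above.
ENHANCING = {Event.LOCAL_HIT, Event.REMOTE_MISS}
DEGRADING = {Event.CHANGE_TO_DIRTY, Event.READ_DIRTY}


class EventHistory:
    """Counts events observed on one unit of local cache data."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, event: Event) -> None:
        self.counts[event] += 1

    def total(self, kinds: set) -> int:
        return sum(self.counts[k] for k in kinds)

    def purge_evidence(self) -> int:
        """Assumed score: positive values suggest purging would have helped."""
        for_purge = self.counts[Event.READ_DIRTY] + self.counts[Event.REMOTE_MISS]
        against = self.counts[Event.LOCAL_HIT] + self.counts[Event.CHANGE_TO_DIRTY]
        return for_purge - against
```

The unit being tracked could be a block, a line or a cluster of lines; the sketch does not depend on the granularity.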
In some systems the present invention uses the eviction of cache data from one cache level or another as a trigger to cause the block to be purged. Such cache levels are known to practitioners in the art.
In a further aspect of the invention a threshold may be formed against which the count of the enhancing and degrading events is compared. A count total that meets or exceeds the threshold, herein defined as triggering the threshold, causes the cache segment to become a candidate for purging. However, if a degrading event occurs after the purging actually occurs but before an enhancing event occurs, or if a write-hit occurs after a fake time out occurs, the count may be reset to zero, the cache data ceases to be a candidate for purging, and the counting begins anew. A fake time out is triggered when a degrading event occurs but the threshold is not triggered.
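One possible reading of the threshold and reset rules above is sketched below. The class and method names are hypothetical. The sketch assumes that each read-dirty increments the count, that a read-dirty below the threshold raises the fake time out, and that reaching the threshold clears any pending fake time out; the text leaves these details open.

```python
class PurgeCandidateTracker:
    """Threshold logic for one unit of cache data (illustrative only)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._reset()

    def _reset(self) -> None:
        # Count returns to zero and the data stops being a purge candidate.
        self.count = 0
        self.candidate = False
        self.purged = False
        self.fake_timeout = False
        self.awaiting_payoff = False

    def on_read_dirty(self) -> None:
        # Assumed: a read-dirty is the degrading event that is counted.
        self.count += 1
        if self.count >= self.threshold:
            self.candidate = True       # threshold triggered
            self.fake_timeout = False
        else:
            self.fake_timeout = True    # degrading event, threshold not triggered

    def on_write_hit(self) -> None:
        # A write-hit after a fake time out resets the count.
        if self.fake_timeout:
            self._reset()

    def purge(self) -> bool:
        # Purge only data that has become a candidate.
        if self.candidate and not self.purged:
            self.purged = True
            self.awaiting_payoff = True
            return True
        return False

    def on_remote_miss(self) -> None:
        # Enhancing event after a purge: the purge paid off.
        if self.purged:
            self.awaiting_payoff = False

    def on_change_to_dirty(self) -> None:
        # Degrading event after a purge but before any enhancing event.
        if self.purged and self.awaiting_payoff:
            self._reset()
```

For example, with a threshold of two, a single read-dirty followed by a write-hit returns the count to zero, whereas two read-dirties make the data a candidate for purging.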
Read dirty means that a remote processor must access a local processor's cache. Change to dirty means that a local processor which has a copy of the data, but not ownership of the data, wants to write the data and needs to obtain ownership of the data. Remote miss means that the remote processor loads from memory. Victim means that the local processor stores cache data to memory. Purge means that the local processor stores cached data to memory, but keeps a read-only copy in its cache.
In another example, the threshold count value may be determined heuristically or calculated from the known characteristics of the system. Of course the threshold may be changed in an adaptive manner as described herein for the counter and/or to suit particular conditions.
In yet another example, a time out delay may be employed after the threshold is triggered. Here the local cache is purged after the time out expires. In one instance the delay may embody counting read-dirties or misses on the local cache, or counting clock cycles or virtually any other timing standard.
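The time out delay can be sketched as a simple countdown. This is an illustration under assumptions: the name PurgeDelay and its methods are hypothetical, and a "tick" stands for whichever unit is counted (a read-dirty, a miss on the local cache, or a clock cycle).

```python
from typing import Optional


class PurgeDelay:
    """Countdown that is armed when the threshold is triggered."""

    def __init__(self, delay_ticks: int) -> None:
        self.delay_ticks = delay_ticks
        self.remaining: Optional[int] = None  # None means the delay is not armed

    def arm(self) -> None:
        """Start the delay, e.g. when the threshold is triggered."""
        self.remaining = self.delay_ticks

    def tick(self) -> bool:
        """Count one unit; return True only on the tick at which the delay expires."""
        if self.remaining is None:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.remaining = None   # disarm; the caller purges now
            return True
        return False
```

A caller would arm the delay when the threshold is triggered and purge the local cache data on the tick that returns True.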
Also, the present invention encompasses cases where the local cache includes blocks, individually addressed cache lines, preferably of sixty-four bytes, and/or clusters of cache lines.