Caching is an accepted technique to increase computer performance while reducing cost. High performance computers are built with a hierarchy of storage devices, the fastest, smallest and most expensive part of the hierarchy being placed closest to the processor, with successively slower, larger and cheaper levels of storage being further away. The innermost level of the hierarchy is the processor registers, with successive levels being primary cache memory, secondary cache memory, main memory and disk. With each further level the cost per byte of the memory decreases and the access time increases. Each level of the memory hierarchy replicates a subset of the memory hierarchy below it.
Caches are effective because programs exhibit spatial and temporal locality. Temporal locality means that memory locations accessed by a processor tend to be accessed again soon. Spatial locality means that memory locations accessed by a processor tend to be near memory locations that will be accessed soon. Caches are cost effective because they limit the amount of faster, more expensive storage that must be used to obtain a certain performance point. The processor attempts to access all memory from the closest level of the hierarchy, going to the next level of the hierarchy only when the closer level does not contain the data.
Multiprocessor systems with multiple processors and shared global memory also use caches to reduce memory latency. FIG. 1 contains a logical representation of a distributed multiprocessor with caches, one per processor. The memory is physically distributed among the processors, but is addressed as one large shared memory through the network.
The presence of multiple caches permits multiple copies of the data and introduces an additional level of complexity. Data can become out-of-date in one cache because another processor has written the same data in another cache. To keep the caches coherent (see J. Archibald and J. L. Baer, An Economical Solution to the Cache Coherence Problem. In Proceedings of the 12th Annual International Symposium on Computer Architecture, pp. 355-362, IEEE, New York, June 1985.) so that all processors always use the most up-to-date version of a piece of data, cache coherency algorithms are used.
For scalable machines with distributed memory, an invalidation-based single writer protocol (see L. Censier, and P. Feautrier, A New Solution to Coherence Problems in Multicache Systems, IEEE Transactions on Computers, C(27):1112-1118, 1978.) is often used to provide coherency. Under such protocols, each cache keeps track of the state of each cache line. A cache line is either in the readable state, the writeable state or the invalid state. In the readable state, the cache line can be read by a processor, but not written. In the writeable state, it can be read and written. In the invalid state it can be neither read nor written. A cache line can be in the readable state in multiple caches simultaneously but can be cached in the writeable state by at most one cache. The cache which contains the line in the writeable state is said to be the owner of the line. The ownership information for memory is tracked by a distributed set of directories, or by a central directory. For a processor to write a cache line, all other cached copies elsewhere in the system must be invalidated. Those caches that are invalidated will subsequently take cache misses if the cache line is accessed again, fetching the most up-to-date copy of the cache line from the last writer.
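The state rules of such a protocol can be illustrated with a minimal sketch. The names below (State, Directory, read, write) are hypothetical and do not come from the cited work. The sketch assumes a central directory, and it further assumes that a read miss downgrades the current owner to the readable state, a detail the description above does not specify.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    READABLE = "readable"
    WRITEABLE = "writeable"


class Directory:
    """Hypothetical central directory tracking each line's state in each cache."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        # line address -> list of State, indexed by cache id.
        self.state = {}

    def _line(self, line):
        # Lines never seen before are invalid in every cache.
        return self.state.setdefault(line, [State.INVALID] * self.num_caches)

    def owner(self, line):
        """Return the id of the cache holding the line writeable, or None."""
        for cache_id, s in enumerate(self._line(line)):
            if s is State.WRITEABLE:
                return cache_id
        return None

    def read(self, cache_id, line):
        states = self._line(line)
        if states[cache_id] is State.INVALID:
            # Read miss. Assumption: the owner, if any, is downgraded.
            current_owner = self.owner(line)
            if current_owner is not None:
                states[current_owner] = State.READABLE
            states[cache_id] = State.READABLE

    def write(self, cache_id, line):
        """Make cache_id the single writer; return the caches invalidated."""
        states = self._line(line)
        invalidated = []
        for other in range(self.num_caches):
            if other != cache_id and states[other] is not State.INVALID:
                states[other] = State.INVALID
                invalidated.append(other)
        states[cache_id] = State.WRITEABLE
        return invalidated
```

In this sketch any number of caches may hold a line in the readable state, while a write leaves exactly one cache, the owner, holding it in the writeable state.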
In large multiprocessor systems, cache misses that interact with remote caches can take a long time, relative to processor operating speeds. It is, therefore, desirable to overlap the long miss latency with further computation. To overlap miss latency with subsequent computation, write misses are often pipelined such that the processor resumes operation before the write miss is complete (see D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, A. Gupta, J. Hennessy, M. Horowitz and M. Lam, The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63-79, March 1992., and J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum and J. Hennessy, The Stanford FLASH Multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 302-313, April 1994.). In such systems, with multiple caches and multiple processors, a write miss caused by a processor has three distinct parts that can occur partly in parallel.
1. The up-to-date data must be brought into the cache that initiated the miss, if not already present.
2. The ownership of the cache line must be transferred to the cache. The transfer of ownership can be implied in the data arriving in the writeable state.
3. All other caches with readable copies of the cache line must invalidate their copies.
From the standpoint of the cache that initiated the miss, the miss is complete when steps one and two have occurred. At that point ownership is considered to have been transferred to the new cache, the write miss is cleared, and the write miss can be retired from any write buffer, if present. The transaction is not fully complete globally, however, until all other readable copies have been invalidated. Hence, the write miss continues to go on while the processor that caused the write miss resumes operation. In this way, long write latency is partially pipelined with further computation. In addition, the processor write buffer is used more efficiently (if present), since it can retire the write earlier.
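The difference between completion at the initiating cache and global completion can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that invalidations are confirmed by per-cache acknowledgments, which is only one of the possible mechanisms.

```python
class WriteMiss:
    """One outstanding write miss, as seen by the cache that initiated it."""

    def __init__(self, sharers):
        self.data_arrived = False                    # part 1
        self.ownership_arrived = False               # part 2
        self.pending_invalidations = set(sharers)    # part 3

    def receive_data(self, writeable=True):
        self.data_arrived = True
        if writeable:
            # Ownership is implied by data arriving in the writeable state.
            self.ownership_arrived = True

    def receive_invalidation_ack(self, cache_id):
        # Assumption: each sharer acknowledges its invalidation individually.
        self.pending_invalidations.discard(cache_id)

    @property
    def locally_complete(self):
        """True once the miss can be retired from the write buffer."""
        return self.data_arrived and self.ownership_arrived

    @property
    def globally_complete(self):
        """True once no other readable copy remains anywhere."""
        return self.locally_complete and not self.pending_invalidations
```

The processor may resume as soon as locally_complete holds, while globally_complete becomes true only after the last acknowledgment arrives.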
In order to guarantee correctness of programs that overlap write latency in the manner described above, a programmer must properly label the program synchronization, whereby all accesses to shared memory are protected by special labeled locks and barriers. Locks and barriers are forms of program synchronization. Properly labeled programs do not expect a data item to be coherent until the synchronization object that protects that object is acquired (see K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta and J. Hennessy, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp 15-26, May 1990.).
Synchronization operations are further broken down into acquire and release events. Machines that pipeline write misses in the manner described above need to ensure that, before any user synchronization object is released by a processor, all previous writes by that processor are globally complete (steps one, two and three are completed). Such machines are said to operate under the assumption of Release Consistency (see K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta and J. Hennessy, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp 15-26, May 1990.). Release consistency stipulates that a processor must not release a synchronization object until all writes between the acquire and the release have been globally completed. A write is not globally complete until all other cached copies of the data have been invalidated.
When a write is globally complete, it is visible to all processors in the system: any processor in the system reading the memory location that was written will read the new value. To implement this behavior, a synch memory operation is defined. When a processor issues a synch operation, it stalls until all its previous writes have globally completed. This guarantee is stronger than what is required by release consistency, but easier to implement. To provide user synchronization that conforms to the release consistent model, the user synchronization library routines, for example barriers and lock releases, call synch before attempting to execute the respective synchronization. The synch call does not typically stall the processor, because most writes will have globally completed by the time the processor issues the next synch. In summary, the programmer agrees to conform to the rules of release consistency to obtain better performance.
The synch operation, as described above, for a multiprocessor system with a single processor per cache, is typically implemented using one counter per processor. FIG. 2 shows a possible block-level diagram for a multiprocessor system with one processor per cache, showing one node expanded. The cache controller increments the counter when a write miss occurs and decrements it when a write miss globally completes. System operation provides for signaling to a cache controller when a miss has globally completed. In a directory based system, either the global completion can be deduced by the requesting node when all invalidation acknowledgments return (see D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, A. Gupta, J. Hennessy, M. Horowitz and M. Lam, The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.) or the global completion is detected by the home and an explicit message indicating completion is sent to the original requester (see J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum and J. Hennessy, The Stanford FLASH Multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 302-313, April 1994.). If the processor issues a synch operation, as part of a barrier or lock release for example, the synch operation will stall until the write miss counter decreases to zero.
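A minimal single-threaded sketch of the counter scheme for one processor per cache follows. All names are hypothetical. The network queue stands in for global-completion signals still in flight, and the loop in synch models the processor stalling while those signals arrive.

```python
from collections import deque


class CacheController:
    """Hypothetical controller for a cache used by exactly one processor."""

    def __init__(self):
        self.outstanding = 0      # write misses not yet globally complete
        self.network = deque()    # completion signals still in flight

    def write_miss(self, line):
        self.outstanding += 1
        self.network.append(line)

    def deliver_one_completion(self):
        # A write miss has globally completed: decrement the counter.
        line = self.network.popleft()
        self.outstanding -= 1
        return line

    def synch(self):
        """Stall until the counter is zero; return the number of stall steps."""
        stall_steps = 0
        while self.outstanding > 0:
            self.deliver_one_completion()
            stall_steps += 1
        return stall_steps


def release(controller, lock):
    # The library routine calls synch before performing the release itself.
    controller.synch()
    lock["held"] = False
```

A synch issued when no write misses are outstanding returns immediately, which corresponds to the common case in which the call does not stall the processor.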
Systems can also be built with multiple processors sharing a cache (shared-cache systems), as shown in FIG. 3. The system contains multiple nodes where each node contains multiple processors sharing a cache. In such systems it is once again desirable to implement a cache controller such that cache misses can be retired when ownership is returned, before global completion of the write miss occurs. In such systems the counter scheme used in prior art to implement a synch operation will not work (scheme is shown in FIG. 3). If a single counter per cache is used to keep track of the outstanding global misses to that cache, then the counter may not be used by any one processor to guarantee that previous writes have completed, while guaranteeing forward progress: another processor may take a cache miss and increment the counter even after the first processor stalls on the counter, such that it never goes to zero.
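The forward-progress problem with a single counter per shared cache can be illustrated with a small step-based simulation. The function name and the timing model are assumptions made for illustration: in each step one outstanding write miss globally completes, and another processor on the same cache may add new misses to the same counter.

```python
from itertools import repeat


def steps_until_synch_returns(stalled_proc_misses, other_proc_new_misses,
                              max_steps=1000):
    """Return the step at which the shared counter reaches zero, or None.

    stalled_proc_misses: counter value when the first processor issues synch.
    other_proc_new_misses: iterable giving, per step, how many new write
    misses another processor on the same cache takes.
    """
    counter = stalled_proc_misses
    new_misses = iter(other_proc_new_misses)
    for step in range(max_steps):
        if counter == 0:
            return step
        counter -= 1                    # one write miss globally completes
        counter += next(new_misses, 0)  # the other processor misses again
    return max_steps if counter == 0 else None
```

With no interference, steps_until_synch_returns(2, []) returns after the two original misses complete. With steps_until_synch_returns(2, repeat(1)) the counter never reaches zero within the simulated window, even though every write issued by the stalled processor completed long ago.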
A scheme that uses simple per-processor outstanding-miss counters will not work in shared-cache systems either, without changing the processor hit mechanism. For example, take the following case: One processor takes a write miss, incrementing its outstanding-miss counter. The data and ownership return to the cache and the write miss is retired, but global completion of the write continues on in parallel. Another processor attached to the same cache then writes a different part of the same cache line. That write will hit in the cache, since the write miss has been cleared. However, for correctness, that processor should wait at its next release point until the write miss for that cache line globally completes. But, since its outstanding-miss counter has not been incremented, it will not wait, and incorrect operation will result.
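The failing case can be reproduced with a small sketch of a shared-cache node that keeps one outstanding-miss counter per processor. The class and its methods are hypothetical. For brevity the sketch treats data and ownership as returning immediately on a miss, so the miss is retired at once while global completion remains pending.

```python
class SharedCacheNode:
    """Hypothetical node: several processors share one cache."""

    def __init__(self, num_processors):
        self.counters = [0] * num_processors
        self.writeable_lines = set()   # lines this cache currently owns
        self.globally_pending = {}     # line -> processor that took the miss

    def write(self, proc, line):
        if line in self.writeable_lines:
            # Ordinary hit mechanism: no counter is touched.
            return "hit"
        self.counters[proc] += 1
        # Simplification: data and ownership return and the miss is retired.
        self.writeable_lines.add(line)
        self.globally_pending[line] = proc
        return "miss"

    def global_completion(self, line):
        proc = self.globally_pending.pop(line)
        self.counters[proc] -= 1

    def synch_would_stall(self, proc):
        return self.counters[proc] > 0
```

If processor 0 takes the miss on a line and processor 1 then writes the same line, the second write is a hit and synch_would_stall(1) is false while the line is still in globally_pending. Processor 1 could therefore pass its release point before its write is visible to all processors.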
Prior art schemes, therefore, do not permit the pipelining of write miss latency in the presence of shared caches.
Although the problem has been described in terms of hardware multiprocessors sharing a hardware cache, the same problem exists in multiple context processors, where multiple threads of program execution again access the same shared hardware cache.
The problem also exists in shared-memory computer systems that use the virtual memory mapping hardware to maintain coherency on a virtual page basis and have multiple threads of execution accessing the main-memory page cache on each node (virtual shared-memory systems; see K. Li and P. Hudak, Memory Coherence in Shared Virtual Memory Systems. Transactions on Computers, 7(4):321-359, November 1989.). The page cache would then be managed by software.
As technology advances, it is likely that hardware shared cache systems with multiple threads of execution per cache will be built. Therefore, there is a need in the art for a cache management scheme that can hide write latency while maintaining the memory semantics of release consistency for shared cache, multiple-context and virtual shared-memory systems. The present invention provides a scheme that permits these types of systems to hide write latency by pipelining write misses.