1. Field of the Invention
This invention relates generally to cache coherence in computer systems having multiple processors with caches, and more particularly to a system and method for updating a cache entry from read-only to read-write.
2. Description of the Background Art
Multiple-processor computer systems involve various processors which at the same time may each work on a separate portion of a problem or work on a different problem. FIG. 1 shows a multi-processor system, including a plurality of Central Processing Units (CPUs) or processors 102A, 102B . . . 102N, communicating with memory 104 via interconnect 106, which could be, for example, a bus or a collection of point-to-point links. Processors 102 access data from memory 104 for a read or a write. In a read operation, processor 102 receives data from memory 104 without modifying the data, while in a write operation processor 102 modifies the data transmitted to memory 104.
Each processor 102 generally has a respective cache unit 108A, 108B, . . . 108N, which is a relatively small group of high speed memory cells dedicated to that processor. A processor 102's cache 108 is usually on the processor chip itself or may be on separate chips, but is local to processor 102. Cache 108 for each processor 102 is used to hold data that was accessed recently by that processor. Since a processor 102 does not have to go through the interconnecting bus 106 and wait for the bus 106 traffic, the processor 102 can generally access data in its cache 108 faster than it can access data in the main memory 104. In a normal operation, a processor 102N first reads data from memory 104 and copies that data to the processor's own cache 108N. During subsequent accesses for the same data the processor 102N fetches the data from its own cache 108N. In effect, after the first read, data in cache 108N is the same copy of data in memory 104 except that the data is now in a high-speed local storage. Typically, cache 108N can be accessed in one or two cycles of CPU time while it takes a processor 102 15 to 50 cycles to access memory 104. A typical processor 102 runs at about 333 MHz or 3 ns (nanoseconds) per cycle, but it takes at least 60 ns or 20 cycles to access memory 104.
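The cycle figures quoted above follow from simple arithmetic. The short sketch below reproduces them; the variable names are illustrative only and the inputs are the figures given in the text.

```python
# Figures taken from the text: a 333 MHz clock and a 60 ns memory access.
clock_hz = 333e6
memory_access_ns = 60.0

# One clock period in nanoseconds (about 3 ns).
cycle_ns = 1e9 / clock_hz

# Number of processor cycles spent waiting on one memory access (about 20).
memory_access_cycles = memory_access_ns / cycle_ns

print(f"cycle time: {cycle_ns:.2f} ns")
print(f"memory access: {memory_access_cycles:.1f} cycles")
```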
A measure of data, typically 32, 64, 128, or 2^n bytes, brought from memory 104 to cache 108 is usually called a "cache line." The data of which a copy was brought to cache 108 and which remains in memory 104 is called a "memory line." The size of a cache line or a memory line is determined by a balance of the overhead per read/write operation versus the usual amount of data transferred from memory and cache. An efficient size for a cache line results in transfers spending about 25% of their time on overhead and 75% of their time on actual data transfer.
A particular problem with using caches is that data becomes "stale." A first processor 102A may access data in the main memory 104 and copy the data into its cache 108A. If the first processor 102A then modifies the cache line of data in its cache 108A, then at that instant the corresponding memory line becomes stale. If a second processor, 102B for example, subsequently accesses the original data in the main memory 104, the second processor 102B will not find the most current version of the data because the most current version is in the cache 108A. For each cache-line address, cache coherence guarantees that only one copy of data in cache 108 can be modified. Identical copies of a cache line may be present in multiple caches 108, and thus be read by multiple processors 102 at the same time, but only one processor 102 is allowed to write, i.e., modify, the data. After a processor 102 writes to its cache 108 that processor 102 must "invalidate" any copies of that data in other caches to notify other processors 102 that their cache lines are no longer current.
FIG. 2A shows valid cache lines D0 for caches 108A to 108N whereas FIG. 2B shows cache 108B with an updated cache line D1 and other caches 108A, 108C, and 108N with invalidated cache lines D0. The processors 102A, 102C, and 102N with invalidated cache data D0 in their respective caches 108 must fetch the updated version of cache line D1 if they want to access that data line again.
Normally and for illustrative purposes in the following discussion, cache coherence protocols are executed by processors 102 associated with their related caches. However, in other embodiments these protocols may be executed by one or more specialized and dedicated cache controllers.
There are different cache coherence management methods for permitting a processor 102 to modify its cache line in cache 108 and invalidate other cache lines. One method (related to the present invention) utilizes, for each cache line, a respective "shared list" representing cache-line correspondences by "double-links" where each cache has a forward pointer pointing to the next cache entry in the list and a backward pointer pointing to the previous cache entry in the list. Memory 104 has a pointer which always points to the head of the list.
FIG. 3 shows a linked list 300 of caches 108A . . . 108N with the associated memory 104. Memory 104 has a pointer which always points to the head (cache 108A) of the list while the forward pointers Af, Bf, and Cf of caches 108A, 108B, and 108C respectively point forward to the succeeding caches 108B, 108C, and 108D (not shown). Similarly, backward pointers Nb, Cb, and Bb of caches 108N, 108C, and 108B respectively point backward to the preceding caches. Because each cache unit 108 is associated with a respective processor 102, a linked list representation of cache 108 is also understood as a linked list representation of processors 102.
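The doubly linked sharing list of FIG. 3 can be sketched as follows. This is a minimal illustration only; the class name `CacheEntry` and the helper functions are hypothetical and do not come from the described system.

```python
class CacheEntry:
    """One cache's entry in the sharing list for a single cache line."""

    def __init__(self, name):
        self.name = name
        self.forward = None   # next entry, toward the tail
        self.backward = None  # previous entry, toward the head


def link(names):
    """Build a doubly linked list from head-first names; return (head, tail)."""
    entries = [CacheEntry(n) for n in names]
    for prev, nxt in zip(entries, entries[1:]):
        prev.forward = nxt
        nxt.backward = prev
    return entries[0], entries[-1]


def walk(start, direction):
    """Follow 'forward' or 'backward' pointers and collect entry names."""
    names, node = [], start
    while node is not None:
        names.append(node.name)
        node = getattr(node, direction)
    return names


# Memory keeps a single pointer to the head of the list (cache 108A here).
memory_head_pointer, tail = link(["108A", "108B", "108C", "108N"])
print(walk(memory_head_pointer, "forward"))
print(walk(tail, "backward"))
```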
There are typically two types of cache sharing list. The first type of list is the read-only (sometimes called "fresh") list of caches for which none of the processors 102 has permission to modify the data. The second type of list is a read-write (sometimes called "owned") list of caches for which the head-of-list processor 102 may have permission to write to its cache 108. A list is considered "stable" after an entry has been completely entered into or completely deleted from the list. Each of the stable list states is defined by the state of the memory and the states of the entries in the shared list. Relevant states of memory include HOME, FRESH, and STALE. HOME indicates no shared list exists, FRESH indicates a read-only shared list, and STALE indicates the shared list is a read-write list and data in the list can be modified. A processor 102 must get authorization to write to or read from memory. A list entry always enters the list as the list head, and the action of entering is referred to as "prepending" to the list. If a list is FRESH (the data is the same as in memory), the entry that becomes the newly created head receives data from memory; otherwise it receives data from the previous list head. In a read-write list, only the head is allowed to modify (or write to) its own cache line and, after the head has written the data, the head must invalidate the other stale copies of the shared list. In one embodiment, invalidation is done by purging the pending invalidated entries of the shared list.
FIGS. 4A-4F illustrate how the two types of list are created and grown. Each of FIGS. 4A-4F includes a before list and an after list, together with the states of the list and of memory 104. In FIG. 4A, initially memory 104 is in the HOME state, indicating there is no cache shared list. Processor 102A requests permission to read a cache line. Since this is a read request, memory 104 changes from the HOME state to the FRESH state, and the resulting after list 402AR is a read-only list with one entry 108A. Cache 108A receives data from memory 104 because cache 108A accesses data that was previously uncached. This starts the read-only list 402AR.
In FIG. 4B processor 102B requests a read permission to enter the read-only list 402B, which is the same list as 402AR of FIG. 4A. Cache 108B then becomes the head of the list 402BR receiving data line from head 108A. The list 402BR is still a read-only list since both entries of the list have asked for read-only permission, and therefore the memory state remains FRESH.
In FIG. 4C, memory 104 is initially in the HOME state and processor 102A requests a read-write permission. Cache 108A then becomes the head of the list 402CR. Because a read-write permission was requested, list 402CR is a read-write list. As soon as memory 104 grants a read-write permission, memory 104 changes from the HOME state to the STALE state.
In FIG. 4D processor 102B requests a read permission to enter a read-write list 402D. Cache 108B becomes the head of the list 402DR. Since memory 104 is initially in the STALE state, the resulting list 402DR is a read-write list and memory 104 remains in the STALE state.
In FIG. 4E, the initial list 402E is read-only with memory 104 in the FRESH state, and processor 102B requests a write permission. Cache 108B then becomes the head of the list 402ER. Since processor 102B asked for a write permission, memory 104 changes state from FRESH to STALE, and list 402ER is a read-write list.
In FIG. 4F the list 402F is a read-write list and processor 102B requests a write permission. Since list 402F is read-write, list 402FR is also read-write, and memory 104 remains in the STALE state. In a read-write list, only the head is allowed to modify (or write to) its own cache line, and, after the head has written the data, the head must invalidate stale copies of the shared list.
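The memory-state changes walked through for FIGS. 4A-4F reduce to a small transition rule, sketched below. The function name and the string encoding of states and requests are assumptions made for illustration; only the HOME, FRESH, and STALE states and the six transitions come from the text.

```python
HOME, FRESH, STALE = "HOME", "FRESH", "STALE"


def memory_state_after_prepend(state, request):
    """Return the memory state after a cache prepends itself to the list.

    `request` is "read" (read-only permission) or "write" (read-write
    permission). A write request, or a list that is already read-write,
    leaves memory STALE; otherwise the list is read-only and memory is FRESH.
    """
    if state not in (HOME, FRESH, STALE):
        raise ValueError(f"unknown memory state: {state}")
    if request not in ("read", "write"):
        raise ValueError(f"unknown request: {request}")
    if request == "write" or state == STALE:
        return STALE
    return FRESH


# The six cases of FIGS. 4A-4F, in order.
cases = [
    (HOME, "read"), (FRESH, "read"), (HOME, "write"),
    (STALE, "read"), (FRESH, "write"), (STALE, "write"),
]
for before, request in cases:
    print(before, request, "->", memory_state_after_prepend(before, request))
```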
In one prior art embodiment, invalidation is done by purging the stale entries of the shared list. Consequently, this invalidation-by-deletion results in a stable read-write list having only one read-write entry.
FIG. 4G shows the original list 402G and the resulting list 402GR in which only the head 108N remains while other list entries have been deleted after an invalidation-by-deletion.
FIG. 4H illustrates an invalidation-by-deletion. In FIG. 4H list 402H is, for example, a read-write list having one cache entry 108A. Cache 108B enters list 402H as a read-write entry, and after writing, cache 108A is deleted, resulting in list 402HR1 with only one cache 108B. And in another example, cache 108C enters list 402HR1 with a read-write request. After cache 108C is modified, cache 108B is deleted, resulting in the read-write list 402HR2 with only one entry 108C. Thus, a read-write entry entering a one-entry read-write list always results in a one-entry read-write list.
FIG. 4I illustrates how a list can have a read-write entry at the tail, where that tail may be the only entry in the list. In FIG. 4I cache 108B enters read-write list 402I as a read-only entry, and the resulting list 402IR1 has two cache entries 108A and 108B with cache 108A being at the tail in the read-write state and cache 108B at the head in the read-only state. As long as read-only caches 108 enter the list, the list keeps growing, as in list 402IR2, and the read-write cache 108A remains at the tail in the read-write state. In list 402IR2, caches 108N, 108C, and 108B are in the read-only state. However, when a cache 108 enters the list as a read-write entry, the resulting list becomes a one-entry list, like list 402IR3, because processor 102M has invalidated stale cache entries (108A-108N) by deleting them.
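The growth and collapse of the list in FIG. 4I can be sketched as below. This is an illustration under simplifying assumptions: the sharing list is modeled as a head-first Python list of cache names, and the function `enter` and its mode strings are hypothetical.

```python
def enter(sharing_list, cache, mode):
    """Prepend `cache` to a head-first sharing list.

    Returns (new_list, deletions). A read-only entry simply becomes the new
    head. A read-write entry becomes the head and then purges every older
    entry one at a time (invalidation-by-deletion).
    """
    if mode == "read-only":
        return [cache] + list(sharing_list), 0
    if mode != "read-write":
        raise ValueError(f"unknown mode: {mode}")
    remaining = list(sharing_list)
    deletions = 0
    while remaining:          # sequential purge, head to tail
        remaining.pop(0)
        deletions += 1
    return [cache], deletions


list_402I = ["108A"]                                   # read-write tail only
list_402IR1, _ = enter(list_402I, "108B", "read-only")
grown, _ = enter(list_402IR1, "108C", "read-only")
list_402IR2, _ = enter(grown, "108N", "read-only")
list_402IR3, deleted = enter(list_402IR2, "108M", "read-write")
print(list_402IR2, list_402IR3, deleted)
```

The number of deletions grows with the length of the list, which is the cost that sequential deletion imposes on long lists.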
The deleting scheme illustrated in FIGS. 4G-4I is inherently inefficient because sequentially deleting every entry from the head to the tail to result in list 402IR3 takes a long time. Long lists with entries being deleted can seriously degrade system performance. Another inefficiency also arises, for example, in list 402IR2 when processor 102A seeks to write to its cache 108A. In one prior art embodiment, cache 108A leaves list 402IR2 and re-enters the list as a new entry. Leaving the list poses inefficiency as the list needs to keep track of members entering and leaving. Entering the list as a new entry also suffers the inefficiency of sequential deletion because the processor 102A has to invalidate or delete other stale entries of the list.
In light of the deficiencies of the prior art, there is a need for a system that can minimize the time to invalidate stale cache copies.
The present invention, advantageous over prior art solutions, improves invalidation times by a factor of nearly two or four as compared to the SCI Standard's sequential invalidation.
The present invention provides a system and a method for updating a cache-sharing list from read-only to read-write in accordance with standard cache coherence protocols. In one embodiment, an updating request is associated with a tail cache entry, and the processor associated with the cache requests the main memory to update the processor's cache line. In response to the request, a copy of the cache line is created and made the head of the list. Thus, there are two cache entry copies in two places in the list. The processor then updates its cache and allows concurrent invalidation of stale cache entries in both directions from head-to-tail (forward) and from tail-to-head (backward).
Two preferred forward-invalidation methods are implemented. In the first (slow) method the list head informs the next-list entry that the next-list entry is to be invalidated. In response the next-list entry informs the head of the identity of the next-list entry's successor, which in turn needs to be invalidated. The next-list entry is then invalidated and the next-list entry's successor becomes the next-list entry (with respect to the list head) and is subjected to the same invalidation process as the previous next-list entry. The process continues until all entries in the list have been invalidated.
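The message exchange of the slow method can be sketched as follows. This is an illustrative model only: the list is a head-first Python list of names, the message labels are invented for the sketch, and the head is assumed to hold the valid copy and is not itself invalidated.

```python
def slow_forward_invalidate(entries):
    """Simulate slow forward invalidation on a head-first list of names.

    For each stale entry the head sends one request and waits for a reply
    naming that entry's successor, so each invalidation costs a round trip.
    Returns (invalidated, messages); messages are (sender, receiver, label).
    """
    head, stale = entries[0], list(entries[1:])
    invalidated, messages = [], []
    while stale:
        target = stale.pop(0)
        successor = stale[0] if stale else None
        messages.append((head, target, "invalidate"))
        messages.append((target, head, f"successor={successor}"))
        invalidated.append(target)
    return invalidated, messages


invalidated, messages = slow_forward_invalidate(["H", "B", "C", "D"])
for m in messages:
    print(m)
```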
In the second (fast) method of forward invalidation, the head first informs the next-list entry that every stale entry in the list is to be invalidated. In response the next-list entry informs the head that a completion signal for the invalidation will be returned to the head. The next-list entry is then invalidated and the invalidation signal is forwarded towards the tail, where the last invalidated entry sends to the head a confirmation signal that all invalidations have been completed.
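The fast method can be sketched in the same style. As before, this is a simplified model under stated assumptions: the message labels are invented, the list is a head-first Python list, and the head itself is not invalidated.

```python
def fast_forward_invalidate(entries):
    """Simulate fast forward invalidation on a head-first list of names.

    The head sends a single request; each stale entry invalidates itself and
    passes the request on, and the last entry confirms completion to the head.
    Returns (invalidated, messages); messages are (sender, receiver, label).
    """
    head, stale = entries[0], list(entries[1:])
    invalidated, messages = [], []
    if not stale:
        return invalidated, messages
    messages.append((head, stale[0], "invalidate-all"))
    messages.append((stale[0], head, "completion-will-follow"))
    for i, entry in enumerate(stale):
        invalidated.append(entry)
        if i + 1 < len(stale):
            messages.append((entry, stale[i + 1], "invalidate-all"))
        else:
            messages.append((entry, head, "all-invalidated"))
    return invalidated, messages


invalidated, messages = fast_forward_invalidate(["H", "B", "C", "D"])
for m in messages:
    print(m)
```

In this model the head exchanges messages only with the first and last stale entries, rather than once per entry.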
In backward invalidation the tail informs the preceding entry that the preceding entry is about to become a tail, and that upon becoming the tail the preceding entry itself needs to be invalidated. The tail is then invalidated and the preceding entry becomes the tail and is subjected to the invalidation process, and so on until all invalidations have been completed.
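Backward invalidation can be sketched as a loop that repeatedly retires the current tail. The sketch assumes that the head holds the valid copy and is never invalidated, and it sends no message to the head; how the protocol terminates at the head is not specified in this passage, so that stopping rule is an assumption. The message label is invented.

```python
def backward_invalidate(entries):
    """Simulate backward invalidation on a head-first list of names.

    The tail tells its predecessor that it will become the tail and must
    then be invalidated; the tail is invalidated and the step repeats.
    Returns (remaining, invalidated, messages).
    """
    remaining = list(entries)
    invalidated, messages = [], []
    while len(remaining) > 1:
        tail = remaining.pop()
        predecessor = remaining[-1]
        if len(remaining) > 1:  # assumption: the head is not notified
            messages.append((tail, predecessor, "you-are-next-tail"))
        invalidated.append(tail)
    return remaining, invalidated, messages


remaining, invalidated, messages = backward_invalidate(["H", "B", "C", "D"])
print(remaining, invalidated, messages)
```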
Because fast forward invalidation poses a potential deadlock, this method is used only when it is free of deadlock; when a potential deadlock is detected, the slow method is used instead, even though the slow method takes about twice as long to complete. By having fast forward invalidation and concurrent backward invalidation, the overall invalidation time is improved by about four times as compared to slow forward invalidation alone.
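The factors of about two and about four can be reproduced with a rough timing model. The model below is an assumption made for illustration, not a measurement: it charges two message times per entry for the slow method (request plus reply), one message time per entry for the fast and backward methods, and lets the forward and backward fronts proceed concurrently until every stale entry is covered. The function names are hypothetical.

```python
def choose_forward_method(deadlock_possible):
    """Use the fast method only when no potential deadlock is detected."""
    return "slow" if deadlock_possible else "fast"


def invalidation_time(n_stale, forward, backward=False, hop=1.0):
    """Estimated time to invalidate n_stale entries (assumed cost model)."""
    forward_cost = 2 * hop if forward == "slow" else hop  # time per entry
    if not backward:
        return n_stale * forward_cost
    backward_cost = hop
    # Both fronts run concurrently; together they clear
    # (1/forward_cost + 1/backward_cost) entries per unit time.
    return n_stale / (1 / forward_cost + 1 / backward_cost)


n = 100
slow_only = invalidation_time(n, "slow")
fast_only = invalidation_time(n, choose_forward_method(False))
fast_both = invalidation_time(n, "fast", backward=True)
print(slow_only / fast_only, slow_only / fast_both)
```

Under these assumed costs the fast method alone halves the time of the slow method, and adding concurrent backward invalidation halves it again.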