I. Field of the Invention
The invention relates generally to caches in a multiprocessor environment and more particularly to a method for fetching lines of data from a cache that are potentially dirty.
II. Description of the Prior Art
Modern high performance stored program digital computers conventionally fetch instructions and data from main memory and store the fetched instructions and data in a cache memory. A cache is a local memory that is typically much smaller and much faster than the main memory of the computer. Virtually all high performance digital computers use a cache and even some commercially available microprocessors have local caches.
Caches were developed because it has not been possible to build extremely large memories at a reasonable cost with an access time commensurate with modern pipelined processors. It is, however, possible to build less expensive, small memories that can keep up with the processor. Since an instruction and its needed data in the cache can be immediately accessed by the processor, caches usually speed up computer performance.
Normally a processor (CP) accesses main storage (MS) data through its cache. A cache is usually organized as a 2-dimensional array, in which each array entry contains a fixed size block of MS data called a line. The directory of a cache describes the addressing information for its lines. When an instruction or data access from the CP is located in the cache via directory lookup, the access is said to hit the cache. Otherwise the access is said to miss the cache. Upon a cache miss, the cache control will generate a request to move the requested line into the cache. When a line is inserted into the cache, it may replace an existing line. A cache is normally managed with a replacement strategy such as the well known Least-Recently-Used (LRU) replacement algorithm. Depending on the cache design, the replacement of a line from the cache may require that the replaced contents be updated to MS in order to maintain consistency of the storage.
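The directory lookup, miss handling, and LRU replacement described above can be illustrated with a minimal sketch. It assumes a fully associative cache rather than the 2-dimensional array organization, and the names LRUCache, LINE_SIZE, and access are hypothetical, introduced only for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 128  # bytes per line; an assumed value for illustration


class LRUCache:
    """Fully associative cache directory with LRU replacement (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps line number -> changed flag; order runs from least to most recently used.
        self.directory = OrderedDict()

    def access(self, address, is_store=False):
        """Return (hit, castout), where castout is a replaced changed line or None."""
        line = address // LINE_SIZE
        hit = line in self.directory
        castout = None
        if hit:
            self.directory.move_to_end(line)  # mark as most recently used
        else:
            if len(self.directory) >= self.capacity:
                victim, changed = self.directory.popitem(last=False)  # evict LRU line
                if changed:
                    castout = victim  # replaced contents must be updated to MS
            self.directory[line] = False
        if is_store:
            self.directory[line] = True
        return hit, castout
```

In this sketch a replaced line is reported as a castout only if it was stored into, which corresponds to the case in which replacement requires an update to MS.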
Caches can be used in both multiprocessor and uniprocessor systems. In the type of multiprocessor (MP) system known as the tightly coupled multiprocessor system, in which several CPs, each with its own cache, share a common operating system and memory, there are additional problems since it is necessary for each processor's cache to know what has happened to lines that may be in several caches simultaneously. In a multiprocessor system where there are many CPs sharing the same main storage, each CP is required to obtain the most recently updated version of data according to architecture specifications when an access is issued. This requirement necessitates constant monitoring of data consistency among caches, often known as the cache coherence problem.
There are various types of caches in prior art multiprocessor systems. One type of cache is the store through (ST) cache as described in U.S. Pat. No. 4,142,234 for IBM System/370 Model 3033 MP. ST cache design does not interfere with the CP storing data directly to the main storage (or second level cache) in order to always update changes of data to main storage. Upon the update of a store through to main storage, appropriate cross invalidate actions may take place to invalidate possible remote copies of the stored cache line. The storage control element (SCE) maintains proper store stacks to queue the MS store requests, and standard communications between the buffer control element (BCE) and the SCE will avoid store stack overflow conditions. When the SCE store stack becomes full, the associated BCE will hold its MS stores till the condition is cleared.
Another type of cache design is the store-in cache (SIC). SICs are described in U.S. Pat. Nos. 3,735,360 to Anderson et al. and 3,771,137 to Warner et al. A SIC directory is described in detail in U.S. Pat. No. 4,394,731 to Flusche et al. in which each line in a store-in cache has its multiprocessor shareability controlled by an exclusive/read only (EX/RO) flag bit. The main difference between ST and SIC caches is that all stores in a SIC are directed to the cache itself (which may cause a cache miss if the stored line is not in the cache). It is also proposed in U.S. Pat. No. 4,503,497 that data transfers upon a miss fetch can take place through a cache to cache transfer bus (CTC) if a copy is in the remote cache. An SCE is used that contains copies of the directories in each cache. This permits cross interrogate (XI) decisions to be resolved at the SCE. Usually cache line modifications are updated to main storage only when the lines are replaced from the cache.
A cache line that is RO is valid only in a read only state. The processor can only fetch from the line. Stores into the line are prohibited. A RO cache line may be shared simultaneously among different caches.
A cache line that is EX is valid but only appears in the cache of one processor. It is not resident in any other (remote) cache. Only the (owning) processor is allowed to store into the line.
A cache line that is CH is not only valid and EX but has also been stored into. That is, the copy in main storage may not be up to date. When a CH line is replaced, a copy is sent to main storage via a castout action.
An INV cache line is a line that is invalid.
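The four line states described above, and the operations each state permits, can be summarized in a short sketch. The names LineState, can_fetch, can_store, may_be_shared, and needs_castout are hypothetical, and the expansion of CH as "changed" is an assumption rather than a statement from the text.

```python
from enum import Enum


class LineState(Enum):
    INV = "invalid"
    RO = "read only"
    EX = "exclusive"
    CH = "changed"  # assumed expansion: EX and already stored into


def can_fetch(state):
    """The processor may fetch from any valid line."""
    return state is not LineState.INV


def can_store(state):
    """Stores are allowed only into a line held exclusively (EX or CH)."""
    return state in (LineState.EX, LineState.CH)


def may_be_shared(state):
    """Only RO lines may reside in several caches simultaneously."""
    return state is LineState.RO


def needs_castout(state):
    """A CH line must be copied to main storage when it is replaced."""
    return state is LineState.CH
```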
In a typical computer system a first CP, P.sub.1, may access an instruction or data from a line in a cache. Its own cache is checked, and if the requested line is held read only (RO), the CP may issue a store request and, via the storage control element (SCE), have that line made exclusive (EX). Once the line is made exclusive, the SCE indicates to the other caches that the line is invalid, and the first cache is free to write into that line.
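The RO to EX upgrade just described can be sketched as follows, assuming the SCE keeps a copy of each cache directory. The class name SCE and the methods fetch_ro and request_ex are hypothetical, and the sketch deliberately omits the case in which a remote cache already holds the line EX.

```python
class SCE:
    """Storage control element holding a copy of each CP's cache directory."""

    def __init__(self, num_cps):
        # One dict per CP: line -> "RO" or "EX"; an absent line is INV.
        self.directories = [dict() for _ in range(num_cps)]

    def fetch_ro(self, cp, line):
        """Record that the given CP holds the line read only."""
        self.directories[cp][line] = "RO"

    def request_ex(self, cp, line):
        """Grant EX to one CP and invalidate every remote copy of the line.

        Returns the list of CPs whose copies were invalidated.
        """
        invalidated = []
        for other, directory in enumerate(self.directories):
            if other != cp and line in directory:
                del directory[line]  # cross invalidate the remote copy
                invalidated.append(other)
        self.directories[cp][line] = "EX"
        return invalidated
```

Usage under these assumptions: after two CPs fetch the same line RO, a call to request_ex for the first CP invalidates the copy held by the second and leaves the first free to store into the line.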
In the multiprocessor cache environment a problem known as the Cross-Interrogate (XI) problem occurs as a result of relatively close accesses of the same data line by different processors (CP's). For instance, if a line is modified by CP P.sub.1 other CP's may fetch a dirty copy of a line L if line L is fetched from memory before the modifications by P.sub.1 are updated to the memory.
It becomes increasingly difficult to handle the XI problem efficiently as more CP's are added to the system. For illustration purposes, in the following consider a multiprocessor system in which there are N CP's {P.sub.i | 1 ≤ i ≤ N} and a private cache C.sub.i for each P.sub.i. For purposes of the present discussion, a memory hierarchy in which shared main memory is the level immediately below the private caches is assumed.
One major problem with ST cache design is the traffic generated by all CPs in the system. However, a trend in future MP systems is the availability of high performance shared storage among all processors. An example of such fast shared storage is the shared second level cache (L2). With the provision of such high performance shared storage, it becomes attractive to implement MP systems with ST caches while still supporting more CPs. Yet another problem with ST design is the busy store handshaking with the SCE as illustrated in U.S. Pat. No. 4,142,234. In such a design the data item being stored by a CP cannot be fetched by the same CP till the CP receives acknowledgement of the store from the SCE. Such a busy handshake not only slows down the processor pipeline operation but also makes it difficult for the SCE to efficiently serialize all the stores when there are more CPs.
One known approach to the busy store handshake problem for ST design is to employ the EX/RO states from SIC design. Consider a store through cache MP environment in which, at any moment, a cache line may have any one of the three states INV, RO or EX. INV indicates invalidity. RO indicates the possibility of simultaneous access of different copies of the line from more than one CP. EX guarantees that no other cache can have a copy of the line for access. A typical implementation of this multiprocessor cache scheme is as follows. Upon the fetch of a line L, the line is brought into the cache with either RO or EX state (depending on the particular instance and the particular cache scheme). When, however, a store is requested on a line, the system should guarantee the EX state to the line before the line can be stored into. This granting of the EX state may involve XI actions to invalidate copies of the line from other caches. When a CP, for example P.sub.2, has a line L held EX in its cache and P.sub.1 wants to access L, the Storage Control Element (SCE) ensures that P.sub.1 is allowed to fetch L into its cache. The sequence of steps between the time the SCE signals P.sub.2 to give up EX state on L and the point when the SCE receives the signal that P.sub.2 has given up the EX state, with all pending stores updated to memory, is called a clearing procedure. The purpose of a clearing procedure is to have the XI target CP give up its EX control of the line and to get any possible uncaptured stores to the line updated to the memory. The XI-hit to Remote EX (XIEX) situation described above, however, causes heavy performance penalties due to the delay of clearing procedures, and these penalties grow as XI frequencies rise with more CP's.
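The clearing procedure for an XIEX can be sketched as follows. This is a simplified model under stated assumptions: the names CP, clearing_procedure, and fetch are hypothetical, the shared storage is modelled as a dictionary, and the owning CP is assumed to invalidate its copy when it gives up EX state (the text does not say whether the copy is invalidated or demoted).

```python
class CP:
    """A processor with a private store through cache (simplified model)."""

    def __init__(self, name):
        self.name = name
        self.state = {}           # line -> "RO" or "EX"; an absent line is INV
        self.pending_stores = []  # (line, value) stores not yet in shared storage


def clearing_procedure(owner, line, memory):
    """Have the XI target update its pending stores to the line, then give up EX."""
    remaining = []
    for target, value in owner.pending_stores:
        if target == line:
            memory[target] = value  # update the uncaptured store to memory
        else:
            remaining.append((target, value))
    owner.pending_stores = remaining
    owner.state.pop(line, None)  # assumption: the owner's copy becomes INV


def fetch(requester, line, cps, memory):
    """Fetch a line, first clearing any remote CP that holds the line EX."""
    for cp in cps:
        if cp is not requester and cp.state.get(line) == "EX":
            clearing_procedure(cp, line, memory)  # requester waits on this step
    requester.state[line] = "RO"
    return memory[line]
```

In this model the requesting CP obtains the line only after clearing_procedure returns, which is the delay that the XIEX penalty refers to.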
From workload analysis it has been observed that when an XIEX occurs it is very rare for the remote CP (owning the line L) to generate a store in a small time interval around the XIEX event. Most of the modifications on shared lines tend to occur over tens of references away from the actual ping pong point. As a result, upon an XIEX activity, the copy of the line in the fast shared storage is most likely to be valid for the requesting CP to use even before the clearing process is done. Consequently, in such an environment, heavy penalties from the clearing procedures for an XIEX are mostly unnecessary and unproductive.
Another known technique in modern processor design is conditional instruction execution based on branch prediction. With such a design, instruction streams may be fetched for decode and execution based on prediction of branch instruction outcome. In case instructions are initiated incorrectly based on a wrong prediction, they can be aborted later. Prior to the confirmation of an instruction, any store request resulting from the conditional execution will be held in a Pending Store Stack (PSS) for final release upon finish. Both instruction finishes and pending store releases are done in the order of architectural sequence, although instructions may be executed out of incoming sequence prior to completion. When a conditional instruction stream is aborted, all the relevant instruction queue entries and pending stores in the PSS are reset properly. There is no known art that allows instructions to be executed conditionally based on storage data that is possibly invalid due to cache coherence reasons.
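The behavior of the Pending Store Stack can be illustrated with a minimal sketch. It assumes that instructions carry integer identifiers that increase in architectural sequence; the class PendingStoreStack and its methods hold, release, and abort_from are hypothetical names introduced for illustration.

```python
from collections import deque


class PendingStoreStack:
    """Holds stores from conditionally executed instructions until they finish."""

    def __init__(self):
        # Entries are (instr_id, address, value), kept in architectural order.
        self.entries = deque()

    def hold(self, instr_id, address, value):
        """Queue a store produced by an instruction that is not yet confirmed."""
        self.entries.append((instr_id, address, value))

    def release(self, instr_id, memory):
        """Release the stores of the oldest instruction once it has finished."""
        while self.entries and self.entries[0][0] == instr_id:
            _, address, value = self.entries.popleft()
            memory[address] = value

    def abort_from(self, instr_id):
        """Reset the stores of instr_id and of every later instruction."""
        self.entries = deque(e for e in self.entries if e[0] < instr_id)
```

Under these assumptions, stores reach memory only through release, so an aborted instruction stream leaves memory unchanged.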
There is no known art directed to minimizing the delays caused by XIEX through anticipatory data access. All known methods of MP cache design allow a CP to access a cache line only when the line has already been cleared for architecture consistency. In an XIEX situation the requesting CP can access the cache line only after the remote CP holding the EX state on the line has released its EX control and the line is then allowed to be fetched to the requesting CP's cache.