1. Field of the Invention
The invention relates generally to caches in a multiprocessor environment and more particularly to a method for fetching potentially dirty lines of data from a cache.
2. Description of the Prior Art
Modern high performance stored program digital computers conventionally fetch instructions and data from main memory and store the fetched instructions and data in a cache memory. A cache is a local memory that is typically much smaller and much faster than the main memory of the computer. Virtually all high performance digital computers use a cache and even some commercially available microprocessors have local caches.
Caches were developed because it has not been possible to build extremely large memories at a reasonable cost with an access time commensurate with modern pipelined processors. It is, however, possible to build less expensive, small memories that can keep up with the processor. Since an instruction and its needed data in the cache can be immediately accessed by the processor, caches usually speed up computer performance.
Normally a processor (CP) accesses main storage (MS) data through its cache. A cache is usually organized as a 2-dimensional array, in which each array entry contains a fixed size block of MS data called a line. The directory of a cache describes the addressing information for its lines. When an instruction or data accessed by the CP is found in the cache via directory lookup, the access is said to hit the cache. Otherwise the access misses in the cache. Upon a cache miss the cache control will generate a request to move the requested line into the cache. When a line is inserted into the cache it may replace an existing line. A cache is normally managed with certain replacement strategies such as the well known Least-Recently-Used (LRU) replacement algorithm. Depending on the cache design, the replacement of a line from cache may require update of the replaced contents to MS in order to maintain consistency of the storage.
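The directory lookup, miss handling and LRU replacement described above can be sketched as follows. This is a minimal illustration only, not the design of any cited system: the class name LineCache, the 128-byte LINE_SIZE, and the fully associative organization (rather than the 2-dimensional array described above) are assumptions made for brevity, and main storage is modeled as a plain dictionary keyed by line address.

```python
from collections import OrderedDict

LINE_SIZE = 128  # bytes per line; an assumed value for illustration


class LineCache:
    """Toy fully associative cache of fixed-size lines with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity          # maximum number of resident lines
        self.directory = OrderedDict()    # line address -> data, least recent first

    def access(self, address, main_storage):
        """Return (hit, data) for the line containing `address`."""
        line_addr = address // LINE_SIZE
        if line_addr in self.directory:
            # Hit: mark the line as most recently used.
            self.directory.move_to_end(line_addr)
            return True, self.directory[line_addr]
        # Miss: evict the least recently used line if the cache is full,
        # then move the requested line in from main storage.
        if len(self.directory) >= self.capacity:
            self.directory.popitem(last=False)
        data = main_storage.get(line_addr)
        self.directory[line_addr] = data
        return False, data
```

In this sketch an evicted line is simply dropped; a store-in design would first have to write a changed line back to main storage.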
Caches can be used in both multiprocessor and uniprocessor systems. In the type of multiprocessor (MP) system known as the tightly coupled multiprocessor system in which several CPs have their own caches that share a common operating system and memory, additional problems arise since it is necessary for each processor's cache to know what has happened to lines that may be in several caches simultaneously. In a multiprocessor system where there are many CPs sharing the same main storage, each CP is required to obtain the most recently updated version of data according to architecture specifications when access is issued. This requirement necessitates constant monitoring of data consistency among caches, often known as the cache coherence problem.
There are various types of caches in prior art multiprocessor systems. One type of cache is the store through (ST) cache as described in U.S. Pat. No. 4,142,234 assigned to the assignee of the present invention. Such a cache may be found in the IBM System/370 Model 3033 MP. ST cache design does not interfere with the CP storing data directly to the main storage (or second level cache) in order to always update changes of data to main storage. Upon the update of a store through to main storage appropriate cross invalidate actions may take place to invalidate possible remote copies of the stored cache line. The storage control element (SCE) maintains proper store stacks to queue the MS store requests and standard communications between a buffer control element (BCE) and the SCE will avoid store stack overflow conditions. When the SCE store stack becomes full the associated BCE will hold its MS stores until the condition is cleared.
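The store stack behavior described above, in which the BCE holds its MS stores while the SCE store stack is full, can be sketched as a bounded queue. The class name StoreStack and its methods are hypothetical and only illustrate the back-pressure idea; the actual BCE/SCE communication in the cited patent is not modeled.

```python
from collections import deque


class StoreStack:
    """Toy SCE store stack: a bounded FIFO of pending main-storage stores."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()

    def is_full(self):
        return len(self.entries) >= self.depth

    def push(self, address, value):
        """Queue a store; return False if the BCE must hold the store for now."""
        if self.is_full():
            return False
        self.entries.append((address, value))
        return True

    def drain_one(self, main_storage):
        """Apply the oldest queued store to main storage, freeing one slot."""
        if self.entries:
            address, value = self.entries.popleft()
            main_storage[address] = value
```

A caller that receives False from push would retry the same store after a later drain_one call has cleared the full condition.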
Another type of cache design is the store-in cache (SIC). SICs are described in U.S. Pat. Nos. 3,735,360 to Anderson et al. and 3,771,137 to Warner et al. A SIC cache directory is also described in detail in U.S. Pat. No. 4,394,731 to Flusche et al., in which each line in a store-in cache has its multiprocessor shareability controlled by an exclusive/read only (EX/RO) flag bit. The main difference between ST and SIC caches is that all stores in a SIC are directed to the cache itself (which may cause a cache miss if the stored line is not in the SIC cache). It is also proposed in U.S. Pat. No. 4,503,497 that data transfers upon a miss fetch can take place through a cache to cache transfer bus (CTC) if a copy is in the remote cache. An SCE is used that contains copies of the directories in each cache. This permits cross interrogate (XI) decisions to be resolved at the SCE. Usually cache line modifications are updated to main storage only when the lines are replaced from the cache.
A cache line that is RO is valid only in a read only state. The processor can only fetch from the line. Stores into the line are prohibited. A RO cache line may be shared simultaneously among different caches.
A cache line that is EX is valid but only appears in the cache of one processor. It is not resident in any other (remote) cache. Only the (owning) processor is allowed to store into the line.
A cache line that is CH is not only valid and EX but has also been stored into (i.e., CHanged). That is, the copy in main storage may not be up to date. When a CH line is replaced, a copy is sent to main storage via a castout action.
An INV cache line is a line that is invalid.
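The four line states just described (RO, EX, CH and INV) and the operations each one permits can be summarized in a short sketch. The enum and helper function names below are hypothetical; they only restate the rules given in the text.

```python
from enum import Enum


class LineState(Enum):
    INV = "invalid"
    RO = "read only"
    EX = "exclusive"
    CH = "changed"


def can_fetch(state):
    """Any valid line may be fetched from."""
    return state is not LineState.INV


def can_store(state):
    """Only an exclusively held line (EX or CH) may be stored into."""
    return state in (LineState.EX, LineState.CH)


def may_be_shared(state):
    """Only RO lines may reside in several caches at once."""
    return state is LineState.RO


def needs_castout(state):
    """A changed line must be sent to main storage when it is replaced."""
    return state is LineState.CH


def state_after_store(state):
    """A store into an EX or CH line leaves the line in the CH state."""
    if not can_store(state):
        raise PermissionError(f"store not allowed in state {state.name}")
    return LineState.CH
```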
In a typical computer system a first CP, P.sub.1, may access an instruction or data from a line in a cache. Its own cache will be checked and if the particular line requested is read only (RO) it may make a store request, and via the storage control element (SCE), make that line exclusive (EX). Once the line is made exclusive, the storage control element (SCE) indicates to the other caches that the line is invalid and the first cache will be free to write into that line.
In the multiprocessor cache environment a problem known as the Cross-Interrogate (XI) problem occurs as a result of relatively close accesses of the same data line by different processors (CPs). For instance, if a line L is modified by CP P.sub.1, other CPs may fetch a dirty copy of L if L is fetched from memory before the modifications by P.sub.1 are updated to the memory.
For illustration purposes, the following considers a multiprocessor system in which there are N CPs {P.sub.i | 1 ≤ i ≤ N} and a private cache C.sub.i for each P.sub.i. For purposes of the present discussion, a memory hierarchy is assumed in which the shared main memory is the level directly below the private caches.
One major problem with ST cache design is the store traffic generated by all CPs in the system. A trend, however, in future MP systems is the availability of high performance shared storage among all processors. An example of such fast shared storage is the shared second level cache (L2). With the provision of such high performance shared storage it becomes attractive to implement MP systems with ST caches while still supporting more CPs. Yet another problem with ST design is the busy store handshaking with the SCE, as illustrated in U.S. Pat. No. 4,142,234. In such a design the data item being stored by a CP cannot be fetched by the same CP until the CP receives acknowledgement of the store from the SCE. Such busy handshaking not only slows down the processor pipeline operation but also makes it difficult for the SCE to efficiently serialize all the stores when there are more CPs.
One known approach to the busy store handshake problem for ST design is to employ the EX/RO states from SIC design. Consider a store-thru cache MP environment in which at any moment, a cache line may have any one of the three states INV, RO or EX. INV indicates invalidity. RO indicates the possibility of simultaneous access of different copies of the line from more than one CP. EX guarantees that no other cache can have a copy of the line for access. A typical implementation of this multiprocessor cache scheme is as follows. Upon the fetch of a line L the line is brought into the cache with either RO or EX state (depending on the particular instance and the particular cache scheme). When, however, a store is requested on a line, the system should guarantee the EX state to the line before the line can be stored into. This granting of the EX state may involve XI actions to invalidate copies of the line from other caches. When a CP, for example, P.sub.1, stores into a line L held RO in its local cache, its buffer control element (BCE) will request EX status for L before the store can be putaway into the cache. In a typical MP system, for example, the one described in U.S. Pat. Nos. 4,394,731 and 4,503,497, the cache is blocked from subsequent accesses till the EX status is acquired in order to guarantee data coherence. In certain MP designs, such holding of cache access upon EX status request causes significant performance penalties.
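The EX status acquisition described above can be sketched with a toy model of the SCE that tracks the state of each line in each cache. The class ToySCE and its methods are hypothetical. Two simplifications are assumed here and are not taken from the cited patents: a fetch always brings the line in as RO, and a fetch by another CP demotes a remote EX holder to RO (which is safe in a store-through design because main storage is kept up to date).

```python
class ToySCE:
    """Toy storage control element for INV/RO/EX states in a store-through MP."""

    def __init__(self, num_caches):
        # One dictionary per cache: line -> "RO" or "EX"; an absent line is INV.
        self.states = [dict() for _ in range(num_caches)]

    def state(self, cache_id, line):
        return self.states[cache_id].get(line, "INV")

    def fetch(self, cache_id, line):
        """Bring `line` into a cache as RO (assumed policy, see lead-in)."""
        for other, held in enumerate(self.states):
            if other != cache_id and held.get(line) == "EX":
                held[line] = "RO"  # remote EX holder is demoted to RO
        if self.state(cache_id, line) == "INV":
            self.states[cache_id][line] = "RO"

    def request_ex(self, cache_id, line):
        """Grant EX to one cache; return the ids of the caches invalidated by XI."""
        invalidated = []
        for other, held in enumerate(self.states):
            if other != cache_id and line in held:
                del held[line]  # cross-invalidate the remote copy
                invalidated.append(other)
        self.states[cache_id][line] = "EX"
        return invalidated

    def store(self, cache_id, line):
        """Guarantee EX before the store; return the caches invalidated, if any."""
        if self.state(cache_id, line) == "EX":
            return []
        return self.request_ex(cache_id, line)
```

For example, after caches 0 and 1 both fetch line "L", a store by cache 0 invalidates the copy in cache 1 and leaves cache 0 holding "L" in the EX state. In the conventional designs discussed above, the storing CP's cache would remain blocked for the duration of the request_ex step.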
The primary reason, in more conventional MP designs, for preventing subsequent cache accesses at a CP while its BCE is waiting for EX status of a line is that a subsequent fetch may become obsolete due to a store invalidate from a remote CP. For example, consider an instruction stream <... I.sub.i ... I.sub.j> at a CP. Assume that I.sub.i triggers an EX status request for a line L, and assume that I.sub.j fetches a doubleword A before the EX status is acquired for L. If, by the time EX status is acquired on L for the store from I.sub.i, the line containing A is invalidated due to a store from a remote processor, the execution of I.sub.j may cause an architecture violation due to its access of A. From workload analysis it has been observed that, in a typical design in which EX status can be acquired reasonably quickly, the chance for a CP to use remotely invalidated data during the window for EX status acquisition is rather slim. As a result, preventing a cache from being accessed while a CP is acquiring EX status of a line will most likely hold the CP execution unnecessarily and unproductively.
Another known technique in modern processor design is conditional instruction execution based on branch prediction. With such a design, instruction streams may be fetched for decode and execution based on a prediction of the branch instruction outcome. If instructions are initiated incorrectly because of a wrong prediction, they can be aborted later. Prior to the confirmation of an instruction, any store request resulting from the conditional execution is held in a Pending Store Stack (PSS) for final release upon finish. Both instruction finishes and pending store releases are done in the order of architectural sequence, although instructions may be executed out of incoming sequence prior to completion. When a conditional instruction stream is aborted, all the relevant instruction queues and pending stores in the PSS are reset properly.
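The Pending Store Stack behavior described above can be sketched as follows. The class PendingStoreStack, its method names, and the use of an integer sequence number to represent architectural order are assumptions made for illustration. Stores may be entered out of order but are released to memory in architectural sequence, and an abort discards the stores of the cancelled stream.

```python
class PendingStoreStack:
    """Toy PSS: holds stores of unconfirmed instructions until they finish."""

    def __init__(self):
        self.pending = []  # entries are (sequence number, address, value)

    def hold(self, seq, address, value):
        """Record a store from conditionally executed instruction `seq`."""
        self.pending.append((seq, address, value))

    def finish(self, seq, memory):
        """Release the stores of all instructions up to `seq`, in sequence order."""
        ready = sorted((e for e in self.pending if e[0] <= seq), key=lambda e: e[0])
        self.pending = [e for e in self.pending if e[0] > seq]
        for _, address, value in ready:
            memory[address] = value
        return len(ready)

    def abort_from(self, seq):
        """Discard the stores of instructions at or after `seq`; return the count."""
        kept = [e for e in self.pending if e[0] < seq]
        dropped = len(self.pending) - len(kept)
        self.pending = kept
        return dropped
```

In this sketch, finish(seq, memory) assumes that every instruction up to and including seq has been confirmed, so memory never observes a store from an aborted stream.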
There is no known art directed to minimizing the delays caused by EX status acquisition through anticipatory subsequent data access. All known methods of MP cache design allow a CP to access a cache line only when there is no pending EX status request. Before an ongoing EX status request is complete, the CP cache is prevented from being accessed.