1. Field of the Invention
This invention generally relates to techniques for reducing data misses in large cache memories in a multi-processor (MP) data processing system and, more particularly, to mechanisms for data prefetching in multi-processor caches based on store information.
2. Description of the Prior Art
High performance MP computer systems are being developed to increase throughput by performing in parallel those operations which can run concurrently on separate processors. Such systems are characterized by multiple central processors (CPs) operating independently and in parallel, but occasionally communicating with one another or with a main storage (MS) when data needs to be exchanged. The CPs and the MS have input/output (I/O) ports which must be connected to exchange data.
In the type of MP system known as the tightly coupled multi-processor system, in which each of the CPs has its own cache, there exist coherence problems at various levels of the system. More specifically, inconsistencies can occur between adjacent levels of a memory hierarchy. The multiple caches could, for example, possess different versions of the same data because one of the CPs has modified its copy. It is therefore necessary for each processor's cache to know what has happened to lines that may be in several caches at the same time. In an MP system where many CPs share the same main storage, each CP is required to obtain the most recently updated version of data, according to architecture specifications, when an access is issued. This requirement necessitates constant monitoring of data consistency among caches.
A number of solutions have been proposed to the cache coherence problem. Early solutions are described by C. K. Tang in "Cache System Design in the Tightly Coupled Multiprocessor System", Proceedings of the AFIPS (1976), and L. M. Censier and P. Feautrier in "A New Solution to Coherence Problems in Multicache Systems", IEEE Transactions on Computers, Dec. 1978, pp. 1112 to 1118. Censier et al. describe a scheme allowing shared writable data to exist in multiple caches which uses a centralized global access authorization table. However, as the authors acknowledge in their Conclusion section, they were not aware of the similar approach described by Tang two years earlier. While Tang proposed using copy directories of caches to maintain status, Censier et al. proposed to tag each memory block with similar status bits.
A typical approach to multi-processor (MP) cache coherence is as follows. When a processor needs to modify (store into) a cache line, it first makes sure that copies of the line in remote caches are invalidated. This is achieved either by broadcasting the store signal to remote processors (for instance, through a common bus connecting all processors) or by requesting permission from a centralized storage function (for instance, the storage control element (SCE) in IBM 3081 systems). The process of invalidating a cache line that may or may not exist in remote processor caches is called cross-interrogate invalidate (XI-invalidate). Various design techniques have been proposed for the reduction of such XI-invalidate signals. For example, in IBM 3081 systems, exclusivity (EX) states at processor caches are used to record the information that the associated lines are not resident in remote caches and therefore do not require XI-invalidate activities when stored into from the caches owning the exclusivity states.
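The store sequence described above can be illustrated with a minimal sketch. It is not the IBM 3081 design; the names Cache, Sce, request_exclusive and store, and the two line states "RO" and "EX", are hypothetical and chosen only for illustration. The sketch assumes a centralized storage function that sends one XI-invalidate signal to every remote cache whenever a processor stores into a line it does not already hold exclusively.

```python
class Cache:
    """Toy per-processor cache: maps a line address to 'RO' or 'EX'."""

    def __init__(self, name):
        self.name = name
        self.lines = {}


class Sce:
    """Hypothetical centralized storage function that grants store permission."""

    def __init__(self, caches):
        self.caches = caches
        self.xi_invalidates = 0  # number of XI-invalidate signals sent

    def request_exclusive(self, requester, addr):
        # Signal every remote cache; the line may or may not be resident there.
        for cache in self.caches:
            if cache is not requester:
                self.xi_invalidates += 1
                cache.lines.pop(addr, None)
        # The requester now owns the line exclusively.
        requester.lines[addr] = "EX"


def store(sce, cache, addr):
    """Store into a line, obtaining exclusivity first when it is not held."""
    if cache.lines.get(addr) != "EX":
        sce.request_exclusive(cache, addr)
    # With the EX state held, the store proceeds locally with no XI activity.


cp_a, cp_b = Cache("A"), Cache("B")
sce = Sce([cp_a, cp_b])
cp_b.lines[0x100] = "RO"   # CP B holds a read-only copy
store(sce, cp_a, 0x100)    # CP A stores: CP B's copy is XI-invalidated
store(sce, cp_a, 0x100)    # second store hits the EX state: no new signal
```

In this sketch the second store generates no XI-invalidate signal, which is the saving that the EX state is meant to provide.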
One inherent overhead in conventional MP cache designs is the extra misses due to XI-invalidates. That is, a processor access to its cache may find a line missing that would have been present had it not been XI-invalidated by a remote processor before the access. This problem becomes more serious as larger caches are used with more central processors (CPs). Simulation results indicate that such extra misses are mostly on data lines (D-lines), as opposed to instruction lines (I-lines). With large caches, miss ratios are rather satisfactory in a uni-processor (UP) environment. To reduce the extra misses due to remote stores, one approach is to prefetch D-lines that are potentially invalidated by remote CPs.
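The notion of an extra miss can be made concrete with a small trace-driven sketch. It is only an illustration under stated assumptions and is not the simulation referred to above: each CP is assumed to have a cache of unlimited capacity, so that every miss is either a first reference or the result of an earlier XI-invalidate, and the function name count_misses and the trace format (cp, op, addr) are hypothetical.

```python
def count_misses(trace, num_cps):
    """Return (total_misses, extra_misses) for a trace of (cp, op, addr).

    Assumption: caches never replace lines, so a line leaves a cache
    only when a remote store XI-invalidates it.
    """
    resident = [set() for _ in range(num_cps)]  # lines held by each CP
    xi_lost = [set() for _ in range(num_cps)]   # lines lost to XI-invalidate
    misses = extra = 0
    for cp, op, addr in trace:
        if addr not in resident[cp]:
            misses += 1
            if addr in xi_lost[cp]:
                # The line would still be resident without the remote store.
                extra += 1
                xi_lost[cp].discard(addr)
            resident[cp].add(addr)
        if op == "store":
            # XI-invalidate any copies of the line held by remote CPs.
            for other in range(num_cps):
                if other != cp and addr in resident[other]:
                    resident[other].discard(addr)
                    xi_lost[other].add(addr)
    return misses, extra


trace = [
    (0, "load", 1),   # first reference by CP 0: miss
    (1, "load", 1),   # first reference by CP 1: miss
    (0, "store", 1),  # hit in CP 0; CP 1's copy is XI-invalidated
    (1, "load", 1),   # extra miss caused by the remote store
]
total, extra = count_misses(trace, num_cps=2)
```

In this model the lines recorded in xi_lost are the natural candidates for the prefetching approach mentioned above, since they are the lines a CP held until a remote store removed them.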