The use of caches for performance improvement in computing systems is well known and widespread. A cache is a high speed buffer which holds recently used memory data. Because programs exhibit locality of reference, most data accesses can be satisfied from the cache, in which case slower accesses to bulk memory are avoided.
In typical high performance processor designs, the cache access path forms the critical path. That is, the cycle time of the processor is determined by how fast cache accessing can be carried out.
A cache may logically be viewed as a table of data blocks or data lines, in which each table entry covers a particular block or line of memory data. (Hereinafter the storage unit for a cache will be referred to as a line rather than a block.) The implementation of a cache is normally accomplished through three major portions: Directory, Arrays and Control. The directory contains the address identifiers for the cache line entries, plus other necessary status tags suitable for particular implementations. The arrays (sometimes called cache memory herein) store the actual data bits, with additional bits for parity checking or for error correction as required in particular implementations. The control circuits provide necessary logic for the management of cache contents and accessing. Upon an access to the cache, the directory is looked up to identify the residence of the requested data line. A cache hit results if it is found, and a cache miss results otherwise. Upon a cache hit, the data may be accessed from the array if there is no prohibiting condition (e.g., key protection violation). Upon a cache miss, the data line is normally fetched from the bulk memory and inserted into the cache first, with the directory updated accordingly, in order to satisfy the access through the cache. Since a cache only has capacity for a limited number of line entries and is relatively small compared with the bulk memory, replacement of existing line entries is often needed. The replacement of cache entries is normally based on algorithms like the Least-Recently-Used (LRU) scheme. That is, when a cache line entry needs to be replaced, the line entry that was least recently accessed will be preferred.
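The hit, miss and LRU replacement behavior described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not any particular hardware design: the class name `SetAssociativeCache`, the `fetch_line` callback standing in for bulk memory, and the modulo mapping of line addresses to congruence classes are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy model: the directory holds tags, the arrays hold line data."""

    def __init__(self, num_classes, ways, line_bytes):
        self.num_classes = num_classes
        self.ways = ways
        self.line_bytes = line_bytes
        # One OrderedDict per congruence class, mapping tag -> line data.
        # Insertion order doubles as recency order: first = LRU, last = MRU.
        self.classes = [OrderedDict() for _ in range(num_classes)]

    def access(self, addr, fetch_line):
        """Return (hit, line_data) for a byte address.

        fetch_line(line_addr) is a hypothetical stand-in for bulk memory.
        """
        line_addr = addr // self.line_bytes
        cc = line_addr % self.num_classes      # congruence class (assumed mapping)
        tag = line_addr // self.num_classes    # address identifier kept in the directory
        entries = self.classes[cc]
        if tag in entries:
            # Cache hit: mark the entry as most recently used.
            entries.move_to_end(tag)
            return True, entries[tag]
        # Cache miss: replace the least recently used entry if the class is full,
        # then insert the fetched line and satisfy the access through the cache.
        if len(entries) >= self.ways:
            entries.popitem(last=False)
        entries[tag] = fetch_line(line_addr)
        return False, entries[tag]
```

In this sketch a line that has just been inserted or hit becomes the MRU entry of its congruence class, so the entry evicted on a later miss is always the one that was least recently accessed.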
In order to facilitate efficient implementations, a cache is normally structured as a 2-dimensional table (FIG. 1). The number of rows is called the set-associativity, and each column is called a congruence class. For each data access, a congruence class is selected using certain memory address bits of the access, and the data may be accessed at one of the line entries in the selected congruence class if it hits there. It is usually too slow to have the cache directory searched first (with parallel address compares) to identify the set position (within the associated congruence class) and then to have the data accessed from the arrays at the found location. Such sequential processing normally requires 2 successive machine cycles to perform, which degrades processor performance significantly. A popular approach, called late-select, achieves a directory search and array data accessing in one cycle as follows. Consider the fetch of a data unit (e.g., a doubleword) by an execution element. Without the knowledge of the exact set position for access, the array control will retrieve candidate data units from lines at all set positions in the congruence class first, while the directory is looked up. Upon a cache hit, the directory control signals the final selection of one of those retrieved data units and sends it to the requesting execution element. Although the conventional late select technique allows much overlap between directory look up and array access, the final data selection can only be done after the directory search is done and the results are passed to the selection unit. Another deficiency of the late select method is that multiple data units accessed out of arrays can only cover at most one actually useful data unit, with the rest of the accessed units being wasted. 
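The late-select flow can be sketched in software as follows. This is only an illustration: real hardware performs the array read and the directory compare concurrently, whereas the code below performs them one after the other. The function name `late_select_fetch`, the nested-list layout of `directory` and `arrays`, and the returned count of unused data units are assumptions introduced here.

```python
def late_select_fetch(directory, arrays, cc, tag, dw):
    """Model one late-select access.

    directory[cc][way] holds the tag of the line at that set position;
    arrays[cc][way][dw] holds one data unit (e.g., a doubleword).
    Returns (data, wasted), where wasted counts candidate data units that
    were read out of the arrays but not delivered to the requester.
    """
    ways = len(arrays[cc])
    # Array side: read a candidate data unit from every set position,
    # because the correct position is not yet known.
    candidates = [arrays[cc][way][dw] for way in range(ways)]
    # Directory side: compare the requested tag against every entry.
    matches = [way for way, t in enumerate(directory[cc]) if t == tag]
    if not matches:
        return None, ways            # cache miss: every read was futile
    # Late select: the directory result gates one candidate to the requester.
    return candidates[matches[0]], ways - 1
```

For a 4-way congruence class, a hit in this model always leaves three of the four retrieved data units unused, which is the array I/O waste the text refers to.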
In higher performance processors it is often critical for the cache to support multiple independent accesses from I/E-units, for which wasting I/O on futile array accessing becomes a design bottleneck.
Another aspect that complicates cache design is the commonly employed virtual addressing architecture in computer systems. In a virtual memory system (e.g., the IBM/390 architecture) each user process may have the view that it has its own virtual address space. Upon execution of programs the operating system may dynamically allocate real memory pages (e.g., 4 kilobytes per page) to more actively accessed virtual address pages. When a page accessed from a program does not have a real memory page allocated for it, an exception (page fault) condition occurs and triggers the operating system to allocate a real memory page frame. Page fault processing is normally associated with a very high performance overhead and often requires data accessing from slower backing devices like disks. However, due to strong program locality, a reasonable operating system can maintain a very low page fault rate during program executions. The operating system normally maintains the real page allocation information in architecturally specific software tables. Typically a 2-level translation table structure with segment and page tables is used for this purpose. Each program space has its own segment table, in which each entry points to a page table. In a page table, each entry records the real page allocation information, plus some other status tags needed for particular architectures.
The operating system manages such translation tables according to its design algorithms. One consequence of the employment of virtual addressing is that the same virtual page address from different program address spaces may not be logically related and may be allocated at different real page frames in the storage. Furthermore, in architectures like the IBM/390, the same real page frame may be accessed through different virtual addresses from different programs or processors. With all these architectural requirements, most systems require a step called virtual address translation for processing storage accesses from processors. Virtual address translation translates a virtual page address into a real page address. A page fault exception is triggered if the real page frame is not allocated, for which the operating system will update the translation information when allocation is complete and then allow the faulted program to resume execution.
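A minimal sketch of 2-level translation with page fault handling is given below. The dictionary-based tables, the `PageFault` exception, the `allocate_frame` callback and the value of `PAGES_PER_SEGMENT` are illustrative assumptions, not the table formats of any real architecture; only the 4 kilobyte page size is taken from the text.

```python
PAGE_BYTES = 4096          # 4 kilobyte pages, as in the example above
PAGES_PER_SEGMENT = 256    # assumed segment size, for illustration only


class PageFault(Exception):
    """Raised when no real page frame is allocated for the virtual page."""


def translate(segment_table, vaddr):
    """Translate a virtual address through segment and page tables.

    segment_table maps segment index -> page table (a dict);
    each page table maps page index -> real page frame number.
    """
    vpage, offset = divmod(vaddr, PAGE_BYTES)
    seg, page = divmod(vpage, PAGES_PER_SEGMENT)
    page_table = segment_table.get(seg)
    if page_table is None or page not in page_table:
        raise PageFault(vaddr)
    return page_table[page] * PAGE_BYTES + offset


def access(segment_table, vaddr, allocate_frame):
    """Translate, letting a stand-in 'operating system' resolve page faults."""
    try:
        return translate(segment_table, vaddr)
    except PageFault:
        # The OS allocates a frame, updates the translation tables,
        # and the faulted access is then retried.
        vpage = vaddr // PAGE_BYTES
        seg, page = divmod(vpage, PAGES_PER_SEGMENT)
        segment_table.setdefault(seg, {})[page] = allocate_frame()
        return translate(segment_table, vaddr)
```

Because each program space has its own `segment_table` in this model, the same virtual address can translate to different real page frames in different spaces.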
In most modern systems, hardware facilities are used for speeding up the virtual address translation process. Typically a Translation Lookaside Buffer (TLB) is employed for each processor. A TLB is a hardware directory table that records the translation information for actively accessed virtual pages. Due to the locality of program addressing, a relatively small TLB (e.g., with 64-1024 page entries) can capture the translation information for the great majority (e.g., over 99.95%) of storage accesses from a processor. Only upon a TLB miss condition (i.e., when the TLB cannot cover the particular storage access) will a slower translation (e.g., through microcode or the operating system) be activated. For efficiency in hardware implementation, a TLB is normally structured as a set-associative table like the cache directory. For a given virtual page address (including certain program space identifiers), the hardware uses certain address bits (and other information specific to a particular design) to derive a congruence class. Within the congruence class, the hardware performs a parallel search of the entries and identifies the results of translation.
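The TLB behavior described above may be sketched as a small set-associative table. The class `TLB`, the helper `translate_with_tlb`, the modulo selection of the congruence class and the `slow_translate` callback (standing in for microcode or operating system translation) are hypothetical names and choices used only for illustration.

```python
class TLB:
    """Toy set-associative TLB keyed by (space identifier, virtual page)."""

    def __init__(self, num_classes, ways):
        self.num_classes = num_classes
        self.ways = ways
        # Each congruence class is a list of (space_id, vpage, frame),
        # ordered from most to least recently used.
        self.sets = [[] for _ in range(num_classes)]

    def lookup(self, space_id, vpage):
        """Return the real page frame, or None on a TLB miss."""
        entries = self.sets[vpage % self.num_classes]
        for i, (s, v, frame) in enumerate(entries):
            if s == space_id and v == vpage:
                entries.insert(0, entries.pop(i))   # promote to MRU
                return frame
        return None

    def insert(self, space_id, vpage, frame):
        entries = self.sets[vpage % self.num_classes]
        entries.insert(0, (space_id, vpage, frame))
        del entries[self.ways:]                      # drop the LRU overflow


def translate_with_tlb(tlb, space_id, vpage, slow_translate):
    """Use the TLB first; fall back to the slower translation on a miss."""
    frame = tlb.lookup(space_id, vpage)
    if frame is None:
        frame = slow_translate(space_id, vpage)
        tlb.insert(space_id, vpage, frame)
    return frame
```

The loop in `lookup` models what hardware does with parallel comparators across the entries of one congruence class.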
In many processor designs, a storage access needs to go through TLB translation prior to the final resolution of a cache access. In most modern designs the TLB look-up is carried out in parallel with the cache directory search, with their results merged for final late-select of the array data. FIG. 2 depicts such a design. This requirement of multiple directory searches for the resolution of cache accessing has been a source of complexity for the optimization of many cache designs. All such complications are due to the fact that, when a processor issues a storage access request, the cache array cannot determine the exact location of data without knowing the results of conventional directory look-ups, which introduces signal delay. Unfortunately, due to various architectural and machine organizational reasons, such exact location for data access cannot be obtained easily. One side effect of the conventional late-select mechanism is the wasteful array I/O used in retrieving multiple data units for the possible final selection of at most one for the processor's use. As a result, it often causes difficulties in supporting concurrent multiple (independent) cache accesses in very high performance computers.
There have been many design proposals for implementing caches effectively. FIG. 3 outlines the IBM/3090 design of a 64 kilobyte (KB) processor cache for 31-bit logical addressing. The cache is 4-way set-associative with 128 congruence classes. The line size is 128 bytes. There is a cache directory DIR, cache memory data arrays ARR, and a 2-way set-associative TLB. The processor I/E-units and microcode issue storage access requests by a logical address. The logical address can be either virtual or real, depending upon the current mode of addressing at the processor. The more complicated case will be described for a doubleword (8 bytes) fetch request with virtual address from an I/E-unit. Among the bits 18-24 used for selecting the congruence class, two bits (18-19) are part of the page address. It can happen that, due to unpredictable translation results, these two bits get translated to 2 real address bits in any of the four possible combinations. Among the four congruence classes that may possibly contain the line being accessed, the one determined by the address bits in the currently accessed logical address is called the principal congruence class (PCC), and the other three are called synonym congruence classes. Although program locality will cause a great majority of cache accesses to hit the principal congruence class, there are still chances that the accessed line belongs to one of the other (synonym) congruence classes. This is the so-called synonym problem. In the IBM/3090 system design, the following steps are carried out in parallel:
1. Bits 18-31 of the logical access address are passed to the ARR control. Bits 18-24 are used to determine the principal cache congruence class. Then a doubleword (as indicated by bits 25-28) is read out of the cache arrays from each of the four line entries in the principal congruence class. These four doublewords will not be sent out to the requesting I/E-unit until a late-select signal is received.
2. Bits 18-24 are sent to DIR for cache directory look-up. Each DIR entry records the real address for the associated line. All 16 directory entries of the principal and synonym congruence classes are read out.
3. Certain virtual address bits (not elaborated here) are used by the TLB to select the congruence class, from which the real address translation information of the 2 TLB entries is read out.
The 16 real line addresses read out of the cache directory are then merged with the 2 real addresses read out of the TLB for address match via 32 comparators. (There is other tag matching involved that is not elaborated here.) When it is found that the translated address of the accessed line matches one of the cache directory real addresses, a cache hit condition results. Otherwise a cache miss occurs and triggers cache miss processing. Upon a cache hit, the line may or may not be in the principal congruence class. The following then is carried out by the cache control:
Principal Congruence Class (PCC) Hit--A signal is sent to the late-select logic to gate the selected doubleword to the requesting I/E-unit. The access is complete.
Synonym Congruence Class Hit--Proper steps will be taken to have the doubleword accessed from the synonym congruence class through later array fetching. This will result in longer delays to the access.
In the cache miss situation, the cache control will request a copy of the line from main storage. When the line comes back it will be placed in an allocated cache entry in the principal congruence class.
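The address fields and the comparator count in the IBM/3090 example can be checked with a short sketch. It assumes the bit numbering convention in which bit 0 is the most significant bit of a 32-bit word and bit 31 the least significant; the helper names `field`, `congruence_class`, `doubleword_index` and `candidate_classes` are hypothetical.

```python
def field(addr, first_bit, last_bit, width=32):
    """Extract bits first_bit..last_bit (inclusive).

    Assumes bit 0 is the most significant bit of a width-bit word.
    """
    shift = (width - 1) - last_bit
    nbits = last_bit - first_bit + 1
    return (addr >> shift) & ((1 << nbits) - 1)


def congruence_class(addr):
    return field(addr, 18, 24)       # 7 bits -> 128 congruence classes


def doubleword_index(addr):
    return field(addr, 25, 28)       # 4 bits -> 16 doublewords per 128-byte line


def candidate_classes(addr):
    """Return (principal class, list of the three synonym classes)."""
    pcc = congruence_class(addr)
    low5 = pcc & 0x1F                # bits 20-24 lie in the page offset (untranslated)
    every = [(hi << 5) | low5 for hi in range(4)]   # bits 18-19 may translate to any value
    return pcc, [c for c in every if c != pcc]


# Geometry from the text: 4 ways x 128 classes x 128-byte lines = 64 KB.
WAYS, CLASSES, LINE_BYTES, TLB_WAYS = 4, 128, 128, 2
CACHE_BYTES = WAYS * CLASSES * LINE_BYTES
DIR_ENTRIES_READ = WAYS * 4                   # principal + 3 synonym classes = 16
COMPARATORS = DIR_ENTRIES_READ * TLB_WAYS     # 16 x 2 = 32
```

The product in `COMPARATORS` shows why the comparator count grows with either the number of synonym classes (a larger cache with the same page size) or the TLB set-associativity.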
The IBM/3090 cache design reveals the following deficiencies. First of all, 32 comparators are used to resolve synonym conditions efficiently. Even more comparators will be required if the cache size expands or if the TLB set-associativity increases. The second problem is the longer cache access path due to the waiting of late-select for directory read/compare results.
There have been designs that employ certain types of prediction methods to avoid the above mentioned inefficiencies. The best known method is the direct-map cache design. A direct-map cache is one with a set-associativity of 1. Since there is only one line entry at each cache congruence class, the possibilities for a cache access are very limited. FIG. 4 describes a direct-map cache design by modifying the IBM/3090 approach. Similar to the IBM/3090 design, upon a cache miss the line is placed in the principal congruence class. For each logical address issued by the I/E-unit, the cache control extracts needed bits to select the principal congruence class, and the data is read from there directly to the requesting unit. In parallel, cache directory and TLB look-ups are done. In case of a cache hit to the principal congruence class, the directory compare logic will send a signal to the requesting I/E-unit to complete the access. Otherwise (a cache miss or a hit to a synonym congruence class) a signal is sent to the requesting unit to cancel the data it received from ARR and triggers proper actions. The synonym problem still exists as in the IBM/3090 design. There have also been real address cache designs that, upon a cache miss, insert a cache line into the congruence class determined by the real address bits after translation. Such an approach, however, loses the parallelism between array accessing and directory searches. The most serious drawback of the direct-map cache design is the poor cache hit ratio. It is well known that, for the same size cache, the cache hit ratio significantly improves as the set-associativity increases from 1 to 2 to 4.
Another recently proposed technique for prediction based cache design is the MRU cache design of J. H. Chang, H. Chao, and K. So, in "Cache Design of A Sub-Micron CMOS System/370," Proc. 14th Symposium on Computer Architecture, 1987, pp. 208-213. Cache replacement is normally managed on a per congruence class basis. Within each congruence class, there is a most-recently-used (MRU) entry and a least-recently-used (LRU) entry as indicated by proper replacement status tags. Due to program locality, a cache access is most likely to hit the MRU line entry. The LRU entry is the one chosen for replacement when a new line needs to be inserted into the congruence class. The MRU-cache approach logically views the MRU lines of the whole cache as a direct-map cache. The basic principle is that, whenever a congruence class is determined to be accessed, the data will be retrieved from the MRU line entry on a prediction basis. The confirmation or cancellation of the access, based on directory compare results, operates similarly to the direct-map cache approach. The MRU-cache design proposed a real-address cache, in which a missed line will be inserted into the congruence class associated with the real address bits (after translation). In order to facilitate the determination of a cache access by virtual address, the MRU-cache design applies similar techniques to predict the address translation information. That is, for a given virtual address, the real address bits at the MRU entry of the associated TLB congruence class are read out and passed to the array control before the real address is verified as the actual translation. The predicted real address bits from the TLB are used by the array control to select the (predicted) congruence class for MRU entry data read, while in parallel the TLB unit does comparisons to determine the correctness of the real address prediction. FIG. 5 depicts the MRU-cache design.
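The prediction and confirm/cancel principle of the MRU-cache approach can be sketched as below. This is an illustrative model only, not the cited design: the function name `mru_predicted_fetch`, the status strings, the data layout, and the handling of a hit at a non-MRU position (re-reading the correct entry and updating the MRU marker) are assumptions made here. The prediction of the TLB translation is not modeled.

```python
def mru_predicted_fetch(directory, arrays, mru_way, cc, tag, dw):
    """Model one access that predicts the MRU set position.

    directory[cc][way] holds tags, arrays[cc][way][dw] holds data units,
    and mru_way[cc] is the set position currently marked most recently used.
    Returns (status, data).
    """
    predicted = mru_way[cc]
    # Only ONE data unit is read from the arrays, from the predicted position,
    # and it is delivered before the directory compare has finished.
    speculative = arrays[cc][predicted][dw]
    matches = [way for way, t in enumerate(directory[cc]) if t == tag]
    if matches and matches[0] == predicted:
        return "confirm", speculative
    if matches:
        # Hit, but not at the MRU entry: cancel the delivered data and
        # (by assumption here) re-access the correct set position.
        mru_way[cc] = matches[0]
        return "cancel-hit", arrays[cc][matches[0]][dw]
    # Cache miss: cancel; miss processing would follow.
    return "cancel-miss", None
```

Compared with late-select, this model reads a single data unit per access, at the cost of a cancellation whenever a hit falls on a non-MRU entry.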
Compared with the direct-map approach, the MRU-cache design reduces cache miss probability by allowing more than 1-way set-associativity. However, the MRU prediction provides worse accuracy for cache access prediction. For a direct-map cache, there is a 100% accuracy of prediction when a cache hit occurs (cache misses cannot be satisfied anyway). But the accuracy for MRU prediction is limited by how likely the accesses are to hit the MRU entries. Consider the 64 KB cache described in the IBM/3090 design example. For a typical commercial workload a little over 90% of the actual cache hits can be resolved correctly by the prediction. The accuracy of prediction can become worse for different workloads. Also, simulation studies have shown that, when the MRU-cache approach is applied to data caches (i.e., caches that handle only operand accesses and not instruction code fetching) the accuracy of MRU entry prediction becomes much worse. Another deficiency of the MRU-cache design is the requirement of accessing the TLB (for the MRU entry) prior to the array accessing. This causes certain delays on the cache access critical path. Furthermore, complexity is involved in implementing it properly. Since a TLB is a relatively large directory, it normally cannot be placed very close to the cache arrays physically. Another consequence of the TLB size is the expense of implementing a TLB directory with fast circuits. Both factors make it difficult to optimize the timing of the real address prediction path, and hence the cache access critical path, in high speed computers.
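The notion of prediction accuracy used above, the fraction of actual cache hits that fall on the MRU entry, can be made concrete with a small counting sketch. The function `mru_prediction_accuracy`, the modulo mapping to congruence classes and the trace format (a sequence of line addresses) are hypothetical; the sketch produces no workload figures of its own and is not the simulation referred to in the text.

```python
def mru_prediction_accuracy(trace, num_classes, ways):
    """Count cache hits and the hits that land on the MRU entry.

    trace is an iterable of line addresses. Returns (hits, mru_hits);
    the prediction accuracy on hits is mru_hits / hits when hits > 0.
    """
    # Each congruence class is a list of tags ordered from MRU to LRU.
    sets = [[] for _ in range(num_classes)]
    hits = mru_hits = 0
    for line_addr in trace:
        cc, tag = line_addr % num_classes, line_addr // num_classes
        entries = sets[cc]
        if tag in entries:
            hits += 1
            if entries[0] == tag:
                mru_hits += 1        # the MRU prediction would have been correct
            entries.remove(tag)
        elif len(entries) == ways:
            entries.pop()            # miss in a full class: evict the LRU entry
        entries.insert(0, tag)       # the accessed line becomes MRU
    return hits, mru_hits
```

With `ways=1` every hit is necessarily an MRU hit, matching the observation that a direct-map cache predicts perfectly on hits but suffers more misses; with more ways the number of hits rises while some of them fall on non-MRU entries.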
The weaknesses of the prediction methods adopted in the direct-map and the MRU-cache approaches come from the fact that they both are carried out based on physical structures of the cache or the TLB. The direct-map design achieves ultimate simplicity and prediction accuracy by flattening the physical cache geometry (to a 1-dimensional structure), and hence increases cache misses. The MRU-cache approach requires that the predictions (for TLB and cache) be based on the physical MRU entries, and hence loses accuracy and causes difficulties in implementation.
The real essence of a good prediction approach to cache accessing is to employ proper histories to achieve high accuracies and efficient implementations. In order to access a cache with 2-dimensional structure two parameters need to be determined: 1) the congruence class, and 2) the line entry position (i.e., set position) within the congruence class. Both parameters may be accurately predicted with history tables or other means that are effectively implementable and independent of the actual cache geometry. A similar principle applies to the prediction of real address translations. There is no known prior art that utilizes this concept and provides effective cache accessing with flexibility on implementations.