1. Technical Field of the Invention
This invention generally relates to caches for computer systems, such as set associative caches and direct-mapped caches, and more particularly to addressing such caches with minimal increase in critical path delay.
2. Background Art
The use of caches for performance improvements in computing systems is well known and extensively used. See, for example, U.S. Pat. No. 5,418,922 by L. Liu for "History Table for Set Prediction for Accessing a Set Associative Cache", and U.S. Pat. No. 5,392,410 by L. Liu for "History Table for Prediction of Virtual Address Translation for Cache Access", the teachings of both of which are incorporated herein by reference.
A cache is a high speed buffer which holds recently used memory data. Because programs exhibit locality of reference, most data accesses can be satisfied from the cache, in which case slower accesses to bulk memory are avoided.
In typical high performance processor designs, the cache access path forms a critical path. That is, the cycle time of the processor is affected by how fast cache accessing can be carried out.
A cache may logically be viewed as a table of data blocks or data lines in which each table entry covers a particular block or line of memory data. (Hereinafter the storage unit for a cache may be referred to as a line or a block.) The implementation of a cache is normally accomplished through three major portions: directory, arrays and control. The directory contains the address identifiers for the cache line entries, plus other necessary status tags suitable for particular implementations. The arrays (sometimes called cache memory herein) store the actual data bits, with additional bits for parity checking or for error correction as required in particular implementations.
Cache control circuits provide necessary logic for the management of cache contents and accessing. Upon an access to the cache, the directory is accessed or "looked up" to identify the residence of the requested data line. A cache hit results if it is found in the cache, and a cache miss results otherwise. Upon a cache hit, the data may be accessed from the array if there is no prohibiting condition (e.g., a protection violation). Upon a cache miss, the data line is normally fetched from the bulk memory and inserted into the cache first, with the directory updated accordingly, in order to satisfy the access through the cache.
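The hit/miss flow just described can be sketched as follows. This is a minimal illustrative model, not the patented design: the directory and arrays are modeled as Python dicts keyed by line address, and `fetch_from_memory` is a hypothetical callback standing in for the bulk-memory fetch.

```python
# Illustrative sketch of a directory lookup: a matching directory entry is a
# cache hit; a miss fetches the line from bulk memory, inserts it into the
# arrays, and updates the directory accordingly.
def cache_access(directory, arrays, line_address, fetch_from_memory):
    """directory/arrays: dicts keyed by line address (simplified model)."""
    if line_address in directory:          # directory "look-up"
        return arrays[line_address]        # cache hit: data read from the array
    # cache miss: fetch the line from bulk memory and insert it first
    data = fetch_from_memory(line_address)
    directory[line_address] = True         # directory updated accordingly
    arrays[line_address] = data
    return data
```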
Since a cache only has capacity for a limited number of line entries and is relatively small compared with the bulk memory, replacement of existing line entries is often needed. The replacement of cache entries in a set associative cache is normally based on algorithms like the Least-Recently-Used (LRU) scheme. That is, when a cache line entry needs to be removed to make room for (be replaced by) a new line, the line entry that was least recently accessed will be selected.
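The Least-Recently-Used replacement scheme described above can be sketched for a single congruence class. The fixed associativity and the use of an ordered dictionary are illustrative assumptions, not details of any particular implementation.

```python
from collections import OrderedDict

# Minimal sketch of LRU replacement within one congruence class: when the
# class is full, the line entry that was least recently accessed is replaced.
class CongruenceClass:
    def __init__(self, associativity):
        self.assoc = associativity
        self.lines = OrderedDict()         # oldest (LRU) entry kept first

    def access(self, tag):
        if tag in self.lines:              # hit: entry becomes most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.assoc:  # class full: evict the LRU entry
            self.lines.popitem(last=False)
        self.lines[tag] = None             # insert the new line as the MRU entry
        return False
```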
In order to facilitate efficient implementations, a cache is normally structured as a 2-dimensional table. The number of columns is called the set-associativity, and each row is called a congruence class. (This row/column designation is traditional. However, the Liu patents, supra, rotate the cache 90 degrees and interchange the row and column designations.) For each data access, a congruence class is selected using certain address bits of the access, and the data may be accessed at one of the line entries in the selected congruence class if it hits there. It is usually too slow to have the cache directory searched first (with parallel address compares) to identify the set position (within the associated congruence class) and then to have the data accessed from the arrays at the found location. Such sequential processing normally requires two successive machine cycles to perform, which degrades processor performance significantly.
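The selection of a congruence class from address bits can be illustrated as below. The line size and number of congruence classes are assumed figures chosen only for the example; the bits just above the line offset index the class.

```python
# Illustrative congruence-class selection for a set-associative cache.
LINE_SIZE = 128      # bytes per cache line (assumption for this sketch)
NUM_CLASSES = 128    # number of congruence classes (assumption)

def congruence_class(address):
    # Discard the offset within the line, then take the low-order bits
    # of the line address as the congruence-class index.
    return (address // LINE_SIZE) % NUM_CLASSES
```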
Another aspect that complicates cache design is the commonly employed virtual addressing architecture in computer systems. In a virtual memory system (e.g., the IBM System/390 architecture) each user process may have the view that it has its own virtual address space. Upon execution of programs the operating system may dynamically allocate real memory pages (e.g., 4 kilobytes per page) to more actively accessed virtual address pages. When a page accessed from a program does not have a real memory page allocated for it, an exception (page fault) condition will occur and trigger the operating system to properly allocate a real memory page frame. Page fault processing is normally associated with a very high performance overhead and often requires accessing data from slower backing devices like disks. However, due to the strong nature of program locality, a reasonable operating system can maintain a very low page fault rate during program executions. The operating system normally maintains the real page allocation information in architecturally specific software tables. Typically a 2-level translation table structure with segment and page tables is used for this purpose. Each program space has its own segment table, in which each entry points to a page table. At a page table each entry records the real page allocation information, plus some other status tags needed for particular architectures.
The operating system manages such translation tables according to its design algorithms. One consequence of the employment of virtual addressing is that the same virtual page address from different program address spaces may not be logically related and may be allocated at different real page frames in the storage. Furthermore, in architectures like that of the IBM System/390, the same real page frame may be accessed through different virtual addresses from different programs or processors.
With all these architectural requirements, most systems require a step called virtual address translation for processing storage accesses from processors. Virtual address translation translates a virtual page address into a real page address. A page fault exception is triggered if the real page frame is not allocated, for which the operating system will update the translation information when allocation is complete and then allow the faulted program to resume execution.
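A simplified model of the 2-level segment/page table walk described above follows. The tables are modeled as dicts, the pages-per-segment figure is an assumption, and a missing entry stands in for an unallocated real page frame that would raise a page fault exception.

```python
PAGE_SIZE = 4096      # 4-kilobyte pages, as in the example above

class PageFault(Exception):
    """Raised when no real page frame is allocated for the virtual page."""

# Sketch of virtual address translation through segment and page tables.
def translate(segment_table, virtual_address):
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    segment_index, page_index = divmod(page_number, 256)  # 256 pages/segment (assumption)
    page_table = segment_table.get(segment_index)
    if page_table is None or page_index not in page_table:
        raise PageFault(virtual_address)   # operating system must allocate a frame
    return page_table[page_index] * PAGE_SIZE + offset
```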
In most modern systems, hardware facilities are used for speeding up the virtual address translation process. Typically a translation lookaside buffer (TLB) is employed for each processor. A TLB is a hardware directory table that records the translation information for actively accessed virtual pages. Due to the locality nature of program addressing, a relatively small TLB can capture the translation information for a great majority of storage accesses from a processor. Only upon a TLB miss condition, when the TLB cannot cover the particular storage access, will a slower translation, using the translation tables from memory, be activated. For efficiency in hardware implementation, a TLB is normally structured as a set-associative table like a cache directory. For a given virtual page address (including certain program space identifiers), the hardware uses certain address bits (and other information specific to a particular design) to derive a congruence class. Within the congruence class, the hardware performs a parallel search of the entries and identifies the results of translation.
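The set-associative TLB lookup described above can be sketched as follows. The number of TLB congruence classes is an assumption, and the loop stands in for the parallel search of entries that hardware performs.

```python
TLB_CLASSES = 64     # number of TLB congruence classes (assumption)

# Sketch of a TLB lookup: address bits derive a congruence class, and the
# entries within that class are searched (in parallel, in hardware).
def tlb_lookup(tlb, virtual_page):
    """tlb: list of TLB_CLASSES lists of (virtual_page, real_page) pairs."""
    congruence_class = virtual_page % TLB_CLASSES
    for vpage, rpage in tlb[congruence_class]:
        if vpage == virtual_page:
            return rpage                 # TLB hit: translation identified
    return None                          # TLB miss: slower table walk needed
```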
In many processor designs, a storage access needs to go through TLB translation prior to the final resolution of a cache access. In a set associative design, the TLB look-up is carried out in parallel with the cache directory search, with the results merged for final late-select of the array data. This requirement of multiple directory searches for the resolution of cache accessing is a source of complexity for the optimization of many cache designs. When a processor issues a storage access request, the cache arrays cannot determine the exact location of the data without the results of the conventional directory lookups, which introduces signal delay. As a result of conventional late-select mechanisms, wasteful array I/O's must be used to retrieve multiple data units for the possible final selection of at most one data unit for the processor's use. Consequently, concurrent multiple, independent cache accesses in very high performance computers are difficult to support. See, for example, the design of the IBM/3090 4-way set-associative cache with 128 congruence classes as described in Liu U.S. Pat. No. 5,418,922, which requires several comparators to resolve synonym conditions efficiently, and a long cache access path due to the waiting of late-select for directory read/compare results.
Cache designs that employ prediction methods to avoid these inefficiencies of the prior art have been proposed. These include a direct-map approach and a MRU-cache approach.
The direct-map design is one with a set-associativity of 1, with only one line entry at each cache congruence class. The direct-map design achieves ultimate simplicity and prediction accuracy by flattening the physical cache geometry to a 1-dimensional structure, but in so doing usually increases cache misses.
The MRU-cache approach manages cache line replacement and set selection on a per congruence class basis. Within each congruence class, there is a most-recently-used (MRU) entry and a least-recently-used (LRU) entry. Due to program locality, a cache access is most likely to hit to the MRU line entry. The LRU entry is the one chosen for replacement when a new line needs to be inserted into the congruence class. Thus, typically, the MRU handles set selection, and the LRU handles line replacement.
The slot MRU-cache approach logically views the MRU lines of the whole cache as a direct-map cache: whenever a congruence class is determined to be accessed, the data will be retrieved from the MRU line entry on a prediction basis. It is a real-address cache, in which a missed line will be inserted into the congruence class associated with the real address bits (after translation.)
For a given virtual address, the real address bits at the MRU entry of the associated TLB congruence class are read out and passed to the array control before the real address is verified as the actual translation. The predicted real address bits from the TLB are used by the array control to select the predicted congruence class. The MRU-cache approach requires that the predictions for TLB and cache be based on the physical MRU entries, and hence loses accuracy and causes difficulties in implementation.
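The prediction step described above can be sketched as follows. This is an illustrative model only: the congruence class is represented as a list ordered MRU-first, and the directory compare that confirms or rejects the prediction is reduced to a tag comparison.

```python
# Sketch of MRU-based set prediction: data is read speculatively from the
# MRU entry of the selected congruence class, and a later directory compare
# determines whether the prediction was correct.
def predicted_access(congruence_class_entries, tag):
    """congruence_class_entries: list of (tag, data), ordered MRU-first."""
    mru_tag, mru_data = congruence_class_entries[0]
    prediction_correct = (mru_tag == tag)  # verified by the directory compare
    return mru_data, prediction_correct    # wrong prediction => abort and reissue
```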
Liu U.S. Pat. No. 5,418,922 describes set prediction and Liu U.S. Pat. No. 5,392,410 describes a system using real address translation prediction for shortening the cache access critical path. A history table is used to predict virtual address translation information for a given access in a set-associative cache. When the prediction is correct, the performance is the same as if the location is known. In the case of a wrong prediction, the data access is aborted and reissued properly. However, Liu does not teach a mechanism for generating the address used to access the history table, and there is a need for a method to create that address with a minimum circuit delay.
In U.S. Pat. No. 5,392,410, Liu also describes a hashing function, which occurs after computation of the address, to randomize the entries in the history table. As this introduces an extra delay in the critical path, there is a need for a method of performing the hashing function that avoids that delay.
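The serial arrangement criticized above can be illustrated as below: the effective address is fully computed first, and only then hashed to index the history table, so the hash delay adds directly to the critical path. The table size and the particular XOR-shift hash are assumptions chosen for the sketch, not the patented function.

```python
TABLE_SIZE = 256     # history-table entries (assumption for this sketch)

# Sketch of hashing performed AFTER effective address generation, which
# places the hash delay in series with the address adder.
def history_index(base, displacement):
    effective_address = base + displacement                # address generation (adder)
    hashed = effective_address ^ (effective_address >> 8)  # hash after the add
    return hashed % TABLE_SIZE
```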
It is an object of the invention to improve cache operation by completely overlapping real address prediction with effective address generation.
It is a further object of the invention to improve cache operation by performing a hashing operation to randomize entries in the MRU without causing delay in the critical path.