The use of caches for performance improvement in computing systems is well known and extensively used. A cache is a high speed buffer which holds recently used memory data. Due to the locality of references by programs (i.e., the tendency for programs to reference locations in memory that have addresses which are close together), most of the accesses to memory data may be accomplished by access to a cache, in which case slower accessing to bulk memory can be avoided.
In a typical high performance processor design, the cache access path forms the critical path. That is, the cycle time of the processor is limited by how fast cache accesses can be carried out.
A cache may be viewed logically as a 1-dimensional or 2-dimensional table of data blocks or lines, in which each table entry stores a particular block or line of memory data. Hereinafter the term cache "line" will be used to refer to a cache data storage unit, but the term cache "block" is considered synonymous. The implementation of a cache is normally accomplished through three major functions: a directory function, a memory function (sometimes called the arrays) and a control function. The cache directory contains the address identifiers for the cache line entries, plus other necessary status tags suitable for particular implementations. The cache memory or arrays store the actual data bits (which may represent an operand or an instruction), plus additional bits for parity checking or for error correction as required in particular implementations. The cache control circuits provide necessary logic for management of the cache contents and accessing.
Upon an access to the cache, the directory is looked up to identify whether or not the desired data line resides in the cache. A cache hit results if the desired data line is found in the cache, and a cache miss results otherwise. Upon a cache hit, the data may be accessed from the array if there is no prohibiting condition (e.g., key protection violation).
Upon a cache miss, the data line is normally fetched from the bulk memory and gets inserted into the cache first, with the directory updated accordingly, in order to satisfy the access from the cache. Since a cache only has capacity for a limited number of line entries and it is relatively small compared with the bulk memory, replacement of existing line entries is often needed. The replacement of cache entries is normally based on an algorithm, such as the Least-Recently-Used (LRU) scheme. In the LRU scheme, when a cache line entry needs to be replaced, the line entry that was least recently accessed will be preferred for replacement.
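As a minimal sketch of the LRU replacement just described, the following models a single group of line entries that share replacement state. The class name `LRUSet`, the `ways` parameter and the `fetch_from_memory` callback are illustrative assumptions, not part of any described hardware.

```python
from collections import OrderedDict


class LRUSet:
    """A small group of cache line entries replaced in least-recently-used order."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # line address -> data, least recently used first

    def access(self, line_addr, fetch_from_memory):
        """Return (hit, data). On a miss the line is fetched and inserted first."""
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)  # now the most recently used entry
            return True, self.lines[line_addr]
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # replace the least recently used entry
        data = fetch_from_memory(line_addr)
        self.lines[line_addr] = data
        return False, data
```

A hit only refreshes the recency order; a miss on a full group replaces the entry that was accessed longest ago.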
In order to facilitate efficient implementations, a cache is normally structured as a 2-dimensional table 230 (see FIG. 1). The number of rows is called the set-associativity (e.g., 4-way set-associative in FIG. 1), and each column is called a congruence class. For each data access, a congruence class is selected using certain memory address bits 112 of the access address 250, and the data may be accessed at one of the line entries 116a-d in the selected congruence class 116 if it hits there.
It is often considered too slow to have the cache directory searched first (even with parallel address compares) to identify the set position a, b, c or d (within the selected congruence class 116) and then have the data accessed from the arrays only at the identified location. Such sequential processing normally requires 2 successive machine cycles to perform, which degrades processor performance significantly. A popular approach instead, called late-select, achieves the directory search and array data accessing in one cycle as follows. Consider the fetch of a data unit (e.g., a doubleword) by an execution element. Without knowledge of the exact set position for the access, the array control immediately retrieves candidate data units from the lines at all set positions in the congruence class, while the directory is looked up. Upon a cache hit, the directory control signals the final selection of one of those retrieved data units and sends it to the requesting execution element.
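The late-select idea can be sketched in software as follows. This is only an illustration: in hardware the array read and the directory compare proceed concurrently, whereas here they are simply written one after the other. The function name and the data layout (`directory[cc][way]` holding a line tag or `None`, `arrays[cc][way]` holding a list of doublewords or `None`) are assumptions made for the example.

```python
def late_select_fetch(directory, arrays, cc, tag, dw):
    """Fetch doubleword `dw` of the line tagged `tag` from congruence class `cc`."""
    # Array side: read a candidate doubleword from every set position,
    # without yet knowing which position (if any) holds the desired line.
    candidates = [line[dw] if line is not None else None for line in arrays[cc]]
    # Directory side: compare the tag against every entry of the class.
    matches = [entry is not None and entry == tag for entry in directory[cc]]
    # Late select: the directory result gates out one of the candidates.
    for way, hit in enumerate(matches):
        if hit:
            return candidates[way]
    return None  # cache miss
```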
Another aspect that complicates cache design is the virtual addressing architecture employed in almost all modern computer systems. In a virtual memory system (e.g., IBM/390 architecture) each user process may have the view of its own virtual address space. Upon execution of programs, the operating system dynamically allocates real memory pages (e.g., 4 kilobytes per page) to more actively accessed virtual address pages. When a page accessed from a program does not have a real memory page allocated for it, an exception (page fault) condition occurs and triggers the operating system to allocate a real memory page frame. Page fault processing is normally associated with a very high performance overhead and often requires data accessing from slower backing devices like disks. However, due to strong program locality, a reasonable operating system can maintain a very low page fault rate during program execution.
The operating system normally maintains the real page allocation information in architecturally specific software tables. Typically a 2-level translation table structure with segment and page tables is used for this purpose. Each program space has its own segment table, in which each entry points to a page table. At a page table, each entry records the real page allocation information, plus any status tags needed for a particular architecture. The operating system manages such translation tables according to its design algorithms.
One consequence of the employment of virtual addressing is that the same virtual page address from different program address spaces may not be logically related and may be allocated at different real page frames in the storage. Furthermore, in architectures like IBM/390, the same real page frame may be accessed through different virtual addresses from different programs or processors.
With all these architectural requirements, most systems require a step called virtual address translation for processing storage accesses from processors. Virtual address translation translates a virtual page address into a real page address. A page fault exception is triggered if the real page frame is not allocated, for which the operating system will update the translation information when allocation is complete and then allow the faulted program to resume execution.
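A minimal sketch of a 2-level (segment table, page table) translation with a page fault exception is shown below. The 4 kilobyte page size comes from the example given earlier; the figure of 256 pages per segment, the dictionary-based tables and the names `translate` and `PageFault` are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # 4 kilobytes per page, as in the example above


class PageFault(Exception):
    """Raised when no real page frame is allocated for the accessed page."""


def translate(segment_table, vaddr, pages_per_segment=256):
    """Translate a virtual address to a real address via segment and page tables.

    segment_table maps a segment index to a page table; a page table maps
    a page index to a real page frame number. pages_per_segment is assumed.
    """
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    seg_index, page_index = divmod(vpage, pages_per_segment)
    page_table = segment_table.get(seg_index)
    if page_table is None or page_table.get(page_index) is None:
        raise PageFault(vaddr)  # the operating system would allocate a frame here
    return page_table[page_index] * PAGE_SIZE + offset
```

After the operating system updates the tables for the faulting page, retrying the same `translate` call succeeds, which corresponds to the faulted program resuming execution.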
In most modern systems hardware facilities are used for speeding up the virtual address translation process. Typically a Translation Lookaside Buffer (TLB) is employed for each processor. A TLB is a hardware directory table that records the translation information for a set of actively accessed virtual pages. Due to program locality, a relatively small TLB (e.g., with 64-1024 page entries) can capture the translation information for a great majority (e.g., over 99.95%) of storage accesses from a processor. Only upon a TLB miss condition (i.e., when the TLB cannot cover the particular storage access) will a slower translation process (e.g., through microcode or operating system code) be activated.
For efficiency of hardware implementation, a TLB is normally structured as a set-associative table like the cache directory. For a given virtual page address (including one or more program space identifiers), the hardware uses certain address bits (and other information specific to a particular design) to derive a congruence class. Within the TLB congruence class, the hardware performs a parallel search of the entries and identifies the results of translation if there is a hit.
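The set-associative TLB organization can be sketched as follows. The choice of low-order virtual page bits for congruence class selection (`vpage % num_classes`) and the LRU ordering within a class are assumptions of this example; a real design may use other bits and other design-specific information.

```python
class TLB:
    """Set-associative TLB; each congruence class is kept in MRU-first order."""

    def __init__(self, num_classes, ways):
        self.num_classes = num_classes
        self.ways = ways
        # Each class holds (space_id, vpage, real_page) entries.
        self.classes = [[] for _ in range(num_classes)]

    def lookup(self, space_id, vpage):
        """Return the real page on a hit, or None on a TLB miss."""
        cc = self.classes[vpage % self.num_classes]
        for i, (sid, vp, real_page) in enumerate(cc):
            if sid == space_id and vp == vpage:
                cc.insert(0, cc.pop(i))  # the hit entry becomes the MRU entry
                return real_page
        return None

    def insert(self, space_id, vpage, real_page):
        """Record a translation, replacing the LRU entry of a full class."""
        cc = self.classes[vpage % self.num_classes]
        if len(cc) >= self.ways:
            cc.pop()  # the last entry is the least recently used
        cc.insert(0, (space_id, vpage, real_page))
```

The program space identifier takes part in the match, so the same virtual page number from two address spaces yields two distinct entries.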
In many processor designs a storage access needs to go through a TLB translation prior to the final resolution of the cache access. For example, in the IBM/3090 system design, the TLB look-up is carried out in parallel with the cache directory search, with their results merged for final late-select of the array data. FIG. 2 depicts such a design. The address of the memory access requested by a processor I/E-unit is called a logical address. Depending upon the particular addressing mode (real or virtual) of the current processor status, the logical address can be a real address or a virtual address. For the same physical data, the congruence class selected through use of a real address can be different from the congruence class selected through use of a virtual address. Furthermore, in some architectures (e.g., IBM/390), the same physical memory page can be concurrently accessed through arbitrarily different virtual (page) addresses at different cache congruence classes. In the IBM/3090 system design, although cache lines are placed primarily based on the logical address bits of the processor access that caused the cache miss line fetch, many comparators are used for the directory search of each cache access in order to determine in a timely manner the possibility of cache hit(s) to various alternative cache locations.
This requirement for multiple directory searches for the resolution of cache accessing has been a source of complexity for the optimization of many cache designs. One approach for avoiding the synonym ambiguity between virtual and real addresses is to supply only real addresses to the cache access control. Such an approach, however, normally requires an access to a TLB in order to retrieve translation information for any virtual address, and accessing a TLB is usually a slower path. In modern computers relatively sizeable TLBs (e.g., with 256-1024 page entries) are used in order to obtain high hit ratios, and hence it is rather expensive and difficult to implement TLBs with fast circuits (e.g., using shift-register latches, or using ECL circuits in BiCMOS designs). In addition, the size of TLBs often prevents the placing of the TLB physically close to the critical components in the cache access path and results in delays in signal passing. Consequently, the approach of only supplying real addresses for cache accessing is often prohibited in implementation due to constraints on critical path timing.
There have been many design proposals for implementing caches effectively. FIG. 3 outlines the IBM/3090 design of a 64 kilobyte (KB) processor cache for 31-bit logical addressing. This cache is 4-way set-associative with 128 congruence classes. The line size is 128 bytes. There is a cache directory 220, cache memory data arrays 230 and a 2-way set-associative TLB 210. The processor I/E-units and microcode issue a storage access by a logical address. The logical address 250 can be either virtual or real, depending upon the current mode of addressing at the processor.
The more complicated case for a doubleword (8 bytes) fetch request with a virtual address will be described. Seven bits 18-24 are used for selecting the congruence class. Of these seven bits, two bits 18 and 19 are part of the page address. It can happen that, due to unpredictable translation results, these two bits can get translated to 2 real address bits in any of the four possible combinations. Among the four congruence classes that may possibly contain the data line being accessed, the one determined by the address bits in the currently accessed logical address is called the principal congruence class (PCC), and the other three are called synonym congruence classes.
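The bit fields above can be worked through in a short sketch. It assumes IBM-style bit numbering of a 32-bit word, in which bit 0 is the most significant bit and bit 31 the least significant; this is consistent with the 128-byte line occupying bits 25-31. The helper names `bits`, `principal_class` and `candidate_classes` are illustrative only.

```python
def bits(addr, first, last):
    """Extract bits first..last of a 32-bit word (bit 0 is most significant)."""
    width = last - first + 1
    return (addr >> (31 - last)) & ((1 << width) - 1)


def principal_class(logical_addr):
    """Congruence class selected by bits 18-24 of the logical address."""
    return bits(logical_addr, 18, 24)


def candidate_classes(logical_addr):
    """Principal class first, then the three synonym congruence classes.

    Bits 20-24 lie within the page offset and are unchanged by translation;
    bits 18-19 belong to the page address and may translate to any value.
    """
    low5 = bits(logical_addr, 20, 24)
    pcc = principal_class(logical_addr)
    synonyms = [(hi << 5) | low5 for hi in range(4) if ((hi << 5) | low5) != pcc]
    return [pcc] + synonyms
```

For the logical address 0x3A80, for example, the principal congruence class is 117 and the synonym congruence classes are 21, 53 and 85.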
Although program locality will cause a great majority of cache accesses to hit the principal congruence class, there is still a chance that the accessed line might belong to one of the other (synonym) congruence classes. This is the so-called synonym problem.
In the IBM/3090 system design, the following steps are carried out in parallel:
1. Bits 18-31 of the logical access address are passed to the cache memory array control. Bits 18-24 are used to determine the principal cache congruence class. Then a doubleword (as indicated by bits 25-28) is read out of the cache arrays from each of the four line entries in the principal congruence class. These four doublewords will not be sent out to the requesting I/E-unit until a late-select signal is received.
2. Bits 18-24 are sent to the cache directory for look-up. Each directory entry records the real address for the associated line. All 16 directory entries of the principal and synonym congruence classes are read out.
3. Certain virtual address bits (not elaborated here) are used for the TLB to select the congruence class, from which the real address translation information of the 2 TLB entries is read out.
The 16 real line addresses read out of the cache directory are then merged with the 2 real addresses read out of the TLB for address match via 32 comparators. (There is other tag matching involved that is not elaborated here.) When it is found that the translated address of the accessed line matches one of the cache directory real addresses, a cache hit condition results. Otherwise a cache miss occurs and triggers a cache miss routine. Upon a cache hit, the congruence class containing the line may or may not be the principal congruence class. The following is then carried out by the cache control:
Principal Congruence Class (PCC) Hit--A signal is sent to the late-select logic to gate the selected doubleword on a bus to the requesting I/E-unit. The access is complete.
Synonym Congruence Class Hit--Proper steps are taken to have the doubleword accessed from the synonym congruence class through later array fetching. This will result in longer delays to the access.
In a cache miss situation the cache control will request a copy of the line from main storage. When the line comes back it will be placed in an allocated cache entry in the principal congruence class.
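The merging of directory and TLB read-outs and the resulting outcomes (PCC hit, synonym hit or miss) can be sketched as below. This is a rough software model, not the actual hardware logic: the function name `classify_access`, the data layout and the `tlb_matches` flag (standing in for the TLB's own virtual address compare and the other tag matching not elaborated in the text) are assumptions.

```python
def classify_access(directory, classes, tlb_entries):
    """Classify an access as a PCC hit, a synonym hit or a miss.

    directory[cc] lists the real line addresses held in congruence class cc.
    classes[0] is the principal congruence class; the rest are synonym classes.
    tlb_entries is a list of (tlb_matches, real_line_addr) pairs read from the TLB.
    Returns ((kind, cc, way), number_of_comparisons).
    """
    comparisons = 0
    result = ("miss", None, None)
    for cc in classes:
        for way, dir_real in enumerate(directory[cc]):
            for tlb_matches, tlb_real in tlb_entries:
                comparisons += 1  # one comparator per (directory entry, TLB entry) pair
                if tlb_matches and dir_real == tlb_real and result[0] == "miss":
                    kind = "pcc_hit" if cc == classes[0] else "synonym_hit"
                    result = (kind, cc, way)
    return result, comparisons
```

With 4 candidate congruence classes of 4 entries each and 2 TLB entries, the loop performs 4 x 4 x 2 = 32 comparisons, matching the 32 comparators mentioned above.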
A major complexity in the IBM/3090 cache design is associated with resolution of synonyms. The comparator component CMP 128 utilizes 32 comparators for a timely decision when there is a principal congruence class miss situation. The number of comparators grows linearly with the number of congruence classes and with the set-associativity of the TLB. For instance, if the cache size grows to 256 KB by quadrupling the number of congruence classes and if the set-associativity of the TLB increases to 4, the total number of comparators required in the CMP unit will increase to a rather impractical 256.
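The comparator counts quoted above can be reproduced with a small calculation. The formula is an inference from the figures in the text rather than something stated there: the number of candidate congruence classes is taken as 2 raised to the number of class-selection bits that fall within the page address, and one comparator is assumed per pairing of a candidate directory entry with a TLB entry. The function name is hypothetical.

```python
def comparator_count(class_bits_in_page_addr, cache_ways, tlb_ways):
    """Comparators needed to match all candidate directory entries with all TLB entries."""
    candidate_classes = 2 ** class_bits_in_page_addr  # principal plus synonym classes
    return candidate_classes * cache_ways * tlb_ways


# 64 KB cache: 2 class-selection bits in the page address, 4-way cache, 2-way TLB.
small = comparator_count(2, 4, 2)   # 32
# 256 KB cache: 4 times the classes adds 2 such bits (4 in total), with a 4-way TLB.
large = comparator_count(4, 4, 4)   # 256
```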
As discussed earlier, implementation of a real address based cache often suffers from the slow path of resolving real address translation through a TLB. A recently proposed approach for implementing a real-address based cache is the MRU-cache design described by J. H. Chang, H. Chao and K. So in "Cache Design of A Sub-Micron CMOS System/370," Proc. 14th Symposium on Computer Architecture, at pp. 208-213 (1987), which is depicted in FIG. 4.
In the MRU-cache design approach, upon a cache access with a virtual address, the selection of the real address congruence class is based on a certain prediction. The TLB is normally structured as a typical set-associative directory, in which replacements are managed on a per congruence class basis. For each virtual address, the associated congruence class in the TLB is determined by certain (typically higher order) bits in the page address portion. Within each congruence class, there is typically a most-recently-used (MRU) entry and a least-recently-used (LRU) entry as indicated by a replacement status tag. The LRU entry is the one chosen for replacement when a new page entry needs to be inserted into a congruence class. Due to the program locality characteristic, successive virtual addresses issued from the processor are likely to hit the MRU entries in TLB. The MRU-cache approach utilizes such locality behavior and predicts that the translation of a virtual address will be from the MRU entry in the associated congruence class of the TLB.
Key aspects of real address prediction in the MRU-cache approach are as follows. For a given virtual address issued for a cache access by the processor, the TLB congruence class is determined (via certain virtual page address bits) as usual, and the real page address bits associated with the MRU entry of the congruence class are read out as a prediction. Among the retrieved real address bits, certain bits necessary for determining the (real) cache congruence class are sent out to the cache array control for cache data retrieval. In the meantime the virtual address is compared with all the entries in the TLB congruence class as usual to precisely determine the actual translation results. Due to the high hit ratio on MRU entries in a TLB, the cache congruence class selections based on this prediction are most likely correct. Upon an incorrect prediction, as determined from TLB compares later during the cache access cycle, the current cache access is simply aborted and re-issued in a subsequent cycle. When a cache access is re-issued due to the incorrect prediction of the cache congruence class selection, the TLB can supply the correctly translated real bits for a correct cache congruence class selection.
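The prediction and re-issue behavior can be sketched as follows. This is a simplified model under stated assumptions: the TLB congruence class is a list of `(space_id, vpage, real_page)` entries in MRU-first order, the cache congruence class is selected by the `class_bits` low-order bits of the real page address, and a prediction counts as incorrect whenever those bits differ from the actual translation. The function name and parameters are illustrative and are not taken from the cited paper.

```python
def mru_cache_access(tlb_class, space_id, vpage, class_bits=2):
    """Return (real_page, reissued) for an access, or (None, None) on a TLB miss."""
    if not tlb_class:
        return None, None
    mask = (1 << class_bits) - 1
    # Prediction: take the class-selection bits from the MRU entry right away,
    # so that the cache array access can start before the TLB compares finish.
    predicted_bits = tlb_class[0][2] & mask
    # Meanwhile, the full compare over the TLB congruence class finds the actual translation.
    actual = next(
        (rp for sid, vp, rp in tlb_class if sid == space_id and vp == vpage), None
    )
    if actual is None:
        return None, None  # TLB miss: the slower translation process takes over
    # On an incorrect prediction the access is aborted and re-issued with the correct bits.
    reissued = (actual & mask) != predicted_bits
    return actual, reissued
```

An access that hits the MRU entry is never re-issued; an access that hits another entry is re-issued only when its class-selection bits differ from those of the MRU entry.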
Compared with conventional real address cache approaches, the MRU-cache approach allows slightly faster selection of cache congruence classes by not waiting for the final compare results from the TLB. However, the MRU-cache design suffers from the following drawbacks. First of all, the MRU-cache requires retrieval of real address bits from the TLB in the cache access critical path. Prior to the reading of such real address bits, the slow cache array access cannot be started.
One possible alternative approach in relieving this timing burden on the cache access cycle might be to move the reading of the TLB/MRU information to an earlier cycle (typically the logical address generation cycle). However, this would simply pass the timing problem on to the earlier cycle instead.
A second problem for the MRU-cache approach is related to the accuracy of prediction in certain designs. The accuracy of prediction relies on the probability of hits to MRU entries in the TLB for virtual addresses. Consider a TLB with 256 page entries. The miss ratio for the entire TLB could be significantly reduced (e.g., by over 20%) if the TLB set-associativity were to be increased from 2 to 4, particularly for programs which are causing severe congestion in a few TLB congruence classes. With a 4-way set-associative TLB there are only 64 MRU page entries. In a typical commercial workload the hit probability of operand accesses (fetches and stores) to these 64 MRU entries in the TLB is below 97.5%. Such a hit probability may be improved (e.g., to around 99%) when the TLB set-associativity decreases to 2, at the expense of more misses for the entire TLB.
It would be desirable to be able to obtain some or all of the real address bits for a virtual page address faster, more accurately and with more cache accessing flexibility than is provided with this MRU-cache approach.