1. Field of the Invention
The present invention generally relates to computer memory systems and, more particularly, to a memory hierarchy which comprises a level one (L1) cache with an access/cycle time equal to or faster than the processor cycle time and an L2 cache consisting of a directory and a data array, in which the L2 directory is accessed upon a miss to the L1 cache, and the L2 cache requests a block reload from the L3 cache of the hierarchy if a miss to the L2 cache occurs. The L2 cache is a DRAM, which can be many times larger than an SRAM implementation in the same technology, built in a manner that does not compromise overall system performance.
2. Background Description
An L1 cache is typically implemented in static random access memory (SRAM) technology for speed. L2 caches are typically also implemented in SRAM technology for speed, but the high cost limits the L2 capacity. A larger capacity could be obtained by using dynamic random access memory (DRAM) technology, but the compromise in speed is usually not acceptable for high performance systems.
An L2 cache must interface to a higher-level L1 cache to supply blocks (lines) for reloading any L1 misses, and simultaneously interface to a lower-level L3 cache when misses occur within the L2 cache. (In a multi-processor configuration, there are other interfaces as well.) These interfaces can all take place on one bus, or on independent buses, with the latter giving substantially better performance. While this disclosure is applicable to any of the above cases, for simplicity, it will be embodied in a uniprocessor system with independent bus interfaces to the L1 and L3 levels of the memory hierarchy.
An L2 cache which interfaces to a high-performance processor L1 cache for the required L1 reloads, and to an L3 or main memory for the L2 accesses/reloads, has special requirements in terms of both speed and organization. Typically, all levels of a memory hierarchy below the L1 cache (L1 is considered the highest level, closest to the processor) will access the data arrays mainly on a block boundary. A miss in the L1 cache will request a reload of a full L1 cache line (block) which can be 64 to 256 bytes, with 128 being a typical, current value. Similarly, misses in the L2 cache will request the reload of a block (line) from the L3 cache or main memory and so on. In cases where a store-through policy is used, an individual word, double word, or other logical unit smaller than a block or line is stored in the L2 cache whenever a store is performed in the L1 cache, but this policy is seldom used and does not change the need for a block access on a miss/reload. In order to maintain high hit ratios and high performance (short reload start time), such an L2 cache would desirably be organized as a four-way, set-associative, late-select cache. But such an organization presents several fundamental problems. A typical late-select, set-associative cache accesses the congruence class of the directory simultaneously with the congruence class of the array. This means that four entries, i.e., four virtual addresses with other appropriate bits, are accessed out of the directory array, the virtual address compares are done on the periphery of the array for a match, and a late-select signal is generated, corresponding to the match.
Simultaneously with the directory access, the data array is accessed for four blocks (lines) which correspond to the same four entries accessed from the directory. These four blocks (lines) are latched in data-buffers at the edge of the data array. The late select signal then selects the line which corresponds to the matched virtual address. The access time requirements are that the directory access, compares, and late-select signal must be completed before or at the same time that the four lines are latched at the edge of the data array. Since the data array will be much larger than the directory, it will be much slower, so the late select signals from the directory are usually ready by the time the data array has latched the four blocks from the congruence class.
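The late-select access described above can be illustrated with a small behavioral sketch. It is not taken from this disclosure: the class and method names, the number of congruence classes, and the way the address is split into a tag and a congruence class index are all assumptions chosen for illustration, and the sequential Python statements only stand in for hardware steps that occur in parallel.

```python
from dataclasses import dataclass

WAYS = 4          # four-way set associativity, as in the text
LINE_BYTES = 128  # typical line size cited in the text
NUM_SETS = 256    # hypothetical number of congruence classes


@dataclass
class DirEntry:
    valid: bool = False
    tag: int = 0  # stands in for the stored address bits that are compared


class LateSelectCache:
    """Behavioral model only; timing and parallelism are not modeled."""

    def __init__(self):
        self.directory = [[DirEntry() for _ in range(WAYS)]
                          for _ in range(NUM_SETS)]
        self.data = [[bytes(LINE_BYTES) for _ in range(WAYS)]
                     for _ in range(NUM_SETS)]

    @staticmethod
    def split(address):
        # Assumed address layout: | tag | congruence class | byte offset |
        line_addr = address // LINE_BYTES
        return line_addr // NUM_SETS, line_addr % NUM_SETS

    def fill(self, address, way, line):
        tag, index = self.split(address)
        self.directory[index][way] = DirEntry(True, tag)
        self.data[index][way] = line

    def lookup(self, address):
        tag, index = self.split(address)
        # In hardware these two reads start at the same time.
        dir_entries = self.directory[index]      # four directory entries
        latched_lines = list(self.data[index])   # four lines latched at array edge
        # Compares at the directory periphery produce the late-select signal.
        late_select = [e.valid and e.tag == tag for e in dir_entries]
        for way, hit in enumerate(late_select):
            if hit:
                return latched_lines[way]        # late select picks one of four
        return None                              # miss: reload from the next level


cache = LateSelectCache()
line = bytes([7]) * LINE_BYTES
cache.fill(0x40080, way=2, line=line)
hit = cache.lookup(0x40080 + 5)                        # same line, different byte
miss = cache.lookup(0x40080 + NUM_SETS * LINE_BYTES)   # same class, different tag
```

Note that all four lines of the congruence class are read out on every access, whether or not any of them matches, which is what drives the data path width discussed in this section.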
The above organizations apply equally to SRAM and DRAM L2 designs. However, the wide data paths needed are actually easier to obtain with DRAM than with SRAM arrays. For instance, DRAMs already have many more sense amplifiers on-chip than do typical SRAMs. (DRAMs, being destructive read-out devices, require one sense amplifier for each bit line on any word line. Each bit along a word line is read and must be sensed and regenerated.) On SRAM chips, the sense amplifiers are typically much larger, to provide speed, and require a larger "pitch" spacing (each encompasses more bit line pitches) as well as considerably more power. Thus, DRAM designs have at least this inherent advantage, although speed is compromised. However, even though DRAM chips have this inherently large data path width, the data path size is still a key issue in the overall chip design and layout.
One difficulty is that accessing a full congruence class of four lines from the data array requires a very large data path out of the array. For instance, a 128 byte L2 block (line) in a four-way, set-associative, late-select organization would require accessing a 128 × 8 × 4 = 4K bit data path out of the DRAM array. While doable, the resulting structure is somewhat large, and presents a number of difficulties in terms of array/island organization, length of word lines and bit lines, power, and on-chip busing, to name a few. A 256 byte line would obviously require twice this, or an 8K bit data path out of the DRAM array, which further compounds the difficulties. The problem is even worse for SRAM designs, so set-associative, late-select organizations are rarely used for SRAM L2 caches. Rather, direct-mapped SRAM L2 cache organizations are typically used, which compromises performance and is to be avoided if possible.
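The data path arithmetic above can be checked with a short calculation. The function name below is hypothetical and is used only to make the relationship explicit: the width in bits is the line size in bytes, times 8 bits per byte, times the number of lines in the congruence class.

```python
def late_select_data_path_bits(line_bytes, ways=4):
    """Bits read out of the data array when a full congruence class is accessed."""
    return line_bytes * 8 * ways


bits_128 = late_select_data_path_bits(128)  # 4096 bits, i.e. 4K
bits_256 = late_select_data_path_bits(256)  # 8192 bits, i.e. 8K
```

Doubling the line size doubles the required width, as does doubling the associativity, which is why the late-select organization becomes harder to lay out as either parameter grows.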
The fundamental design issue is to provide an L2 cache design and organization which will not compromise speed but still allow a simple chip design using DRAM for the data array.