The present invention relates generally to the field of very large-scale integrated circuits fabricated on a single semiconductor die or chip. More particularly, the invention relates to the field of high-performance cache memories.
For many years, cache memories have been used to maximize processor performance while maintaining reasonable system costs. A cache memory is a very fast buffer comprising an array of local storage cells that is used by a processor to hold copies of frequently requested data. A typical cache memory system comprises a hierarchy of memory structures, which usually includes a local (L1), on-chip cache that represents the first level in the hierarchy. A secondary (L2) cache is often associated with the processor for providing an intermediate level of cache memory between the processor and main memory. Main memory, also commonly referred to as system or bulk memory, lies at the bottom (i.e., slowest, largest) level of the memory hierarchy.
In a conventional computer system, a processor is coupled to a system bus that provides access to main memory. An additional backside bus may be utilized to couple the processor to an L2 cache memory. Other system architectures may couple the L2 cache memory to the system bus via its own dedicated bus. Most often, L2 cache memory comprises a static random access memory (SRAM) that includes a data array, a cache directory, and cache management logic. The cache directory usually includes a tag array, tag status bits, and least recently used (LRU) bits. (Each directory entry is called a “tag”.) The tag RAM contains the main memory addresses of code and data stored in the data cache RAM plus additional status bits used by the cache management logic. By way of background, U.S. Pat. No. 6,115,795 discloses a computer system comprising a processor that includes second level cache controller logic for use in conjunction with an external second level cache memory.
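By way of illustration only, the role of the cache directory described above may be sketched in software as follows. This is a minimal model and not the design of any cited patent: the line size, number of sets, associativity, and all names (`DirectoryEntry`, `CacheDirectory`, `split_address`) are hypothetical, and a single valid bit stands in for the tag status bits.

```python
from dataclasses import dataclass

LINE_BYTES = 64   # assumed cache line size (illustrative)
NUM_SETS = 4      # assumed number of sets (tiny, for illustration)
WAYS = 2          # assumed associativity

@dataclass
class DirectoryEntry:
    tag: int = 0
    valid: bool = False   # stands in for the tag status bits
    lru: int = 0          # larger value means used more recently

def split_address(addr):
    """Split a byte address into (tag, set index); the line offset is dropped."""
    line = addr // LINE_BYTES
    return line // NUM_SETS, line % NUM_SETS

class CacheDirectory:
    def __init__(self):
        self.sets = [[DirectoryEntry() for _ in range(WAYS)]
                     for _ in range(NUM_SETS)]
        self.clock = 0    # access counter used as an LRU timestamp

    def access(self, addr):
        """Return True on a hit; on a miss, fill an invalid or the LRU way."""
        tag, index = split_address(addr)
        self.clock += 1
        ways = self.sets[index]
        for entry in ways:
            if entry.valid and entry.tag == tag:
                entry.lru = self.clock
                return True
        # Prefer an invalid way; otherwise evict the least recently used one.
        victim = min(ways, key=lambda e: (e.valid, e.lru))
        victim.tag, victim.valid, victim.lru = tag, True, self.clock
        return False
```

In this sketch, the directory alone decides whether a request hits; the data array would be indexed by the same set and way once a hit is determined.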
Recent advances in semiconductor processing technology have made possible the fabrication of large L2 cache memories on the same die as the processor core. As device and circuit features continue to shrink as the technology improves, researchers have begun proposing designs that integrate a very large (e.g., multiple megabytes) third level (L3) cache memory on the same die as the processor core for improved data processing performance. While such a high level of integration is desirable from the standpoint of achieving high-speed performance, there are still difficulties that must be overcome.
Large on-die cache memories are typically subdivided into multiple cache memory banks, which are then coupled to a wide (e.g., 32-byte, or 256-bit) data bus. For instance, U.S. Pat. Nos. 5,752,260 and 5,818,785 teach interleaved cache memory devices having a plurality of banks consisting of memory cell arrays. In a very large cache memory comprising multiple banks, one problem that arises is the large resistive/capacitive (RC) signal delay associated with the long bus lines when driven at a high clock rate (e.g., 1 GHz). Thus, there is a need for some sort of repeater device to connect each bank of cache memory to the data bus without loss of signal integrity.
One traditional method for sharing a bus is to have each circuit utilize a tri-state driver in order to connect to the bus. Tri-state driver devices are well known in the prior art. A conventional tri-state driver comprises two transistor devices coupled in series to pull the output to either a high or low logic level. The third output state is a high impedance (i.e., inactive) state.
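For illustration, the three output states of a tri-state driver and the sharing of one bus line among several drivers may be modeled as in the following sketch. The names (`Z`, `tri_state`, `resolve_bus`) are hypothetical, and the model is purely logical; it does not capture drive strength or loading.

```python
Z = None  # high-impedance (inactive) state

def tri_state(data, enable):
    """Drive a logic 0 or 1 when enabled; otherwise present high impedance."""
    if not enable:
        return Z
    return 1 if data else 0

def resolve_bus(outputs):
    """Resolve one shared bus line from the outputs of all attached drivers."""
    driven = [v for v in outputs if v is not Z]
    if not driven:
        return Z  # no driver is enabled, so the line floats
    if len(set(driven)) > 1:
        # Two enabled drivers pulling in opposite directions.
        raise ValueError("bus contention")
    return driven[0]
```

As a usage example, `resolve_bus([tri_state(1, False), tri_state(0, True)])` yields 0, because only the second driver is enabled.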
When a tri-state driver is utilized to connect to a bus, the two series-connected output devices of the driver need to be large so as to provide adequate drive strength to the long bus wire. This requirement, however, makes it difficult to use tri-state drivers as repeaters in a multi-megabyte on-die cache memory because the large source/drain diode of the output devices adds considerable load to the bus. The additional load attributable to the tri-state drivers increases bus power and causes significant resistive/capacitive (RC) signal delay. Another drawback of using tri-state drivers as repeaters is the need for decoding circuitry for the drivers. This decoding circuitry is in addition to the decoding circuitry already required for the cache memory banks.
The requirement of sharing the data bus between banks in a large cache memory also creates timing difficulties. The subarrays within a bank may be placed close enough together that the individual bits of the bus have about the same timing. However, the cache banks themselves are often located at various physical distances from the receiver or central location on the die that provides a point for information transfer to the processor core. This means that the relative signal timing of data to/from each bank may be very different.
For example, one bank may be located far from the core (or some central location on the die that provides a point for information transfer between the processor and the cache) whereas another bank may be located adjacent to the core. The farther bank would incur a significant signal delay due to the RC nature of the metal lines whereas the nearer bank would not. In other words, some data bits travel a long distance and have a long delay, while other data bits travel a short distance and have a short delay to reach the receiver. At high processor speeds and with very large cache sizes, the bits from the farthest bank can arrive at the receiver one or more clock cycles later than the bits from the closest bank. That is, even though data is sent/received synchronously with the clock, the RC delay of the long metal lines can prevent the data signals from traversing the distance between a bank and the core in a single clock cycle.
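The dependence of arrival time on distance may be illustrated with a rough calculation. The sketch below assumes an unrepeated wire modeled as a distributed RC line, for which the 50% delay is commonly approximated as 0.38·r·c·L², so that delay grows with the square of the length. The per-millimeter resistance and capacitance values and the function names are illustrative assumptions and are not taken from any particular process.

```python
import math

def wire_delay_ps(length_mm, r_ohm_per_mm=100.0, c_pf_per_mm=0.2):
    """Approximate 50% delay of an unrepeated distributed RC wire.

    The r and c defaults are assumed, illustrative values.
    Ohms times picofarads gives picoseconds.
    """
    return 0.38 * r_ohm_per_mm * c_pf_per_mm * length_mm ** 2

def cycles_to_arrive(length_mm, clock_ghz=1.0):
    """Whole clock cycles needed for a signal to traverse the wire."""
    period_ps = 1000.0 / clock_ghz
    return max(1, math.ceil(wire_delay_ps(length_mm) / period_ps))

# A bank near the core versus a bank far from it (assumed distances).
near_cycles = cycles_to_arrive(2.0)
far_cycles = cycles_to_arrive(15.0)
```

Under these assumed values, the near bank's data arrives within a single 1 GHz clock cycle while the far bank's data needs more than one, which is the timing mismatch described above.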
Very large on-die caches also present further difficulties in the implementation of redundant storage elements. In traditional cache designs with redundancy, the redundant array element is read at the same time all the other array elements are read. The selection of which bits are output from the cache is typically controlled through multiplexing. When an array element fails, fuses on the chip are usually blown in order to map out the defective bits and replace them with the redundant element. The drawback of this approach is that if the cache is very large, the amount of multiplexing required becomes very large as well. For example, if the cache outputs 256 bits, then the redundant element needs multiplexing connections so that it can supply data to any one of those 256 bit positions. Naturally, such connections create a large wiring and area overhead.
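The traditional multiplexed redundancy scheme described above may be sketched as follows. This is a simplified model with a single redundant element; the names (`read_with_redundancy`, `fused_position`) are hypothetical, and the fuse setting is represented simply as the index of the defective bit position.

```python
WIDTH = 256  # output width used in the example above

def read_with_redundancy(regular_bits, redundant_bit, fused_position=None):
    """Model one 2:1 multiplexer per output bit.

    regular_bits: values read from the WIDTH regular array elements.
    redundant_bit: value read from the redundant element in the same access.
    fused_position: index of the defective element, or None if no fuse is blown.
    """
    return [redundant_bit if i == fused_position else bit
            for i, bit in enumerate(regular_bits)]

def redundant_mux_connections(width=WIDTH):
    """The redundant element must be routed to every output multiplexer."""
    return width
```

Because the defective position is not known until the part is tested, the redundant element must be wired to all 256 multiplexers even though at most one of them will ever select it.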
Therefore, what is needed is a cache architecture that overcomes the shortcomings of the prior art in the design of a very large, on-die cache memory operating with a high-speed processor core.