A cache memory provides a high speed interface between the Central Processing Unit (CPU) of a computer system and its main memory. As shown in FIG. 1, a smaller and faster cache memory provides the CPU high speed access to a larger, but slower main memory. The cache operates by maintaining a copy of those portions of the main memory's data that are likely to be used by the CPU. If the cache has the requested data, the CPU receives the data without having to incur the delay associated with a read from the main memory. The CPU does not need to know explicitly about the existence of a cache.
The simplified representation of a cache memory in FIG. 1 illustrates the basic concept of a cache. In its basic form, the cache appears to the CPU as if it were the main memory itself. That is, there is a fundamental interface between the processor and the cache memory in which the processor supplies a memory address to the cache and the cache provides the requested data back to the processor. This is the same basic function and interface that would be expected of a directly connected main memory.
Caches are, however, fundamentally different from main memories in some ways. The primary difference is that main memory consists of sufficient random access memory to represent the entire range of allowable memory accesses by the processor, whereas a cache memory accepts accesses across that entire address range while in fact consisting of a much smaller array of actual random access memory. A cache memory therefore relies on the main memory of the computer system to hold the entirety of the system's memory contents. Because a cache memory can only store a subset of the main memory's contents, a mechanism is provided for identifying the portion of the main memory it actually contains. In addition, mechanisms are provided for replacing the cache memory's contents and for maintaining consistency with the main memory.
In the diagram of FIG. 1 there are means for data transfer between the cache and the main memory as well as means for data transfer between the CPU and the cache. While the amount of data transferred between the cache and the CPU is dictated by the instructions executed by the CPU, data transfers between the cache and main memory are made in blocks of fixed size. Each of these memory blocks (also called lines) residing in the cache has an associated identifier (tag) which uniquely associates the block in the cache with its corresponding block in main memory. The cache memory also includes a means for determining whether the data addressed by a CPU memory access is resident in the cache, or whether the access must be serviced by the main memory.
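The relationship between a byte address and the fixed-size block that contains it can be sketched as follows. This is a minimal illustration only: the 32-byte block size and the helper names `split_address` and `block_base` are assumptions chosen for the example and do not come from the figures.

```python
BLOCK_SIZE = 32  # bytes per block (line); an assumed value for illustration


def split_address(address: int) -> tuple[int, int]:
    """Return (block_number, byte_offset) for a byte address."""
    return address // BLOCK_SIZE, address % BLOCK_SIZE


def block_base(address: int) -> int:
    """Return the address of the first byte of the block containing `address`."""
    return address - (address % BLOCK_SIZE)


# Every address inside one block shares the same block number, so the block
# number (or some portion of it) is what a tag has to record.
block_number, byte_offset = split_address(0x1234)
```

With these assumed values, address 0x1234 falls at byte offset 20 of block 145, which begins at address 0x1220. A transfer between the cache and main memory would move all 32 bytes of that block, whatever amount of data the CPU requested.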
FIG. 2 shows a diagram of the primary components of a cache memory implementation. Standard cache memory systems are implemented with two distinct memory components: a tag memory element and a data memory element. As noted above, the data element contains copies of blocks of data from the system main memory. The tag element contains an identifier for each block in the data element. The tag identifies the data block by the address used to access the data in the main memory. When the CPU seeks to access an instruction or data in memory, it supplies the address associated with that access to the cache memory. The cache memory system is responsible for comparing the requested address with the addresses of valid lines of data held in the cache. FIG. 2 also shows a mechanism for forwarding the address provided by the processor to the main memory, as required in the event the data is not resident in the cache.
The actual implementation of the address comparison is dictated by the structural organization of the cache. Two fundamental types of cache memory organization exist: direct mapped and fully associative. In a direct mapped cache, each block of main memory has a pre-assigned location in the cache memory. The comparison function for a direct mapped cache need only compare the tag associated with the pre-assigned block in the cache with the address supplied by the CPU. In a fully associative cache organization, in which a block of main memory may reside in any cache location, the CPU-supplied address is compared with all of the cache tags.
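A direct mapped lookup using separate tag and data elements can be sketched as follows. The block size, the number of lines, the class name `DirectMappedCache`, and the fill-on-miss behavior are assumptions made for illustration. The sketch models only reads and is not a description of any particular hardware.

```python
BLOCK_SIZE = 16  # bytes per block (assumed)
NUM_LINES = 8    # number of lines in the cache (assumed)


class DirectMappedCache:
    def __init__(self, main_memory: bytes):
        self.main_memory = main_memory
        self.valid = [False] * NUM_LINES             # tag element: valid bits
        self.tags = [0] * NUM_LINES                  # tag element: address tags
        self.data = [bytes(BLOCK_SIZE)] * NUM_LINES  # data element: block copies

    def read(self, address: int) -> tuple[int, bool]:
        """Return (byte_value, hit) for a byte address."""
        block = address // BLOCK_SIZE
        offset = address % BLOCK_SIZE
        index = block % NUM_LINES  # the pre-assigned cache location
        tag = block // NUM_LINES   # remaining address bits identify the block
        # Only the single tag at the pre-assigned location is compared.
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            # Miss: fetch the whole block from main memory and replace the line.
            start = block * BLOCK_SIZE
            self.data[index] = self.main_memory[start:start + BLOCK_SIZE]
            self.tags[index] = tag
            self.valid[index] = True
        return self.data[index][offset], hit


memory = bytes(range(256)) * 4  # 1 KiB of sample contents
cache = DirectMappedCache(memory)
first = cache.read(0x40)     # miss: the line is not yet valid
second = cache.read(0x41)    # hit: same block as 0x40
conflict = cache.read(0xC0)  # miss: same index as 0x40, different tag
```

In this sketch, addresses 0x40 and 0xC0 share a cache location because they lie exactly `BLOCK_SIZE * NUM_LINES` bytes apart, so reading one evicts the other. A fully associative organization would avoid that conflict at the cost of comparing every tag.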
Most cache implementations use a hybrid of these two methods known as a set-associative organization. In a set-associative mapping, each block in main memory is assigned to a set of cache blocks. When a set-associative cache is employed, the address issued by the CPU is compared with only those cache tags corresponding to the set of blocks to which the specified memory block is mapped. With all of these implementations, the tag corresponding to a cache line consists of a sufficient number of memory address bits to uniquely identify the specific block of main memory represented by the cache block, a valid bit, and usually other bits to identify the particular state of the cache line. A match between the CPU-supplied address and the address bits of a valid tag indicates that the line is present in the cache.
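The number of address bits a tag must hold depends on the cache organization, which the following sketch makes concrete. The function name `cache_fields`, the power-of-two sizes, and the example configuration (32-bit addresses, a 256 KiB cache, 32-byte blocks) are assumptions for illustration.

```python
def log2_exact(n: int) -> int:
    """Return log2(n), assuming n is a power of two."""
    assert n > 0 and (n & (n - 1)) == 0, "sizes are assumed to be powers of two"
    return n.bit_length() - 1


def cache_fields(address_bits: int, cache_bytes: int,
                 block_bytes: int, ways: int) -> tuple[int, int, int]:
    """Return (tag_bits, index_bits, offset_bits).

    ways == 1 describes a direct mapped cache; ways equal to the number of
    blocks describes a fully associative cache.
    """
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    offset_bits = log2_exact(block_bytes)  # selects a byte within the block
    index_bits = log2_exact(num_sets)      # selects the set to search
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# The stored tag entry also needs a valid bit and any line-state bits,
# which are not counted here.
direct_mapped = cache_fields(32, 256 * 1024, 32, ways=1)
four_way = cache_fields(32, 256 * 1024, 32, ways=4)
fully_associative = cache_fields(32, 256 * 1024, 32, ways=8192)
```

Under these assumptions the address tag grows from 14 bits (direct mapped) to 16 bits (four-way) to 27 bits (fully associative), because fewer address bits are consumed in selecting a set.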
As pipelined processor execution speeds have increased relative to main memory access times, modern computer systems have generally utilized a plurality of cache memories. Typically a very high speed first level cache is built as part of the microprocessor block. FIG. 3 shows a conceptual diagram of a microprocessor with an on-chip first level cache connected to a second level cache which in turn interfaces to the main memory. As many levels of cache as are practical may be used. Many modern microprocessors have two on-chip caches and may further be built into systems which employ an off-chip third level cache.
The performance of a computer memory system relates to how quickly memory accesses from the CPU are processed on average. When cache memories are used, access time differs between cases where the requested data is resident in the cache (a cache hit) and cases where it is not (a cache miss). Cache performance is enhanced by making cache accesses faster and by improving the hit rate. Cache design also focuses on the cost of implementation.
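The average behavior described above is commonly summarized by the average memory access time: the hit time plus the miss rate multiplied by the miss penalty. The sketch below applies that standard formula. The function name and the timing values (a 2 ns hit time and a 60 ns miss penalty) are assumptions for illustration, not measurements.

```python
def average_access_time(hit_time_ns: float, miss_penalty_ns: float,
                        hit_rate: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    miss_rate = 1.0 - hit_rate
    return hit_time_ns + miss_rate * miss_penalty_ns


# Assumed figures: 2 ns to service a hit, 60 ns extra to service a miss.
baseline = average_access_time(2.0, 60.0, hit_rate=0.95)    # about 5 ns
better_hits = average_access_time(2.0, 60.0, hit_rate=0.98) # about 3.2 ns
faster_hits = average_access_time(1.0, 60.0, hit_rate=0.95) # about 4 ns
```

With these assumed numbers, raising the hit rate from 95% to 98% and shortening the hit time from 2 ns to 1 ns each reduce the average, which corresponds to the two avenues of improvement named above.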
Cache system design involves making tradeoffs among speed, hit rate, and cost. It is well documented that for general applications, the larger the cache, the better the hit rate, and thus the performance. However, the larger RAM arrays needed for larger caches are typically slower than smaller arrays, negating some of the potential gain from increased cache sizes. The use of associative or set-associative caches typically provides better hit rates than direct mapped caches, though at an additional cost in the design.
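The effect of size and associativity on hit rate can be illustrated with a small simulation. This is a sketch under stated assumptions: the function `hit_rate`, the least-recently-used replacement policy, and the access trace are all hypothetical choices for the example. The trace is chosen specifically to provoke conflicts in a small direct mapped cache, so it does not represent general applications.

```python
from collections import OrderedDict


def hit_rate(block_trace: list[int], num_blocks: int, ways: int) -> float:
    """Fraction of accesses that hit in a cache holding `num_blocks` blocks.

    Each set holds `ways` blocks and evicts the least recently used block
    (an assumed policy). ways == 1 models a direct mapped cache.
    """
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        if block in lines:
            hits += 1
            lines.move_to_end(block)       # mark as most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return hits / len(block_trace)


# Alternate between two blocks that collide in an 8-block direct mapped cache.
trace = [0, 8] * 50
small_direct = hit_rate(trace, num_blocks=8, ways=1)   # every access misses
small_2way = hit_rate(trace, num_blocks=8, ways=2)     # only the first two miss
large_direct = hit_rate(trace, num_blocks=16, ways=1)  # only the first two miss
```

For this trace, either doubling the cache size or adding a second way removes the conflict. The simulation says nothing about the slower access of a larger array or the added design cost of associativity, which are the other sides of the tradeoff.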
There are also definite physical barriers to desired cache implementations. The size of a cache memory built on a microprocessor chip is limited by the costs and yield loss resulting from larger die sizes. Off-chip caches may more easily accommodate large cache sizes, but are limited by restrictions on the number of microprocessor chip pins that can practically be used to transfer addresses and data between the processor and the memory system. Further, a multiplicity of chips may be required to implement the off-chip cache, resulting in increased system cost.
The main memory of a computer system is built with random access memory devices (RAMs). The RAMs are accessed by an address supplied by the CPU. The contents of the RAMs are either instructions to the CPU or data to be manipulated by the CPU. The data and tag elements of a cache memory system are also implemented with some form of RAM. A portion of the same address used to access the main memory is also used to access the data and tag arrays of the cache memory. A sample implementation of an external cache in a microprocessor based system is shown in FIG. 4.
The level 2 cache shown in the diagram of FIG. 3 is represented by three distinct components in FIG. 4: a system control chip, a cache data element, and a cache tag element. In a typical system, the system control chip provides the physical link between the microprocessor and the other components of the computer system. These components include the main memory (shown) and system I/O components (not shown). The cache data memory element typically consists of a plurality of standard SRAM (static random access memory) chips. The tag memory element typically consists of one or more specialty SRAM chips that store the level 2 tags. These specialized tag RAMs include comparison circuitry for identifying whether the memory address supplied by the microprocessor matches the address of a block resident in the level 2 cache. The result of this comparison is supplied to the microprocessor and the system controller. The microprocessor uses this tag match indication to determine whether the requested data can be obtained from the level 2 data element. Similarly, the system controller uses the tag match indication to determine whether to continue processing the main memory access request.
Tag RAMs tend to be highly specialized to the particular application for which they are designed and thus tend to be significantly more expensive for the size of the arrays than are the more general purpose RAMs used for the data arrays. The added cost is due largely to the addition of special tag comparison circuitry, as described above. The use of tag RAMs thus adds a non-trivial additional cost to the implementation of these off chip caches.
Because of the costs associated with the implementation of off-chip caches, efforts have been made to achieve the benefits of these caches at reduced system cost. One such approach has been to build larger caches on the same chip as the processor. This has included the frequent use of multiple on-chip caches. The PMC RM7000 family of processors and many other microprocessors are examples of this. The existence of larger on-chip caches in many cases allows for adequate system performance without the addition of an off-chip cache. In some systems, however, an off-chip cache is still desirable.
Another approach has been to incorporate the tag element of an external cache memory on the processor chip itself. This avoids the need to provide a specialized tag RAM for the system. With this approach, however, the microprocessor die size is increased by the area required for the tag RAM, resulting in significantly higher manufacturing costs. In addition, the cost of the embedded tag RAM is incurred regardless of whether an external cache is actually implemented within the particular computer system.
RAM manufacturers have also made efforts to reduce the costs of tag RAMs used for off-chip cache implementations. These efforts focus on aspects of the manufacturing of the cache data and cache tag RAM chips. For example, U.S. Pat. No. 5,905,996, granted on May 18, 1999, discloses a cache design in which the tag memory is included within the same integrated circuit chip as the data memory. This approach allows the memory supplier to provide the tag and data functionality without the expense of manufacturing two separate parts. This allows the manufacturer to target the most cost effective array sizes in a given technology. This dual function chip is still somewhat specialized in that it includes the appropriate tag functionality as specified by the system requirements. As a result, these dual-function RAM devices are likely to be significantly more expensive than general purpose RAMs traditionally used for cache data arrays.
Another known approach involves increasing the width of an internal RAM array so that each cache data entry can be stored together with its associated tag bits. The tag bits corresponding to the addressed data entry are read simultaneously with the data bits. See “Design of High-performance Microprocessor Circuits,” IEEE Press copyright 2001, edited by Chandrakasan, Bowhill, and Fox, page 287. (The width of a RAM array represents the number of bits of memory that can be accessed simultaneously, while the depth of the array represents the number of distinct groups of these bits that are available.)
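The idea of widening an array so that a single read returns both the tag bits and the data bits can be sketched as follows. The field widths, the array depth, and the bit layout (valid bit, then tag, then data) are assumptions for illustration and are not taken from the cited reference.

```python
DATA_BITS = 64  # data bits per entry (assumed)
TAG_BITS = 14   # address tag bits per entry (assumed)
DEPTH = 1024    # number of entries in the array (assumed)

# Width of the widened array: one valid bit, the tag, and the data.
ENTRY_BITS = 1 + TAG_BITS + DATA_BITS


def pack_entry(valid: bool, tag: int, data: int) -> int:
    """Combine the valid bit, tag, and data into one array entry."""
    assert 0 <= tag < (1 << TAG_BITS) and 0 <= data < (1 << DATA_BITS)
    return (int(valid) << (TAG_BITS + DATA_BITS)) | (tag << DATA_BITS) | data


def unpack_entry(entry: int) -> tuple[bool, int, int]:
    """Split one array entry back into (valid, tag, data)."""
    data = entry & ((1 << DATA_BITS) - 1)
    tag = (entry >> DATA_BITS) & ((1 << TAG_BITS) - 1)
    valid = bool(entry >> (TAG_BITS + DATA_BITS))
    return valid, tag, data


ram = [0] * DEPTH  # depth = number of entries, width = ENTRY_BITS
ram[5] = pack_entry(True, 0x2A, 0xDEADBEEF)
valid, tag, data = unpack_entry(ram[5])  # one read yields tag and data together
```

Under these assumptions the array is 79 bits wide rather than 64, and no separate tag array has to be accessed.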
In a direct mapped cache implementation, the width of the RAM array required to implement the data array is determined by the width of the data transfer between the cache and the processor. In the case of a set-associative cache, the width of the array required is multiplied by the degree of associativity. A four-way associative cache requires four times the RAM array width of a comparable direct mapped cache implementation. U.S. Pat. No. 5,905,997 granted to AMD on May 18, 1999 relates to implementing the tag bits within a portion of the array width that would ordinarily be allocated to a data array in such an associative cache. In most applications, the additional array width required for associativity is provided by implementing a separate array for each way of the cache.
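The width relationship stated above can be worked through numerically. In the sketch below, the function name `array_shape`, the 64-bit transfer width, and the 256 KiB capacity are assumptions for illustration. The sketch also assumes that "comparable" means equal total capacity, so that a wider array is correspondingly shallower.

```python
def array_shape(capacity_bits: int, transfer_bits: int,
                ways: int) -> tuple[int, int]:
    """Return (width_bits, depth) of the data array for a given associativity.

    The width is the processor transfer width multiplied by the number of
    ways. The depth follows from holding total capacity fixed (an assumption).
    """
    width_bits = transfer_bits * ways
    depth = capacity_bits // width_bits
    return width_bits, depth


CAPACITY_BITS = 256 * 1024 * 8  # a 256 KiB data array (assumed)

direct_mapped = array_shape(CAPACITY_BITS, transfer_bits=64, ways=1)
four_way = array_shape(CAPACITY_BITS, transfer_bits=64, ways=4)
```

With these figures the direct mapped array is 64 bits wide and 32768 entries deep, while the four-way array is 256 bits wide and 8192 entries deep. The four-way case could equally be built as four separate arrays, each 64 bits wide and 8192 entries deep.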
In the AMD patent, a portion of one of the N RAM arrays used to implement an N-way associative cache is used to provide the tags associated with the other N−1 arrays. Because this first array is used for tags, and is only partially usable for data, this approach requires that N be two or more.
Another problem with the design of a cache memory system using separate data and tag arrays is that system designers typically cannot take advantage of advances in fabrication technology unless these advances have been incorporated into both types of memories. For example, tag RAM chips that implement new electrical interface standards may not become commercially available until well after such interface standards have been incorporated into general purpose SRAM chips.