1. Field of the Invention
The present invention relates, in general, to cache memory, and, more particularly, to a cache memory design using a reduced area for storing tag information used to access the cache memory.
2. Relevant Background
The ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors. As used herein, the terms "microprocessor" and "processor" include complex instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. Most processors use a cache memory system to speed memory access. Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data.
Cache technology is based on the premise that programs frequently reuse the same instructions and data. When data is read from main system memory, a copy is also saved in the cache memory, along with its address tag. The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data has indeed been stored in the cache, it is delivered with low latency to the processor while the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data has not previously been stored in the cache, then it is fetched directly from main memory and also saved in the cache for future access.
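By way of illustration only, the hit and miss behavior described above may be sketched as follows. The class name `SimpleCache`, its fields, and the use of a dictionary in place of separate tag and data arrays are assumptions made for this sketch and do not describe any particular hardware design.

```python
class SimpleCache:
    """Toy cache: a dict stands in for the tag store plus data store."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # backing store: address -> value
        self.lines = {}                 # cached copies: address (tag) -> value
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:       # tag match: serve the cached copy
            self.hits += 1
            return self.lines[address]
        self.misses += 1                # miss: fetch from main memory
        value = self.main_memory[address]
        self.lines[address] = value     # save a copy for future accesses
        return value


memory = {0x100: "a", 0x104: "b"}
cache = SimpleCache(memory)
first = cache.read(0x100)   # miss, filled from main memory
second = cache.read(0x100)  # hit, served from the cache
```

The second read of the same address is served without consulting `main_memory`, which is the reuse that the premise above relies on.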
Another feature of access patterns to stored information is that they often exhibit "spatial locality". Spatial locality is a property that information (i.e., instructions and data) that is required to execute a program is often close in physical address space in the memory media (e.g., random access memory (RAM), disk storage, and the like) to other data that will be needed in the near future. Cache designs take limited advantage of spatial locality by filling the cache not only with information that is specifically requested, but also with additional information that is spatially near the specifically requested data. Efforts are also made to "prefetch" data that is spatially near specifically requested data.
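As a minimal sketch of how filling an entire line exploits spatial locality, the following model records which lines are present and reports whether each access hits. The 64-byte line size and the name `line_fill` are assumptions chosen for illustration.

```python
LINE_BYTES = 64  # assumed line size for this sketch


def line_fill(cached_lines, address):
    """Return True on a hit; on a miss, bring in the whole line holding address."""
    line = address // LINE_BYTES   # every address in the same line shares this number
    hit = line in cached_lines
    cached_lines.add(line)         # filling the line also caches its neighbors
    return hit


cached = set()
# Three accesses fall within one 64-byte line; the fourth starts the next line.
results = [line_fill(cached, a) for a in (0x1000, 0x1008, 0x1038, 0x1040)]
```

Only the first access to each line misses. The nearby addresses hit because they arrived with the line that was filled for the first request.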
A level 1 cache (L1 cache or L1$) is usually an internal cache built onto the same monolithic integrated circuit (IC) as the processor itself. On-chip cache is typically the fastest (i.e., lowest latency) because it is smaller in capacity and can be accessed at the speed of the internal components of the processor. It is contemplated that two or more levels of cache may be implemented on chip, in which case the higher cache levels are slower than the L1 cache. On the other hand, off-chip cache (i.e., provided in a discrete integrated circuit separate from the processor) has much higher latency because the off-chip propagation delays are great and off-chip cache typically has very high capacity compared to on-chip cache structures. Off-chip cache nonetheless typically has much shorter latency than accesses to main memory. In most designs, at least some high-level cache is provided off-chip.
Both on-chip and off-chip cache sizes of high-performance processors are continuously growing, which tends to increase cache access latency relative to the processor. At the same time, processor clock speeds continually increase, demanding more performance from the cache. For the foreseeable future, overall processor performance will often be limited by the cache and memory subsystem performance.
Each cache entry is typically accessed by an address tag stored separately in a tag random access memory (RAM). In a direct mapped cache each main memory address maps to a unique location in the cache. In fully associative cache, data from any main memory address can be stored in any cache location, hence, all address tags must be compared simultaneously (i.e., associatively) with the requested address, and if one matches, then its associated data is accessed. Set associative cache is a compromise between direct mapped cache and a fully associative cache where each address tag corresponds to a set of cache locations. A four-way set associative cache, for example, allows each address tag to map to four different cache locations.
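The mapping schemes described above may be illustrated by splitting an address into a tag, a set index, and a byte offset. This is a sketch under stated assumptions: the function names `split_address` and `lookup`, the list-of-lists tag store, and the example sizes are hypothetical. A direct mapped cache corresponds to one way per set, and a fully associative cache corresponds to a single set.

```python
def split_address(address, line_bytes, num_sets):
    """Split an address into (tag, set_index, byte_offset)."""
    offset = address % line_bytes
    line_number = address // line_bytes
    set_index = line_number % num_sets
    tag = line_number // num_sets
    return tag, set_index, offset


def lookup(tag_store, address, line_bytes, num_sets):
    """Return the way that holds the address, or None on a miss.

    tag_store[set_index] is a list with one stored tag (or None) per way.
    Hardware compares all ways of the set in parallel; this loop is only
    a sequential model of that comparison.
    """
    tag, set_index, _ = split_address(address, line_bytes, num_sets)
    for way, stored_tag in enumerate(tag_store[set_index]):
        if stored_tag == tag:
            return way
    return None


# Assumed geometry: 64-byte lines, 4 sets, 2 ways per set.
tag_store = [[None, None] for _ in range(4)]
tag, set_index, offset = split_address(0x1234, 64, 4)
tag_store[set_index][1] = tag             # place the line in way 1 of its set
hit_way = lookup(tag_store, 0x1234, 64, 4)
miss = lookup(tag_store, 0x1234 + 64, 64, 4)
```

In this sketch, raising the number of ways gives a line more candidate locations within its set, at the cost of more tag comparisons per access.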
Associative cache designs have a higher hit rate than similarly sized direct mapped caches and offer performance advantages in particular applications such as technical and scientific applications. Associative cache is more difficult to implement when the tag store information is located off-chip and in applications where each tag comprises a large number of bits. In a fully associative or set associative cache design, the processor references multiple tag store RAM locations simultaneously for best performance. This requires multiple parallel input/output (I/O) pins supporting communication between the processor and an off-chip tag store. For example, a 4-way set associative cache typically requires four times as many I/O pins between the tag store and the processor as does a direct mapped cache for best performance. As physical memory addresses become larger, the number of I/O pins becomes unwieldy or impossible to implement. Many times these I/O pins simply are not available. For this reason, almost all external cache designs that are supported by a microprocessor are direct mapped.
The number of address tags required in the tag store is proportional to the size of the cache. However, not only the size, (i.e., the number of tag entries) but also the physical width of the address tag is typically growing because larger physical memories need to be addressed. Larger physical memory spaces require more address bits and correspondingly wider memory to store each address tag. The address tag RAM physical size or capacity is the product of these parameters and so is growing faster than the cache itself.
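The growth of the tag store can be made concrete with a short calculation. The sketch below assumes power-of-two sizes and counts only the tag bits themselves, ignoring valid and state bits. The function name `tag_store_bits` and the example cache sizes and address widths are illustrative assumptions.

```python
def tag_store_bits(cache_bytes, line_bytes, address_bits, ways=1):
    """Total tag bits for a cache, assuming power-of-two sizes.

    Only the address tag is counted; valid and state bits are ignored.
    """
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    offset_bits = (line_bytes - 1).bit_length()  # bits selecting a byte in a line
    index_bits = (num_sets - 1).bit_length()     # bits selecting a set
    tag_bits = address_bits - index_bits - offset_bits
    return num_lines * tag_bits                  # number of tags times tag width


MIB = 1024 * 1024
small = tag_store_bits(1 * MIB, 64, address_bits=32)       # 16384 tags of 12 bits
wider = tag_store_bits(1 * MIB, 64, address_bits=44)       # 16384 tags of 24 bits
larger = tag_store_bits(4 * MIB, 64, address_bits=44)      # 65536 tags of 22 bits
long_lines = tag_store_bits(4 * MIB, 256, address_bits=44) # 16384 tags of 22 bits
```

Under these assumptions, widening the address from 32 to 44 bits doubles the tag store for the same cache, and enlarging the cache on top of that multiplies the number of tags as well. The last case shows that longer lines reduce the tag count for a given cache size.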
It is desirable to minimize access time to read the contents of the cache tag. The contents of the cache tag are read to determine if requested data exists in the cache or whether the data must be fetched from main memory or mass storage. The contents of the cache tag also provide address information needed to access the cached data. To minimize access latency to cache tag it is desirable to keep the cache tag information in low latency structures even for high latency caches and off-chip cache. However, because the area required by the cache tag is increasing faster than the cache itself, it is increasingly difficult to provide even the cache tag storage in low latency on-chip structures. A need exists for a cache design that reduces the area requirements for the tag store so that the tag information can be implemented on-chip and in small low latency structures.
One method of reducing the size of the cache tag store is to increase the atomic unit of information addressed by each cache tag. This can be done by increasing the "granularity" of the cache. The "granularity" of a particular cache level refers to the smallest quantity of data that can be addressed, often referred to as the size of a cache line. Larger cache lines hold more data in each line and so the address can be less specific (i.e., the address requires fewer bits). This also results in fewer cache lines for a given cache size, which is the more important effect. However, larger cache lines frequently result in loading data into the cache that is never used, because an entire cache line is filled even for a small memory request. Hence, increasing granularity results in inefficient cache usage and wasted data bandwidth in many applications. Using a technique called sub-blocking, selected levels (usually higher cache levels) in a hierarchical cache have a higher tag granularity by providing a set of valid bits per tag. Each valid bit corresponds to the size of a cache line of the lower level cache. Hence, sub-blocking is a compromise that can be applied to improve cache efficiency of the lower cache levels while reducing the tag size of higher cache levels. Sub-blocking increases the complexity of cache management, however, and in particular makes replacement more difficult. For example, sub-blocking is inefficient in inclusive cache designs.
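Sub-blocking as described above may be sketched as a single tag that carries one valid bit per sub-block. The class name `SubBlockedLine` and its methods are hypothetical, and the replacement behavior shown (clearing every valid bit when the tag changes) is an assumption of this sketch rather than a requirement of any particular design.

```python
class SubBlockedLine:
    """One tag shared by several sub-blocks, each with its own valid bit."""

    def __init__(self, sub_blocks):
        self.tag = None
        self.valid = [False] * sub_blocks

    def fill(self, tag, sub_block):
        if self.tag != tag:
            # Replacing the tag evicts every sub-block that shared the old tag.
            self.tag = tag
            self.valid = [False] * len(self.valid)
        self.valid[sub_block] = True   # only the requested sub-block is loaded

    def hit(self, tag, sub_block):
        return self.tag == tag and self.valid[sub_block]


line = SubBlockedLine(sub_blocks=4)
line.fill(tag=7, sub_block=2)
same_tag_other_block = line.hit(7, 1)   # tag matches but sub-block was not loaded
line.fill(tag=9, sub_block=0)           # replacement invalidates tag 7 entirely
```

The sketch also shows why replacement is harder with sub-blocking: displacing one tag discards all of the valid sub-blocks that share it.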
The limitations of long latency cache pose particular problems in some processor applications. Particular examples include multiprocessing (MP) machines. In multiprocessors, instructions that incur a long latency memory access may result in stalling all instructions that operate on the long latency data. Typically the requesting processor will launch a memory access request and simultaneously broadcast a snoop request to all other processors. The other processors handle the snoop request by performing a tag store inquiry to identify whether a modified copy of the requested data exists in their cache. The requesting processor must wait until the inquiries are complete before committing itself to using the data obtained from the memory access requests. Hence, it is desirable to minimize the portion of the snoop latency associated with accessing long latency tag store information.
In speculative execution processors, including uniprocessor and multiprocessor machines, some instructions cannot execute until a prior instruction has completed execution and its results are available. For example, an instruction that operates on data fetched from memory is dependent upon one or more preceding memory instructions (e.g., a load instruction) that fetch the required data from memory into working registers. The dependent instruction cannot execute until all of the stored values have been retrieved from memory. Also, some instructions determine an address for a subsequent memory access instruction and so the subsequent instruction cannot execute until the prior instruction's results are available. This results in a situation called "pointer chasing" that imposes the memory access latency on multiple instructions. In these applications and others, processor performance is very dependent on the latency to the various levels of cache and main memory.
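Pointer chasing may be illustrated with a simple latency model in which each load supplies the address for the next. The function name `chase`, the dictionary standing in for memory, and the fixed 100-cycle load latency are assumptions made for this sketch.

```python
def chase(memory, start, load_latency):
    """Follow stored 'next' addresses from start until None.

    Returns (visited_addresses, total_cycles). Each load's address comes
    from the previous load, so the latencies add up serially.
    """
    visited = []
    cycles = 0
    address = start
    while address is not None:
        visited.append(address)
        cycles += load_latency      # next address is unknown until this load completes
        address = memory[address]   # the loaded value is the next address
    return visited, cycles


# Assumed three-node chain: 0x10 -> 0x40 -> 0x20 -> end.
memory = {0x10: 0x40, 0x40: 0x20, 0x20: None}
visited, cycles = chase(memory, 0x10, load_latency=100)
```

In this model the total delay is the per-load latency multiplied by the chain length, since no load in the chain can begin before the preceding load has returned.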