1. Field of the Invention
The present invention relates, in general, to cache memory, and, more particularly, to a cache memory design using long latency cache augmented by a short latency cache for caching the cache tag information held in the long latency cache.
2. Relevant Background
The ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors. As used herein the terms "microprocessor" and "processor" include complete instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. Most processors use a cache memory system to speed memory access. Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data.
Cache technology is based on a premise that programs frequently reuse the same instructions and data. When data is read from main system memory, a copy is also saved in the cache memory, along with the tag. The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data had indeed been stored in the cache, the data is delivered with low latency to the processor while the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data had not been previously stored in cache then it is fetched directly from main memory and also saved in cache for future access.
Modern processors support multiple cache levels, most often two or three levels of cache. Typically multi-level or hierarchical cache designs comprise lower cache structures that have shorter access latency than higher cache levels. As used herein, a lower cache level is a cache structure logically closer to the execution units of the processor whereas a higher cache level is logically closer to main memory or mass storage. Access latency is determined by a number of factors including physical size or capacity of the cache structure, size and complexity of the memory cells making up the cache structure, and location of the cache structure with respect to the memory access units and instruction execution units of the processor (i.e., on-chip or off-chip).
A level 1 cache (L1 cache or L1$) is usually an internal cache built onto the same monolithic integrated circuit (IC) as the processor itself. On-chip cache is typically the fastest (i.e., lowest latency) because it is smaller in capacity and can be accessed at the speed of the internal components of the processor. It is contemplated that two or more levels of cache may be implemented on chip in which case the higher cache levels are slower than the L1 cache due to physical layout, memory cell size, or a combination of other factors. On the other hand, off-chip cache has much higher latency as the off-chip propagation delays are great and off-chip cache typically has very high capacity compared to on-chip cache structures. Off-chip cache is typically much shorter latency than accesses to main memory.
Both on-chip and off-chip cache sizes of high-performance processors are continuously growing which tends to increase cache access latency relative to the processor. In contrast, processor clock speeds continually increase demanding more performance from the cache. For the foreseeable future, overall processor performance will often be limited by the cache and memory subsystem performance.
Each cache entry is typically accessed by an address tag stored separately in a tag random access memory (RAM). In a direct mapped cache each main memory address maps to a unique location in the cache. In fully associative cache, data from any main memory address can be stored in any cache location, hence, all address tags must be compared simultaneously (i.e., associatively) with the requested address, and if one matches, then its associated data is accessed. Set associative cache is a compromise between direct mapped cache and a fully associative cache where each address tag corresponds to a set of cache locations. A four-way set associative cache, for example, allows each address to map to four different cache locations.
Associative cache designs have a higher hit rate than similarly sized direct mapped caches and so offer performance advantages in particular applications such as technical and scientific applications. Associative cache is more difficult to implement when the tag store information is located off-chip or in a long latency tag RAM. In a fully associative or set associative cache design, the processor references multiple tag store RAM locations simultaneously for best performance. This requires multiple parallel input/output (I/O) pins supporting communication between the processor and an off-chip tag store. For example, a 4-way set associative cache typically requires 4 times as many I/O pins coupled to the tag store than does a direct mapped cache for best performance. As physical memory addresses become larger, the number of I/O pins is unwieldy or impossible to implement. Many times these I/O pins simply are not available. For this reason, almost all external cache designs that are supported by a microprocessor are direct mapped.
The number of address tags required in the tag store is proportional to the size of the cache. However, not only the size, (i.e., the number of tag entries) but also the physical width of the address tag is typically growing because larger physical memories need to be addressed. Larger physical memory spaces require more address bits and correspondingly wider memory to store each address tag. The address tag RAM physical size or capacity is the product of these parameters and so is growing faster than the cache itself.
It is desirable to minimize access time to read the contents of the cache tag. The cache tag contents are read to determine if requested data exists in the cache or whether the data must be fetched from main memory or mass storage. Access latency is greatest in higher cache levels. In most designs, at least some high-level cache is provided off-chip (i.e., in a discrete integrated circuit separate from the processor) therefore exhibiting even longer access latency.
To minimize access latency to cache tag it is desirable to keep the cache tag on-chip even for off-chip cache. However, because the area required by the cache tag is increasing faster than the cache itself, it is increasingly difficult to provide even the cache tag storage on-chip. A need exists for a cache design that reduces the area requirements for the tag store so that the tag information can be implemented on-chip to minimize cache tag access latency.
One method of reducing the size of the cache tag store is to increase the atomic unit of information addressed by each cache tag. This can be done by increasing the "granularity" of the cache. The "granularity" of a particular cache level refers to the smallest quantity of data that can be addressed, often referred to as the size of a cache line. Larger cache lines hold more data in each line and so the address can be less specific (i.e., the address requires fewer bits). This also results in fewer cache lines for a given cache size which is the more important effect. However, larger cache lines frequently result in loading data into cache that is not used as an entire cache line is filled even for a small memory request. Hence, increasing granularity results in inefficient cache usage and wasted data bandwidth in many applications. Using a technique called sub-blocking, selected levels (usually higher cache levels) in a hierarchical cache have a higher tag granularity by providing a set of valid bits per tag. Each valid bit corresponds to the size of a cache line of the lower-level cache. Hence, sub-blocking is a compromise that can be applied to improve cache efficiency of the lower cache levels while reducing the tag size of higher cache levels. Sub-blocking increases complexity of cache management, however, and, in particular, makes replacement more difficult. For example, sub-blocking is inefficient in inclusive cache designs.
The limitations of long latency cache pose particular problems in some processor applications. Particular examples include multiprocessing (MP) machines. In multiprocessors, instructions that incur a long latency memory access may result in stalling all instructions threads that operate on the long latency data. Typically the requesting processor will launch a memory access request and simultaneously broadcast a snoop request to all other processors. The other processors handle the snoop request by performing a tag store inquiry to identify whether a modified copy of the requested data exists in their cache. The requesting processor must wait until the inquiries are complete before committing itself to using the data obtained from the memory access requests. Hence, it is desirable to minimize the portion of the snoop latency associated with accessing long latency tag store information.
In speculative execution processors, including uniprocessor and multiprocessor machines, some instructions cannot execute until a prior instruction has completed execution and the results are available. For example, an instruction that operates on data fetched from memory is dependent upon one or more preceding memory instructions (e.g., a load instruction) that fetch the required data from memory into working registers. The dependent instruction cannot execute until all of the stored values have been retrieved from memory. Also, some instructions determine an address for a subsequent memory access instruction and so the subsequent instruction cannot execute until the prior instruction's results are available. This results in a situation called "pointer chasing" that imposes the memory access latency on multiple instructions.
In these applications and others, processor performance is very dependent on the latency to the various levels of cache and main memory. To expedite execution of dependent instructions, each cache level is probed by examining the tag store for that cache level to determine whether the required data exists in that cache level. Based upon results of the tag store inquiry, the dependent operation can be launched sooner when it is determined that the required data exists in cache. In the case of off-chip cache, however, the off-chip tag information can take many microprocessor cycles to return.