1. Field of the Invention
This invention relates generally to the field of memory interface design and, more particularly, to cache design in a computer system.
2. Description of the Related Art
As present-day computer systems grow more complex, and as advances in technology lead to ever-increasing processor speeds, it is becoming more difficult to optimize system performance, which oftentimes depends largely on the bandwidth and latency of the given system's memory. Consequently, accessing memory with the lowest latency and the highest available memory bandwidth may improve and/or optimize the system's performance. As the time required to access the memory and complete a given memory request increases, the system slows down. Thus, any reduction in access time, and/or an overall increase in throughput on the memory bus, may benefit system performance.
A large number of systems, including desktop computers, graphics adapter cards and notebook computers among others, use dynamic random access memory (DRAM). DRAM devices provide many advantages over other memory technologies, most notably static random access memory (SRAM) devices. The most important of these benefits are higher storage densities and lower power consumption. However, these benefits come at the expense of various time delays incurred when preparing the memory cells and other components within DRAM devices for each subsequent access, for example before/after each read/write access. Examples of such delays include the time required to perform row precharge, row refresh, and row activation. In order to more precisely manage and control memory operations when incurring these delays, additional commands, which are transmitted between read/write accesses, have been created, resulting in additional overhead.
Modern processors (e.g. microprocessors, controllers, microcontrollers or central processing units, i.e. CPUs) are typically faster than the memory (e.g. system DRAM) where the programs and program data are stored. As a result, the processor may stall, i.e. not operate at optimum performance, when instructions and/or data cannot be provided fast enough for the processor to keep executing in an uninterrupted manner. One solution to this problem has been the introduction of cache memory used by the processor to reduce the average time spent on accessing system memory. In general, a cache memory, most often simply referred to as cache, is a storage location that holds copies of data that may have been computed earlier, and/or stored elsewhere in the system, for example in system memory, from where it would generally take longer to fetch that data, e.g. due to longer access times, compared to the time it takes to fetch it from the cache. In other words, a cache is a temporary storage area where frequently accessed data can be stored for rapid access. Once the data has been stored in the cache, the data can be used in the future by accessing the cached copy rather than re-fetching or recomputing the original data, thereby reducing the average access time. Caches, therefore, help expedite access to data that the processor would otherwise need to fetch from main memory.
In general, a processor cache is a typically smaller and faster memory used by the processor to store copies of the data from the most frequently used system (main) memory locations, and is oftentimes configured on the same die as the processor itself. If most memory accesses are to the processor cache (CPU cache), the average latency of memory accesses will correspond more closely to the latency of the CPU cache than to the latency of the system memory. In systems configured with processors operating with a cache memory, when the processor needs to access a location in main memory, it first checks to determine whether the data corresponding to that main memory location is in the CPU cache. This is typically performed by comparing the address of the memory location to all cache tags in the cache. If a tag in the cache corresponds to the address of the memory location, the lookup operation results in a cache hit; otherwise it results in a cache miss. In the case of a cache hit, the processor immediately accesses the data in the cache line. The proportion of cache accesses resulting in a cache hit is referred to as the hit rate, which is generally used as an indicator of the cache memory's effectiveness. When a cache miss occurs, most caches allocate a new entry comprising a tag that corresponds to the address of the memory location, and a copy of the data from system memory. The reference can then be applied to the new entry, just as for a cache hit. Cache misses are comparatively slow since they require accessing the main system memory, incurring a delay due to the difference in speed between the system memory and the cache, while also incurring the additional overhead required for storing the new data in the cache before it is delivered to the processor.
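By way of illustration only, the tag comparison, hit/miss accounting, and entry allocation described above may be sketched as follows. The class name TinyCache, the fully associative organization, the least-recently-used replacement policy, and the latency model in average_latency are all assumptions made for this sketch; none of them is taken from any particular processor.

```python
from collections import OrderedDict


class TinyCache:
    """Illustrative fully associative cache with LRU replacement (assumed policy)."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # tag -> copy of the data held in that line
        self.hits = 0
        self.misses = 0

    def read(self, address, memory):
        """Return the value at `address`, consulting the cache before `memory`."""
        tag = address // self.line_size
        offset = address % self.line_size
        if tag in self.lines:
            # Cache hit: the tag matches, so the data comes from the cache line.
            self.hits += 1
            self.lines.move_to_end(tag)  # mark as most recently used
        else:
            # Cache miss: allocate a new entry (tag plus a copy of the data).
            self.misses += 1
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)  # evict the least recently used line
            base = tag * self.line_size
            self.lines[tag] = memory[base:base + self.line_size]
        # The reference is applied to the entry, exactly as for a hit.
        return self.lines[tag][offset]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def average_latency(hit_rate, cache_latency, memory_latency):
    """Assumed model: a miss pays the cache lookup plus the memory access."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * cache_latency + miss_rate * (cache_latency + memory_latency)
```

Under this assumed model, the average latency approaches cache_latency as the hit rate approaches 1, which is the behavior the paragraph above describes.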
Many modern processors have at least three independent caches, which include an instruction cache to speed up executable instruction load, a data cache to speed up data load and store, and a translation buffer used to speed up virtual-to-physical address translation for both executable instructions and data. Another issue associated with caches is the tradeoff between cache latency and cache hit rate. Larger caches typically have better hit rates but have a longer latency. In order to optimize systems in view of this tradeoff, many computers use multiple levels of cache, or multiple cache-levels, with smaller, faster caches backed up by larger, slower caches. Multi-level caches generally operate by checking the smallest cache, which is typically designated as the lowest-level cache, e.g. Level 1 (L1) cache, first, and if the result is a cache hit, the processor may be able to proceed at a high speed. If the result for L1 cache is a cache miss, the next larger cache (L2) is checked, and so on, before the external memory has to be accessed. As the latency difference between main memory and the fastest cache has become larger, the number of cache levels has risen. For example, some processors are now configured with as many as three levels of on-chip cache, including a level 3 (L3) cache in addition to L2 and L1 caches.
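By way of illustration only, the order in which a multi-level cache is checked may be sketched as follows. The function name multilevel_read, the per-level latencies supplied by the caller, and the choice to copy the data into every faster level on a hit or miss are hypothetical and chosen only to make the lookup order concrete. Capacity limits and replacement are deliberately omitted.

```python
def multilevel_read(levels, memory, memory_latency, address):
    """Check each cache level in order, smallest and fastest first.

    `levels` is a list of (name, latency_in_cycles, store) tuples, where
    `store` is a dict mapping addresses to cached values. Returns a tuple
    (value, where_found, total_cycles).
    """
    cycles = 0
    for index, (name, latency, store) in enumerate(levels):
        cycles += latency  # each level that is checked adds its latency
        if address in store:
            value = store[address]
            # Assumed fill policy: copy the data into all faster levels.
            for _, _, faster_store in levels[:index]:
                faster_store[address] = value
            return value, name, cycles
    # Every level missed, so the external memory has to be accessed.
    cycles += memory_latency
    value = memory[address]
    for _, _, store in levels:
        store[address] = value
    return value, "memory", cycles
```

With hypothetical latencies of 4, 12, and 40 cycles for L1, L2, and L3 and 200 cycles for memory, a first access to an address costs the sum of all four, while a repeated access is satisfied by L1 alone.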
Typically, the performance of many applications running on a given processor or processors can be limited by the amount of time the processor is stalled while servicing cache misses that require accesses to main memory. This is true even for very aggressive chip multithreading (CMT) processors. For example, for an aggressive 128-strand CMT processor with a memory latency of 360 processor clocks, simulations show that performance for TPC-C and SPECjbb2005 improves by 39% and 26% respectively if the cache misses are eliminated. For the SPECint2006 benchmark suite, the geometric mean performance of the suite improves by 18% if the cache misses are eliminated, and the performance of some of the individual benchmarks in the suite can improve by as much as 96%.
Oftentimes a microprocessor will use prefetching to speed up execution by reducing processor stalls. Prefetching typically comprises the processor fetching one or more instructions and/or data from the system memory some time before the processor actually needs the respective instructions and/or data. This can reduce or eliminate the time the processor spends waiting for the memory to answer its request. The prefetched instruction and/or data may simply be the next instruction and/or next required chunk of data in the currently running program, and may be fetched while the current instruction is being executed. The prefetch may also be part of a complex prediction algorithm, where the processor tries to anticipate the result of a calculation and fetch the proper instructions and/or data in advance. One common prefetching approach includes performing a sequential readahead. In its simplest form, a sequential readahead is implemented to prefetch one block beyond a requested block. When prefetching one block, the next block may be prefetched on each reference, it may be prefetched only on a miss, or it may be prefetched only if the referenced block is accessed for the first time. Another form of sequential prefetching implements prefetching a specified number of blocks, instead of a single block, with the number of prefetched blocks typically referred to as the degree of prefetch, or prefetch degree. In other words, the prefetch degree is an indicator used to determine how many prefetches are performed for a given memory access request. For example, if “X” is the last memory access request and “P” is the last prefetch that was issued in an access stream, then on the next access (i.e. access “X+1”) the prefetch degree determines how many prefetches are issued. For example, if the prefetch degree is 4, then prefetches “P+1”, “P+2”, “P+3”, and “P+4” are issued.
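By way of illustration only, the three one-block readahead policies and the prefetch degree described above may be sketched as follows. The function names, the policy labels "always", "on_miss", and "first_access", and the use of integer block numbers are assumptions made for this sketch.

```python
def should_prefetch(policy, was_hit, first_access):
    """Decide whether a reference triggers a one-block sequential readahead.

    `policy` is one of three hypothetical labels:
      "always"       - prefetch the next block on each reference
      "on_miss"      - prefetch the next block only on a miss
      "first_access" - prefetch only if the referenced block is accessed
                       for the first time
    """
    if policy == "always":
        return True
    if policy == "on_miss":
        return not was_hit
    if policy == "first_access":
        return first_access
    raise ValueError(f"unknown policy: {policy}")


def issue_prefetches(last_prefetch, degree):
    """Return (blocks_to_prefetch, new_last_prefetch) for the next access.

    With last prefetch P and prefetch degree d, blocks P+1 .. P+d are issued.
    """
    blocks = [last_prefetch + k for k in range(1, degree + 1)]
    new_last = blocks[-1] if blocks else last_prefetch
    return blocks, new_last
```

For instance, calling issue_prefetches(100, 4) yields blocks 101 through 104 and a new last prefetch of 104, matching the “P+1” through “P+4” example above with P equal to 100.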
An advanced prefetching technique is correlation prefetching (hardware and/or software), which typically uses the current state of the reference or miss stream to predict and prefetch future misses. Experimental evaluations have shown that hardware correlation prefetching is a promising technique for eliminating cache misses. However, one of the drawbacks of correlation prefetching is its relatively poor accuracy. Inaccurate prefetches not only pollute caches but also waste memory bandwidth. Memory bandwidth is a precious resource in CMT processors where the memory bandwidth of the processor is shared by the many concurrently running threads. This problem can be aggravated by the increased cache miss rates caused by the sharing of the caches by the many threads.
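By way of illustration only, one simple form of correlation prefetching may be sketched as a table that records, for each miss address, the miss addresses that most recently followed it. The pair-based table, the class name CorrelationPrefetcher, and the successors_per_entry parameter are assumptions made for this sketch and do not describe any specific prior-art design.

```python
from collections import defaultdict


class CorrelationPrefetcher:
    """Illustrative pair-based correlation table keyed by miss address."""

    def __init__(self, successors_per_entry=2):
        self.successors_per_entry = successors_per_entry
        # miss address -> list of observed successor misses, most recent first
        self.table = defaultdict(list)
        self.last_miss = None

    def on_miss(self, address):
        """Record a miss and return the addresses predicted to miss next."""
        if self.last_miss is not None:
            successors = self.table[self.last_miss]
            if address in successors:
                successors.remove(address)
            successors.insert(0, address)  # most recent successor goes first
            del successors[self.successors_per_entry:]  # bound the entry size
        self.last_miss = address
        # Predictions are the successors previously seen after this address.
        return list(self.table.get(address, []))
```

Each returned address is only a prediction based on past miss sequences. When the miss stream does not repeat, the returned addresses are never referenced, which is how inaccurate prefetches come to pollute the cache and consume memory bandwidth as noted above.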
Other corresponding issues related to the prior art will become apparent to one skilled in the art after comparing such prior art with the present invention as described herein.