The speed at which computer processors can execute instructions continues to outpace the ability of computer memory systems to supply instructions and data to the processors. Consequently, many high-performance computing systems provide a high-speed buffer storage unit, commonly called a cache or cache memory, between the working store or memory of the central processing unit (“CPU”) and the main memory.
A cache comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data. For the purposes of the present specification, unless specified otherwise, data will refer to any content of memory and may include, for example, instructions, data operated on by instructions, and memory addresses. Cache technology is based on the premise that computer programs frequently reuse the same data. Generally, when data is read from main system memory, a copy of the data is saved in the cache memory, along with an index identifying the associated main memory location. For each subsequent data request, the cache detects whether the requested data has already been stored in the cache. If the data is stored in the cache (referred to as a “hit”), the data is delivered immediately to the processor, and any fetch of the data from main memory is either not started or, if already started, aborted. On the other hand, if the requested data is not stored in the cache (referred to as a “miss”), then it is fetched from main memory and also saved in the cache for future access.
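The hit and miss behavior described above can be sketched in a few lines of Python. This is a minimal illustration only, not an implementation taken from the present specification: the class name `SimpleCache`, its fields, and the use of a dictionary to stand in for main memory are all assumptions made for the example, and the sketch ignores capacity limits and replacement.

```python
class SimpleCache:
    """Toy cache (hypothetical): maps a main-memory address to a saved copy of its data."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # dict standing in for main memory
        self.lines = {}                 # address (the index) -> cached copy
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            # Hit: deliver the cached copy; main memory is not accessed.
            self.hits += 1
            return self.lines[address]
        # Miss: fetch from main memory and save a copy for future access.
        self.misses += 1
        data = self.main_memory[address]
        self.lines[address] = data
        return data


memory = {0x10: "a", 0x20: "b"}
cache = SimpleCache(memory)
first = cache.read(0x10)   # miss, fetched from main memory
second = cache.read(0x10)  # hit, served from the cache
```

In this sketch the first read of an address counts as a miss and the second read of the same address counts as a hit, which mirrors the reuse premise on which cache technology is based.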
A level 1 cache (“L1”) generally refers to a memory bank built closest to the CPU, typically on the same chip die. A level 2 cache (“L2”) is a secondary staging area that feeds the L1 cache. The L2 cache may be built into the CPU chip, reside on a separate chip in a multichip package module, or be a separate bank of chips.
Address predictors are used to anticipate or predict future addresses in applications such as data prefetching or instruction scheduling. Prefetching systems and methods attempt to reduce memory latency by reducing the probability of a cache miss. The probability of a cache miss is reduced by anticipating or predicting what information will be requested before it is actually requested.
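As one illustration of address prediction, the following sketch predicts the next address when two consecutive accesses are separated by the same nonzero stride. The stride scheme is an assumption chosen for the example; the text above does not name a particular prediction technique, and the class name `StridePredictor` and its method `observe` are hypothetical.

```python
class StridePredictor:
    """Hypothetical predictor: repeats the last stride if it was seen twice in a row."""

    def __init__(self):
        self.last_address = None
        self.last_stride = None

    def observe(self, address):
        """Record an access; return a predicted next address, or None."""
        prediction = None
        if self.last_address is not None:
            stride = address - self.last_address
            # Predict only when the same nonzero stride has repeated.
            if stride != 0 and stride == self.last_stride:
                prediction = address + stride
            self.last_stride = stride
        self.last_address = address
        return prediction


predictor = StridePredictor()
predictions = [predictor.observe(a) for a in (100, 108, 116, 124)]
```

A prefetcher built on such a predictor would request the predicted address before the processor does, which is how the probability of a cache miss is reduced.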
One type of prefetcher used to decrease the impact of cache misses on processor performance is referred to herein as a run-ahead prefetcher. The run-ahead prefetcher is independently sequenced and is allowed to progress an arbitrary distance ahead of the processor. In particular, when the processor stalls, the run-ahead prefetcher can continue to operate.
Since the sequencing of run-ahead prefetching is done independently of the processor's program sequencing, it is possible for the run-ahead prefetcher to overflow the cache. Two types of overflow can occur. The first is referred to as prefetch overflow. Prefetch overflow occurs when the run-ahead prefetcher makes allocations that cause older prefetches to be replaced before the processor has referenced them. This would occur if the number of entries in the cache is N, but the run-ahead prefetcher has made N+1 allocations that have not yet been referenced by the processor. Normal Least Recently Used (LRU) replacement would cause the oldest element (the first allocation) to be replaced by the new, (N+1)th allocation. The second type of overflow occurs when an allocation initiated by the run-ahead prefetcher replaces a cache line that was allocated during normal execution and is still in use.
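Both overflow cases can be reproduced with a small simulation. The sketch below assumes a fully associative cache with plain LRU replacement; the names `LRUCache`, `allocate`, `reference`, and `run_ahead`, and the choice of N = 4, are hypothetical and serve only to illustrate the cases described above.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.lines = OrderedDict()  # address -> origin, least recently used first

    def allocate(self, address, origin):
        """Allocate a line; return the (address, origin) pair evicted, or None."""
        if address in self.lines:
            self.lines.move_to_end(address)
            return None
        evicted = None
        if len(self.lines) >= self.num_entries:
            evicted = self.lines.popitem(last=False)  # drop the LRU line
        self.lines[address] = origin
        return evicted

    def reference(self, address):
        """Processor access; return True on a hit, False on a miss."""
        if address in self.lines:
            self.lines.move_to_end(address)
            return True
        return False


def run_ahead(cache, addresses):
    """Issue prefetch allocations with no intervening processor references."""
    evictions = []
    for address in addresses:
        evicted = cache.allocate(address, "prefetch")
        if evicted is not None:
            evictions.append(evicted)
    return evictions


N = 4

# Case 1, prefetch overflow: N+1 unreferenced prefetches evict the first one.
cache = LRUCache(N)
prefetch_evictions = run_ahead(cache, range(N + 1))
hit_on_first_prefetch = cache.reference(0)

# Case 2: prefetches evict a line allocated during normal execution.
cache2 = LRUCache(N)
cache2.allocate(100, "demand")
demand_evictions = run_ahead(cache2, range(N))
```

In the first case the only eviction is the prefetch for address 0, so the processor later misses on data that had already been prefetched. In the second case the evicted line is the one allocated by normal execution at address 100.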
Ultimately, overflow detracts from the benefit provided by the run-ahead prefetcher. In the worst case, overflow completely eliminates the benefit of the run-ahead prefetcher or even degrades performance. What is needed is a run-ahead prefetcher with the capability to execute further ahead of the normal thread to expose more cache misses, while preserving the benefits of past allocations.