The present invention relates to an apparatus and method for caching program instructions in a processor system. More particularly, the present invention is a new method for providing high prefetch accuracy while using less hardware than previous methods.
A computer system, in its most essential form, comprises a processor, a main memory, and an I/O device through which the computer system communicates with an end-user. The end-user provides the computer system with a program comprising a set of instructions or codes directing the processor to perform different tasks. Generally, the tasks involve manipulating data that is provided to the computer system by the end-user. Both the data and the codes are stored in the main memory, which is typically a Dynamic Random Access Memory or DRAM. The processor has to fetch the codes and the data, manipulate the data according to the program, and then store the result back in the DRAM.
Both the processor and the memory have become faster and faster as the technology has advanced in the field of electronics. However, the speed with which today's processors are able to execute instructions remains much faster relative to the speed with which the memory is able to deliver stored data. This difference of speed, referred to as memory latency, causes an obvious problem. The processor has to remain idle while it is waiting for the slower memory to make the next piece of data available. Reducing memory latency is of great interest to computer users because it will result in improving the overall performance of the computer system.
One way to reduce memory latency is to utilize a faster intermediate level of memory known as a cache. A cache is a fast memory storage device that stores blocks of data and codes recently used by the processor. However, cache memory is also more expensive than DRAM, and thus only a relatively small cache is used in conjunction with the DRAM. The cache works as follows. When the processor requests data, that data is transferred from the DRAM to the cache and then from the cache to the processor. This way a copy of the data remains in the cache. On the next processor request for data, the much faster cache is checked, prior to sending the request to the DRAM, to see whether the requested data is available locally in the cache. If it is, then there is no need to retrieve the data from the DRAM and the processor can get its request filled at the cache (a cache hit). On the other hand, when the cache does not contain the requested data or code, a cache miss occurs. In this case, the data must be retrieved from the DRAM, and the processor is unable to save any time as it would through a cache hit. Thus it is extremely desirable to reduce cache misses or increase cache hits.
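By way of illustration only, the hit and miss behavior described above may be sketched in software as follows. The sketch assumes a small fully associative cache with least-recently-used replacement and a 32-byte line; the replacement policy, the class name `Cache`, and the sample addresses are assumptions made for the example and are not taken from the description above.

```python
from collections import OrderedDict

LINE_SIZE = 32  # bytes per cache line (assumed for illustration)


class Cache:
    """Tiny fully associative cache with LRU replacement (an assumed policy)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # cache-line number -> placeholder for the data
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // LINE_SIZE
        if line in self.lines:
            # Requested data is available locally: a cache hit.
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return "hit"
        # Cache miss: the line must be retrieved from the DRAM.
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True  # a copy of the fetched line stays in the cache
        return "miss"


cache = Cache(num_lines=4)
results = [cache.access(address) for address in (0, 4, 64, 0, 128)]
print(results)  # ['miss', 'hit', 'miss', 'hit', 'miss']
```

Addresses 0 and 4 fall in the same 32-byte line, so the second request is filled from the cache, while each first touch of a new line goes to the DRAM.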
Several methods have been suggested to reduce cache misses. For example, hardware prefetching can be an extremely effective technique for reducing cache misses. One of the most common prefetching techniques, known as inline or next-in-sequence, is to prefetch the next consecutive cache line on a cache access. For example, if the processor requests data stored in cache line X, then the hardware generates a prefetch for cache line X+1. The hardware is guessing that the program will want the following cache line next. If the guess is correct, then prefetching has avoided a cache miss. Eliminating cache misses reduces the effective memory latency and has a positive impact on overall system performance. However, if the guess was incorrect and the cache line X+1 is not used by the processor, then the prefetch has been a waste and could have actually caused harm to system performance by clogging the paths between the processor and the memory.
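The next-in-sequence technique described above may be sketched, under simplifying assumptions, as a short simulation. The cache is modelled as an unbounded set of cache-line numbers so that evictions are ignored, and the function name `count_misses` and the two access traces are hypothetical choices made for the example.

```python
def count_misses(line_trace, prefetch_next):
    """Return (demand misses, prefetches issued) for a trace of cache-line numbers.

    The cache is modelled as an unbounded set, so evictions are ignored.
    """
    cache = set()
    misses = 0
    prefetches = 0
    for line in line_trace:
        if line not in cache:
            misses += 1
            cache.add(line)
        if prefetch_next and (line + 1) not in cache:
            cache.add(line + 1)  # guess: the program will want line X+1 next
            prefetches += 1
    return misses, prefetches


sequential = list(range(8))      # lines 0, 1, 2, ..., 7
strided = list(range(0, 16, 2))  # lines 0, 2, 4, ..., 14

print(count_misses(sequential, prefetch_next=False))  # (8, 0)
print(count_misses(sequential, prefetch_next=True))   # (1, 8)
print(count_misses(strided, prefetch_next=True))      # (8, 8)
```

For the sequential trace the guess is correct each time and only the first access misses. For the strided trace every prefetch is unused: the miss count is unchanged and each prefetch merely adds traffic between the processor and the memory.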
Performance could also be degraded by a condition commonly referred to as cache pollution. When a prefetched cache line is placed in the cache, another cache line must be evicted in order to make room for the new entry. If the prefetched line is subsequently used by the processor, a miss has been avoided and performance is improved. However, if the processor never requests the prefetched line but instead requests the cache line that was evicted, then a cache miss has been created. Cache pollution occurs when the hardware prefetcher fills the cache with unused prefetches and generates additional cache misses. If the cache becomes too polluted, the miss rate will increase and prefetching will actually have a negative impact on performance.
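Cache pollution may be illustrated with a minimal sketch in which prefetched lines are placed directly in the cache. The two-line cache, the least-recently-used replacement policy, the function name `run`, and the access trace are assumptions chosen to make the effect visible; they are not taken from the description above.

```python
from collections import OrderedDict


def run(trace, capacity, prefetch_into_cache):
    """Count demand misses for a trace of cache-line numbers (LRU replacement assumed)."""
    cache = OrderedDict()
    misses = 0

    def insert(line):
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used line
        cache[line] = True

    for line in trace:
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            insert(line)
        if prefetch_into_cache and (line + 1) not in cache:
            insert(line + 1)  # the prefetched line displaces an existing entry
    return misses


trace = [0, 10, 0, 10, 0, 10]  # the program alternates between two lines

print(run(trace, capacity=2, prefetch_into_cache=False))  # 2
print(run(trace, capacity=2, prefetch_into_cache=True))   # 6
```

Without prefetching, both lines fit in the cache and only the first two accesses miss. With prefetching into the cache, the unused lines 1 and 11 repeatedly evict the lines the program actually needs, so every access becomes a miss.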
A common method of preventing cache pollution is through the use of a Prefetch Buffer (PFB). When a prefetch request returns from memory, the prefetched data is stored in the PFB instead of the cache. When the processor requests data, both the cache and the PFB are searched to see whether the data is available. If the data is found in the PFB, the prefetched data is transferred to the cache. This guarantees that only data that has been requested by the processor resides in the cache, and thereby prevents cache pollution: no matter how inaccurate the hardware prefetcher is, it will not increase the cache miss rate.
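The PFB arrangement described above may be sketched as follows. The sketch assumes least-recently-used replacement for both the cache and the PFB, a next-in-sequence prefetcher, and hypothetical names (`insert_lru`, `run_with_pfb`) and sizes; it is an illustration under those assumptions rather than a definitive implementation.

```python
from collections import OrderedDict


def insert_lru(store, line, capacity):
    """Insert a line, evicting the least recently used entry if the store is full."""
    if len(store) >= capacity:
        store.popitem(last=False)
    store[line] = True


def run_with_pfb(trace, cache_lines, pfb_entries):
    """Return (demand misses, PFB hits) for a trace of cache-line numbers."""
    cache = OrderedDict()
    pfb = OrderedDict()
    misses = 0
    pfb_hits = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)  # ordinary cache hit
        elif line in pfb:
            # Found in the PFB: transfer the prefetched line to the cache.
            del pfb[line]
            insert_lru(cache, line, cache_lines)
            pfb_hits += 1
        else:
            misses += 1  # found in neither structure: fetch from the DRAM
            insert_lru(cache, line, cache_lines)
        next_line = line + 1
        if next_line not in cache and next_line not in pfb:
            # Prefetched data is stored in the PFB, never directly in the cache.
            insert_lru(pfb, next_line, pfb_entries)
    return misses, pfb_hits


print(run_with_pfb([0, 10, 0, 10, 0, 10], cache_lines=2, pfb_entries=2))  # (2, 0)
print(run_with_pfb([0, 1, 2, 3], cache_lines=2, pfb_entries=2))           # (1, 3)
```

In the first trace every prefetch is unused, yet the miss count stays at two because the unused lines never leave the PFB. In the second trace the prefetches are useful, and three requests are filled from the PFB instead of the DRAM.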
Although the PFB is a very effective filtering mechanism, it is highly inefficient. Each entry requires storage for both an address tag, usually around 10 bits, and a 32-byte cache line. However, a large portion of the entries are never used by the processor, and entries that are not used are wasted data storage space. Although the address tag of a bad prefetch may be used to prevent prefetching to the same address again, the 32 bytes of data stored for the bad prefetch are a complete waste of hardware space. It would be desirable to accomplish the same filtering results but with less hardware.
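The storage cost per PFB entry can be made concrete with a short calculation. The 10-bit tag and 32-byte line are the figures given above; the buffer size of 16 entries and the unused fraction of one half are hypothetical values chosen only for the example.

```python
TAG_BITS = 10     # address tag, "around 10 bits" per the description above
LINE_BYTES = 32   # cache line size per the description above
PFB_ENTRIES = 16  # hypothetical buffer size, not taken from the description

data_bits = LINE_BYTES * 8            # 256 bits of line data per entry
entry_bits = TAG_BITS + data_bits     # 266 bits per entry in total
tag_fraction = TAG_BITS / entry_bits  # share of an entry occupied by the tag


def wasted_data_bits(entries, unused_fraction):
    """Data bits held by entries that the processor never requests."""
    return int(entries * unused_fraction) * data_bits


print(entry_bits)                          # 266
print(round(tag_fraction * 100, 1))        # 3.8 (percent)
print(wasted_data_bits(PFB_ENTRIES, 0.5))  # 2048, with half the entries assumed unused
```

Under these figures the tag accounts for under 4 percent of each entry, so nearly all of the storage spent on an unused prefetch is line data that serves no purpose.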