This invention relates generally to computer system cache memory access, and more particularly to enhancing timeliness of cache memory prefetching.
Increases in memory access delays have become one of the major concerns of microprocessor designers. As processor pipelines get faster in raw execution speed, performance loss due to local cache misses becomes more significant. Data prefetching is a promising technique to mitigate this concern. Data prefetching speculates on future memory accesses. By bringing the data for predicted demand accesses into a target cache earlier than it is actually demanded, possible cache misses can be handled earlier, and the target cache can then supply data without accruing the full delay that would be incurred if these cache misses were discovered only at the time of demand. Each cache miss involves the transfer of a unit of storage, namely a cache line. Each cache line typically includes multiple bytes of data, e.g., 64 or 128 bytes, while a demanded address may target only data at a smaller granularity, such as a single byte or word of data. The data may be instructions or operands for the instructions.
There are two key design elements in data prefetching: what to prefetch and when to prefetch. Existing approaches focus primarily on determining what to prefetch by detecting a repeated pattern exhibited by a sequence of memory references. The reference pattern is tracked, and prefetches are issued as long as the pattern continues. This approach fails to consider when to prefetch.
FIG. 2 depicts an example of a stride pattern 200 with a stride distance d. When a demand for address X−d 202 occurs, a prefetching attempt for address X 204 is made and a cache line L+1 206 is speculatively brought into the target cache prior to the demand access for address X 204. When a demand for address X 204 occurs, a prefetching attempt for address X+d 208 is made. However, the attempt becomes void, because the address X+d 208 is mapped to the same cache line as the current demand access for address X 204. The prefetching attempt for the next cache line L+2 210 is finally made when the demand access steps through addresses X+2d 212 and X+3d 214, and reaches address X+4d 216. The effectiveness of prefetching for the cache line L+2 210 depends on how many cycles elapse between the accesses to addresses X+4d 216 and X+5d 218. Often, the time slack between these two accesses (X+4d 216 and X+5d 218) is not enough to hide the memory latency if the cache line L+2 210 is not in the target cache. This conventional approach to prefetching is referred to herein as “standard stride prefetching”.
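The behavior described for FIG. 2 can be illustrated with a minimal sketch. The function names, the 64-byte line size, and the concrete values chosen for X and d are assumptions made purely for illustration; they are picked so that five strided accesses fall within one cache line, as in the figure, and the sketch assumes a positive stride.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; the text mentions 64 or 128)


def line_of(addr, line_size=LINE_SIZE):
    """Return the index of the cache line that contains addr."""
    return addr // line_size


def standard_stride_prefetch(demands, stride, line_size=LINE_SIZE):
    """Model standard stride prefetching (hypothetical helper).

    For each demand address, attempt a prefetch of addr + stride.
    The attempt is void (None) when the target maps to the same
    cache line as the current demand access.
    Returns a list of (demand_address, prefetched_line_or_None).
    """
    events = []
    for addr in demands:
        current = line_of(addr, line_size)
        target = line_of(addr + stride, line_size)
        events.append((addr, target if target != current else None))
    return events


# Illustrative values only: X starts cache line 1, so X-d lies in line 0.
X, d = 64, 14
demands = [X + k * d for k in range(-1, 6)]  # X-d, X, X+d, ..., X+5d
events = standard_stride_prefetch(demands, d)
```

With these assumed values, only the demands for X−d and X+4d produce a prefetch (for lines 1 and 2, respectively); every attempt in between is void, so the prefetch for the next line is issued just one stride before that line is demanded.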
A conventional mechanism of multiple prefetch degrees may lessen the timing issue. The prefetch degree is sometimes also referred to as the prefetch depth. FIG. 3 illustrates a case of three degrees of prefetching applied to the reference stream of FIG. 2. When a demand access X−d 302 happens, a prefetch engine issues prefetches not only for the cache line L+1 304 but also for L+2 306 and L+3 308 at the same time. This approach brings the cache lines L+2 306 and L+3 308 into the target cache earlier than the standard stride prefetching described in reference to FIG. 2. However, the effectiveness of this scheme is limited by the prefetching accuracy, as many bytes of data that are prefetched may not be needed.
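The trade-off described for FIG. 3 can be sketched as follows. This is an illustrative model only: the helper names, the 64-byte line size, the positive stride, and the example reference stream (which is assumed to end inside cache line L+1) are not taken from the figures.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)


def multi_degree_prefetch(demands, stride, degree, line_size=LINE_SIZE):
    """Model multi-degree stride prefetching (hypothetical helper).

    When addr + stride crosses into a new cache line, issue prefetches
    for that line and the following degree - 1 lines at the same time.
    Assumes a positive stride. Returns issued lines in issue order.
    """
    issued = []
    for addr in demands:
        first = (addr + stride) // line_size
        if first == addr // line_size:
            continue  # void attempt: target is in the current line
        for line in range(first, first + degree):
            if line not in issued:
                issued.append(line)
    return issued


def prefetch_accuracy(demands, issued, line_size=LINE_SIZE):
    """Fraction of prefetched lines that the stream actually demands."""
    if not issued:
        return 0.0
    demanded = {addr // line_size for addr in demands}
    return sum(1 for line in issued if line in demanded) / len(issued)


# Illustrative stream that stops inside line 1 (standing in for L+1).
stream = [50, 64, 78, 92]
issued_deg3 = multi_degree_prefetch(stream, stride=14, degree=3)
issued_deg1 = multi_degree_prefetch(stream, stride=14, degree=1)
acc_deg3 = prefetch_accuracy(stream, issued_deg3)
acc_deg1 = prefetch_accuracy(stream, issued_deg1)
```

In this assumed stream, a degree of three issues three lines on the first demand, but only one of them is ever referenced, whereas a degree of one issues a single line that is used. The deeper setting gains timeliness only if the stream really extends into the later lines.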
As shown in FIG. 3, considerable speculation is involved in bringing the cache lines L+2 306 and L+3 308 into the target cache, because there may not be enough evidence to show that the reference stream actually extends to the cache lines L+2 306 and L+3 308. One of the biggest drawbacks of data prefetching is not late prefetching but inaccurate prefetching. Inaccurate prefetching can hurt system performance for multiple reasons. For example, inaccurately prefetched data can evict useful cache blocks (either demand blocks or accurately prefetched blocks) while they are still needed. Such premature replacement of useful blocks increases not only cache misses but also bus traffic, because the system needs to bring the evicted blocks back into the cache. Inaccurate prefetching also occupies the bus while transferring data from lower-level caches (caches that are further from the processor and closer to memory) into the target cache (a cache that is closer to the processor and further from memory). Meanwhile, useful demanded blocks cannot use the bus, resulting in a delivery delay.
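The premature-replacement effect can be made concrete with a small sketch. The two-line cache capacity, the least-recently-used (LRU) replacement policy, the class and function names, and the trace below are all assumptions chosen for illustration; the text does not specify a replacement policy.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache with LRU replacement (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first

    def touch(self, line):
        """Insert or refresh a line; return the evicted line, if any."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return None
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            evicted, _ = self.lines.popitem(last=False)
            return evicted
        return None


def count_demand_misses(capacity, trace):
    """Count demand misses for a trace of (kind, line) pairs.

    kind is "demand" or "prefetch"; prefetches fill the cache but
    are not themselves counted as misses.
    """
    cache = LRUCache(capacity)
    misses = 0
    for kind, line in trace:
        if kind == "demand" and line not in cache.lines:
            misses += 1
        cache.touch(line)
    return misses


# Lines 10 and 11 are useful; lines 20 and 21 are inaccurate prefetches.
clean = [("demand", 10), ("demand", 11), ("demand", 10), ("demand", 11)]
polluted = [("demand", 10), ("demand", 11),
            ("prefetch", 20), ("prefetch", 21),
            ("demand", 10), ("demand", 11)]
clean_misses = count_demand_misses(2, clean)
polluted_misses = count_demand_misses(2, polluted)
```

Under these assumptions, the two inaccurate prefetches evict both useful lines, so the polluted trace takes twice as many demand misses as the clean one, and each extra miss implies an additional bus transfer to bring the evicted line back.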
In summary, multiple-degree prefetching trades accuracy for timeliness, which can be problematic, especially when the target cache is small relative to what major workloads require, which is the most common case for both uniprocessor and multiprocessor systems. Such a choice can also negatively impact performance when bus bandwidth is scarce, which is the case for a multiprocessor with local caches connected through a shared bus. Accordingly, there is a need in the art to enhance timeliness of cache memory prefetching.