1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to a method and apparatus for selectively prefetching based on resource availability.
2. Related Art
Hardware prefetching (“automatic prefetching”) and software prefetching (“explicit prefetching”) are two well-known techniques for enhancing the performance of the caches in computer systems (where “caches” include hardware structures such as data caches, instruction caches, and Translation Lookaside Buffers (TLBs)).
In a computer system that uses hardware prefetching, system hardware monitors runtime data access patterns and uses the access patterns to make predictions about future loads and stores. Based on these predictions, the system hardware issues automatic prefetch requests in anticipation of accesses.
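The pattern-monitoring behavior described above can be illustrated with a toy stride detector. This is only a sketch under stated assumptions: the class name `StridePrefetcher`, the helper `run`, and the rule of predicting after two matching strides are hypothetical choices for illustration and do not describe any particular processor's prefetch hardware.

```python
class StridePrefetcher:
    """Toy stride detector (hypothetical): predicts the next address once
    the same non-zero stride is seen on two consecutive accesses."""

    def __init__(self):
        self.last_addr = None    # most recent demand address
        self.last_stride = None  # stride between the two most recent accesses

    def observe(self, addr):
        """Record a demand access; return an address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Predict only when the stride repeats, i.e. a pattern is visible.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def run(addresses):
    """Feed a sequence of addresses and collect the prefetch decisions."""
    prefetcher = StridePrefetcher()
    return [prefetcher.observe(addr) for addr in addresses]
```

For a regular 64-byte stride, `run([0, 64, 128, 192])` returns `[None, None, 192, 256]`: no prediction is made until the stride has repeated. For an irregular sequence such as `[0, 64, 200, 8]`, no prefetch is issued at all.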
In a computer system that uses software prefetching, software applications are tuned by inserting explicit prefetch instructions into the executable code in order to minimize the number of cache misses by subsequent load and store operations. These prefetch instructions can be placed so that they complement the automatic (hardware) prefetch behavior of the processor on which the software is executing.
In principle, using these two techniques should result in a significant performance improvement. In practice, however, the techniques have several limitations that significantly reduce the expected improvement. In fact, these limitations are so significant that they have impeded the widespread adoption of these techniques.
One such limitation is “processor implementation sensitivity,” which occurs because the use of prefetches is highly dependent on the implementation details of the processor on which a software application is running. For example, the sizes of the caches, the number of ways in each cache, the load-to-use latency of a main memory access, and the clock frequency of the processor are processor implementation details that can affect the use of prefetches.
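One way to see this sensitivity is through the prefetch distance, that is, how many loop iterations ahead a prefetch must be issued so that the data arrives before it is used. The sketch below uses a common rule of thumb (memory latency in cycles divided by the cycles per loop iteration, rounded up). The function name `prefetch_distance`, its parameters, and the numeric values in the usage note are assumptions for illustration and are not taken from any specific processor.

```python
import math


def prefetch_distance(memory_latency_ns, clock_ghz, cycles_per_iteration):
    """Estimate how many iterations ahead to prefetch (rule-of-thumb sketch).

    memory_latency_ns    -- assumed load-to-use latency of main memory
    clock_ghz            -- assumed processor clock frequency
    cycles_per_iteration -- assumed cost of one loop iteration in cycles
    """
    # Nanoseconds times GHz (cycles per nanosecond) gives latency in cycles.
    latency_cycles = memory_latency_ns * clock_ghz
    # Round up so the prefetched data arrives no later than its first use.
    return math.ceil(latency_cycles / cycles_per_iteration)
```

With a hypothetical 100 ns memory latency and 10 cycles per iteration, `prefetch_distance(100, 2.0, 10)` gives 20 iterations, while `prefetch_distance(100, 4.0, 10)` gives 40. A distance compiled into a binary for one of these variants is therefore mistuned for the other.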
Processor implementation sensitivity is apparent in many commercially available software applications. Because software suppliers typically deliver only a single executable binary, which is designed to be executed on a variety of different processors from the same processor family, the prefetch requests are not optimized for each possible processor implementation. Instead, the suppliers either (1) choose the most popular variant within the processor family and optimize for that variant, or (2) deliver a binary that is suboptimal for any particular processor variant, but maximizes the average performance gain across the whole family.
A second limitation is “workload insensitivity,” which occurs in computer systems where more than one virtual or physical processor sends prefetch requests to a shared cache. For a single application (or a single execution thread), aggressive prefetching can produce considerable performance gains because the cache is dedicated to that application. In a multi-application system, however, two or more executables time-share a processor and that processor's caches. In this case, aggressive prefetching by one application can displace data in the shared cache that another application is actively using. In fact, multiple executables, each performing self-interested prefetching, can cause so much interference in the cache that the resulting performance is significantly worse than if each executable were run sequentially, a phenomenon known as “thrashing in the cache.”
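The thrashing effect can be reproduced with a small simulation of a shared cache. The sketch below assumes a fully associative cache with least-recently-used (LRU) replacement and models each application's footprint (demand lines plus prefetched lines) as a fixed working set that it touches repeatedly. The names `LRUCache`, `count_misses`, `working_set`, and `interleave`, and all sizes, are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with LRU replacement (illustrative model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # hit: mark as most recently used
        else:
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def count_misses(trace, capacity):
    """Replay a trace of cache-line identifiers and return the miss count."""
    cache = LRUCache(capacity)
    for line in trace:
        cache.access(line)
    return cache.misses


def working_set(tag, size, rounds):
    """Trace of one application sweeping its working set `rounds` times."""
    return [(tag, i) for _ in range(rounds) for i in range(size)]


def interleave(trace_a, trace_b):
    """Alternate accesses from two equally long traces (time-sharing)."""
    return [line for pair in zip(trace_a, trace_b) for line in pair]
```

With an 8-line cache and two applications that each sweep a 6-line working set ten times, running them one after the other (`count_misses(a + b, 8)`) incurs only the 12 cold misses. Interleaving them (`count_misses(interleave(a, b), 8)`) makes the combined 12-line footprint exceed the cache, and all 120 accesses miss.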
A third limitation is “shared resource insensitivity,” which occurs because modern processors contain multiple processor cores that share system resources in a complex arrangement. Caches are one example of a system resource that is particularly vulnerable to shared resource complications. Because of the technical difficulties involved in one processor core assessing the cache footprints of the other processor cores, each processor core may operate without knowing how much space is available in the cache. Therefore, the processor cores must typically either restrict their use of aggressive prefetching or risk overloading the cache and causing thrashing.
Although cache performance is affected by shared resource insensitivity, caches are not the only system resource affected by this condition. The processor cores also share other resources, such as the system bus. Aggressive speculative prefetching may load these system resources with counterproductive or unnecessary work, hampering the efficient operation of the computer system.
A fourth limitation is “competitive prefetching interference,” which occurs in systems that combine hardware and software prefetching. In pathological cases, hardware and software prefetching can actually interfere with each other. Competitive prefetching interference occurs when the software issues prefetch requests that the hardware has already issued, or when the combined false miss rate of hardware and software prefetching exceeds the threshold at which actively used cache lines begin to be evicted, thereby causing cache thrashing.
Hence, what is needed is a method and apparatus that allow prefetching to be used aggressively without causing system-level performance degradation.