1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to mechanisms and methods for prefetching data in multiprocessor computer systems.
2. Description of the Related Art
Cache-based computer architectures are typically associated with various features to support efficient utilization of the cache memory. A cache memory is a high-speed memory unit interposed in a memory hierarchy between a slower system memory and the microprocessor to improve effective memory transfer rates and, accordingly, improve system performance. The name refers to the fact that the small memory unit is essentially hidden and appears transparent to the user, who is aware only of a larger system memory.
An important consideration in the design of a cache memory subsystem is the choice of key design parameters, such as cache line size, degree of subblocking, cache associativity, and prefetch strategy. The difficulty in finding an “optimum setting” for these design parameters is that improving one property may degrade others. For example, an excessively small cache line may result in a relatively high number of capacity misses and in relatively high address traffic. A slightly longer cache line often decreases the cache miss rate and address traffic, while the data bandwidth increases. Enlarging the cache lines even more can result in increased data traffic as well as increased address traffic, since misses caused by false sharing may start to dominate. A further complication is that application behavior can differ greatly. A setting which works well for one application may work poorly for another.
Much research effort has been devoted to reducing the number of cache misses using various latency-hiding and latency-avoiding techniques, such as prefetching. Numerous prefetching schemes have been proposed, both software-based and hardware-based.
Software prefetching relies on inserting prefetch instructions in the code. This adds instruction overhead and generates additional address traffic and snoop lookups.
Hardware prefetching techniques require hardware modifications to the cache controller to speculatively bring additional data into the cache. They often rely on detecting regularly accessed strides. A common approach to avoid unnecessary prefetches in multiprocessors is to adapt the amount of prefetching at run time. These proposals introduce small caches that detect the efficiency of prefetches based on the data structure accessed. Systems have been proposed to predict the instruction stream with a look-ahead program counter. A cache-like reference predictor table may be used to keep previous predictions of instructions. Correct branch prediction is needed for successful prefetching.
Another hardware prefetch approach is to exploit spatial locality by fetching data close to the originally used cache line. A larger cache line size can achieve this. Unfortunately, enlarging the cache line size is not as efficient in multiprocessor systems as in uniprocessor systems since it can lead to a large amount of false sharing and an increase in data traffic. The influence of cache line size on cache miss rate and data traffic has been the subject of considerable research. To avoid false sharing and at the same time take advantage of spatial locality, sequential prefetching fetches a number of cache lines having consecutive addresses on a read cache miss. The number of additional cache lines to fetch on each miss is called the prefetch degree.
A fixed sequential prefetch scheme issues prefetches to the K consecutive cache lines on each cache read miss. If the consecutive cache lines are not already present in a readable state in the cache, a prefetch message for each missing cache line is generated on the interconnect. The prefetch degree K is fixed to a positive integer in this scheme.
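The fixed scheme described above can be illustrated with a minimal sketch. It is not taken from any particular implementation: the function name `fixed_sequential_prefetch`, the 64-byte line size, and the use of a set of line base addresses to stand for "present in a readable state" are all assumptions made for illustration.

```python
LINE_SIZE = 64  # bytes per cache line; an assumed value, not from the text


def fixed_sequential_prefetch(miss_addr, readable_lines, k, line_size=LINE_SIZE):
    """Return the line addresses to prefetch after a read miss at miss_addr.

    readable_lines is a hypothetical stand-in for the cache state: a set of
    line base addresses already present in a readable state.
    k is the fixed prefetch degree (a positive integer).
    """
    # Align the missing address down to the start of its cache line.
    base = (miss_addr // line_size) * line_size
    # The K cache lines that follow the missing line in address order.
    candidates = [base + i * line_size for i in range(1, k + 1)]
    # A prefetch message is generated only for lines not already readable.
    return [addr for addr in candidates if addr not in readable_lines]
```

For example, with K = 3, a read miss at address 0x1008, and line 0x1040 already readable, the sketch returns prefetches for lines 0x1080 and 0x10C0 only.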
An adaptive sequential prefetch scheme is similar to the fixed sequential prefetch scheme, except that the prefetch degree K can be varied during run time. The prefetch degree is varied based on the success of previous prefetches. For example, one approach derives an optimal value of K by counting the number of useful prefetches. The protocol uses two counters that keep track of the total number of prefetches and the number of useful accesses to prefetched cache lines. Prefetched cache lines are tagged for later detection. At every sixteenth prefetch, the useful prefetch counter is checked. If the number of useful prefetches is larger than twelve, K is incremented. K is decremented if the number of useful prefetches is lower than eight, or divided by two if fewer than three prefetches are useful. The scheme also has a method of turning prefetching back on, since no detection can be carried out if the prefetch degree is lowered such that no prefetches are performed.
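The counter-based adjustment of K described above can be sketched as follows. The class and method names are hypothetical. The text does not specify how prefetching is turned back on once K reaches zero, so this sketch simply assumes a floor of `min_k = 1` instead of modeling that mechanism, and it assumes both counters are reset after each check.

```python
class AdaptiveDegreeController:
    """Illustrative model of the two-counter adaptive prefetch degree."""

    WINDOW = 16  # the useful-prefetch counter is checked every 16 prefetches

    def __init__(self, k=1, min_k=1):
        self.k = k              # current prefetch degree K
        self.min_k = min_k      # assumed floor; stands in for the re-enable method
        self.prefetches = 0     # total prefetches issued in the current window
        self.useful = 0         # useful accesses to tagged (prefetched) lines

    def record_useful_access(self):
        """Call when a tagged, prefetched cache line is actually accessed."""
        self.useful += 1

    def record_prefetch(self):
        """Call for each prefetch issued; adjusts K on every sixteenth one."""
        self.prefetches += 1
        if self.prefetches < self.WINDOW:
            return
        if self.useful > 12:
            self.k += 1         # prefetching is paying off: raise the degree
        elif self.useful < 3:
            self.k //= 2        # almost nothing useful: halve the degree
        elif self.useful < 8:
            self.k -= 1         # mediocre usefulness: lower the degree by one
        self.k = max(self.k, self.min_k)
        # Assumption: both counters restart for the next window.
        self.prefetches = 0
        self.useful = 0
```

Starting from K = 4, a window with 13 useful prefetches raises K to 5, a following window with 5 useful prefetches lowers it to 4, and a window with 2 useful prefetches halves it to 2.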
While most existing prefetch techniques efficiently reduce the number of cache misses, they also increase the address traffic and snoop lookups, which are scarce resources in a shared-memory multiprocessor. This is especially true for systems based on snooping coherence, where each device has to perform a cache lookup for every global address transaction. The address networks of systems based on directory coherence are more scalable, since the address transactions are sent point-to-point. Still, systems based on snooping are often preferred because of their superior cache-to-cache transfer time. There is typically no difference in scalability of the data network between systems based on snooping coherence and systems based on directory coherence, since data packets can be sent point-to-point in both cases. Indeed, many commercial snoop-based systems have been built where the data network handles 50 percent more traffic than the available snoop bandwidth supports.
Thus, although various prefetch strategies have been successful in reducing the miss penalty in multiprocessing systems, it would be desirable to implement prefetching in a manner that reduces the cache miss rate without causing appreciable increases in address traffic and snoop lookups.