1. Field of the Invention
The present invention relates to the field of computers. More specifically, the present invention relates to computer architecture.
2. Description of the Related Art
Missing memory accesses have a large impact on overall performance in applications. The rapid increase in processor speed has generally led to latency that ranges from many hundreds to even thousands of processor clock cycles until an off-chip request is completed. With such substantial latency, relatively small miss rates, even 1% or lower, can significantly degrade application performance.
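By way of illustration only, the effect of a small miss rate can be estimated with a simple model. The function name, the base cycles-per-instruction (CPI) value of 1.0, the 25% load fraction, and the 1000-cycle latency below are assumptions chosen for the example and do not describe any particular processor. The model further assumes that every missing load stalls the processor for the full off-chip latency, with no overlap between misses.

```python
def slowdown_from_misses(base_cpi, load_fraction, miss_rate, miss_latency_cycles):
    """Return the factor by which execution time grows due to missing loads.

    Simplified model (an assumption): each missing load stalls the processor
    for the full off-chip latency, and misses never overlap.
    """
    # Average stall cycles added to every instruction by missing loads.
    stall_cpi = load_fraction * miss_rate * miss_latency_cycles
    return (base_cpi + stall_cpi) / base_cpi


if __name__ == "__main__":
    # Hypothetical workload: 25% of instructions are loads, 1% of loads miss,
    # and each miss costs 1000 cycles.
    factor = slowdown_from_misses(1.0, 0.25, 0.01, 1000)
    print(f"slowdown factor: {factor:.2f}")
```

Under these assumed figures, a 1% miss rate adds 2.5 stall cycles to every instruction, so execution takes roughly three and a half times as long as it would with no misses.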
These missing memory accesses typically result from load instructions that miss in on-chip cache (e.g., L1 or L2 cache), and are serviced from off-chip cache or main memory. The load instructions that typically miss can be classified as clustered missing loads, distributed missing loads, or dependent missing loads. Clustered missing loads are those loads that miss and that are clustered together in the application with other loads that miss (i.e., given the current instruction is a missing load, there is a good probability that additional missing loads will be located in the next few hundred instructions to be issued). Distributed missing loads are those loads that are distributed throughout the application, without being close to other missing loads. Dependent missing loads are those loads with data sources or source addresses that are dependent on the data returned by previous missing loads (i.e., the missing loads are chained together). Dependent loads commonly occur during a linked list traversal.
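The three classes described above can be sketched as a small classifier. This is an illustrative sketch only: the function name, the `depends_on` input, and the 250-instruction clustering distance are hypothetical, and the sketch labels a load as dependent whenever a dependency is recorded, even if that load is also close to other missing loads.

```python
def classify_missing_loads(miss_positions, depends_on, cluster_window=250):
    """Label each missing load as 'dependent', 'clustered', or 'distributed'.

    miss_positions: sorted instruction indices of loads that miss.
    depends_on: dict mapping a missing load's position to the position of the
        earlier missing load whose returned data it needs (hypothetical input).
    cluster_window: assumed distance, in instructions, within which two
        missing loads are treated as clustered.
    """
    labels = {}
    for i, pos in enumerate(miss_positions):
        if pos in depends_on:
            # Address or data source comes from a previous missing load.
            labels[pos] = "dependent"
            continue
        # Distances to the nearest missing loads before and after this one.
        gaps = []
        if i > 0:
            gaps.append(pos - miss_positions[i - 1])
        if i + 1 < len(miss_positions):
            gaps.append(miss_positions[i + 1] - pos)
        near = any(gap <= cluster_window for gap in gaps)
        labels[pos] = "clustered" if near else "distributed"
    return labels


if __name__ == "__main__":
    positions = [100, 150, 200, 5000, 9000, 9050]
    # The load at 9050 uses the value returned by the load at 9000,
    # as in one step of a linked list traversal.
    print(classify_missing_loads(positions, depends_on={9050: 9000}))
```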
If a memory access pattern is predictable, the loads are clustered, and the data sources of the loads are independent, it is possible for aggressive optimizing compilers, such as those provided by Sun Microsystems, Inc., to effectively insert prefetch instructions. The prefetch instructions speculatively move data from load data sources into low-latency cache, such as on-chip cache, in advance of the use of the data. Because the data is already present in the low-latency cache when the load executes, the apparent latency of the load is reduced.
These optimizing compilers rely on profile data to insert prefetching instructions. In compiler-based prefetching, the compiler predicts a priori which loads are likely to miss the processor's off-chip caches. If the source address of the predicted missing load is known far in advance and the missing load is likely to be executed frequently, then the compiler inserts a software prefetch instruction for each missing load. Even advanced optimizing compilers rely on feedback profiling techniques to determine which memory accesses represent good candidates for prefetching.
Furthermore, automatic prefetch insertion by the compiler requires that the user leverage the appropriate optimization flags to enable prefetch insertion. Automatic prefetch insertion may also require the user to perform training runs to enable feedback profiling techniques. However, research has shown that the majority of users do not use the more aggressive optimization capabilities of the compiler, thus limiting the practical effectiveness of compiler-based prefetching in independent software vendor applications.
Hardware and software scouting techniques have been developed that address some of the limitations of compiler-based prefetching. Runahead execution (e.g., hardware scouting or software scouting) involves the processor “scouting,” or speculatively executing, past a missing load. When a load instruction misses, the processor switches from a normal execution mode into a hardware scouting mode. In scouting mode, the processor speculatively issues instructions from the predicted path of execution in an attempt to discover additional missing loads. When the data for the triggering load returns, the speculative path is aborted and execution continues at the use of the original missing load. By discovering the subsequent missing loads and prefetching for them, the subsequent missing loads can be processed in parallel with the first missing load, rather than in series, as would be the case for a processor without hardware scouting support. Hence, the total time expended waiting for a load request to be serviced from off-chip memory is reduced.
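The benefit of servicing misses in parallel rather than in series can be approximated as follows. This is a rough sketch under stated assumptions: the function names, the 500-cycle latency, and the 20-cycle spacing between prefetch requests are hypothetical, the missing loads are assumed to be independent and discovered within one scouting episode, and the memory system is assumed to service all outstanding requests concurrently.

```python
def stall_cycles_serial(num_misses, miss_latency_cycles):
    """Stall time when each miss is serviced only after the previous one."""
    return num_misses * miss_latency_cycles


def stall_cycles_with_scouting(num_misses, miss_latency_cycles, issue_gap_cycles):
    """Rough upper bound on stall time when scouting overlaps the misses.

    The first miss pays the full latency. Each later miss is prefetched
    issue_gap_cycles after the previous request (an assumed spacing), so the
    last request completes (num_misses - 1) * issue_gap_cycles after the first.
    """
    if num_misses == 0:
        return 0
    return miss_latency_cycles + (num_misses - 1) * issue_gap_cycles


if __name__ == "__main__":
    serial = stall_cycles_serial(3, 500)
    overlapped = stall_cycles_with_scouting(3, 500, 20)
    print(f"serial: {serial} cycles, overlapped: {overlapped} cycles")
```

With these assumed values, three misses cost 1500 stall cycles in series but at most about 540 cycles when the second and third requests overlap with the first.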
Software scouting is similar to hardware scouting, except that the speculative execution and prefetching are performed using a code path that is generated by a static runtime compiler. The code path represents a possibly distilled forward path starting from the missing load. The code path contains those instructions necessary to calculate correct control flow and generate missing load source addresses and corresponding prefetches. A software scouting thread may either execute speculatively on an independent thread that runs in parallel with a master thread, or may be launched after the master thread is stalled by a missing load.
Both hardware and software scouting speculatively issue instructions from a predicted path of execution starting from the use of a missing load. Both scouting techniques can only execute a certain number of instructions before data for the missing load that triggered the scouting returns. This constraint limits the ability of scouting to look ahead and discover additional missing loads. Consequently, scouting is limited to discovering additional missing loads that are clustered with the triggering load. Additionally, as scouting moves further ahead, the probability that the projected execution path is still correct decreases, and the potential for prefetching source addresses that are not needed increases.
For example, assume a scout window of 250 instructions (i.e., a scouting mechanism is capable of executing 250 instructions before return of data). In a first scenario, a sequence of instructions includes 6 missing independent loads. The first missing load triggers transition of execution from normal mode to scout mode. During scout mode, the second and third missing loads (clustered loads) are encountered and, from the perspective of normal execution mode, prefetched. After a series of instructions, the fourth missing load triggers transition to scout mode. During scout mode, the fifth and sixth missing loads (clustered loads) are encountered and prefetching is performed for these missing loads. In the first scenario, the scouting mechanism has prefetched for 4 of the 6 missing loads and eliminated their corresponding miss penalties.
However, scouting cannot prefetch for dependent loads, despite clustering of the missing load instances. The first scenario assumed the missing load instances were independent. If the second missing load instance is dependent on an address requested by the first missing load instance, then scouting cannot prefetch for the second missing load instance because the address is unknown.
In a second scenario, a sequence of instructions includes three instances of missing loads. However, the missing load instances are distributed throughout the sequence of instructions, and each of the missing load instances falls outside of the scout windows. The first missing load instance triggers a first transition to scout mode. After transition back to normal mode, the second missing load instance is encountered and triggers transition to scout mode. Once this scout mode ends and execution transitions back to normal mode, the third missing load instance is encountered. In this second scenario, scouting fails to discover the distributed missing load instances because they fall outside of the scout windows. Moreover, the divergence of the projected execution path during scouting from the correct execution path (e.g., due to branch mispredictions or the diminishing probability that the values required for the address computation of distant missing loads are available) further limits the effectiveness of scouting with respect to distributed missing load instances.
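The scenarios above, including the dependent-load case, can be reproduced with a small simulation. This is a simplified sketch rather than a description of any actual scouting hardware: the function name, the instruction positions, and the `depends_on` input are hypothetical, the scout window is measured in instructions from the triggering load, and the projected execution path is assumed to always be correct.

```python
def simulate_scouting(miss_positions, scout_window=250, depends_on=None):
    """Return (triggering, prefetched) lists of missing-load positions.

    miss_positions: sorted, non-negative instruction indices of loads that
        would miss without prefetching.
    scout_window: assumed number of instructions that scouting can cover
        before the data for the triggering load returns.
    depends_on: optional dict mapping a load's position to the position of
        the missing load that produces its address (hypothetical input).
    """
    depends_on = depends_on or {}
    triggering, prefetched = [], []
    window_end = -1      # last instruction covered by the current scout window
    outstanding = set()  # misses whose data has not returned in this window
    for pos in miss_positions:
        producer = depends_on.get(pos)
        # A load can be prefetched only if it lies inside the window and its
        # address does not depend on data that is still outstanding.
        if pos <= window_end and producer not in outstanding:
            prefetched.append(pos)
            outstanding.add(pos)
        else:
            # The load misses in normal mode and starts a new scout window.
            triggering.append(pos)
            window_end = pos + scout_window
            outstanding = {pos}
    return triggering, prefetched


if __name__ == "__main__":
    # First scenario: six independent misses in two clusters.
    print(simulate_scouting([0, 100, 200, 1000, 1100, 1200]))
    # Dependent case: the second load needs the value returned by the first.
    print(simulate_scouting([0, 100], depends_on={100: 0}))
    # Second scenario: three misses spaced farther apart than the window.
    print(simulate_scouting([0, 1000, 2000]))
```

With the assumed positions, the first call reports four prefetched loads out of six, while the dependent and distributed cases report none, matching the outcomes described for each scenario.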
Some techniques employ hardware context-based prefetch predictors and Markov prefetch predictors to eliminate L2 cache misses from clustered missing loads, distributed missing loads, and dependent missing loads. These predictors use certain components of the current state of the processor and perhaps the history leading up to the current state to make predictions about the source address of the next missing load (next missing address). For instance, a typical Markov predictor maps the current missing address to the next predicted missing address that may be prefetched. In general, each predicted missing address is associated with an entry in prefetch prediction tables. Unfortunately, these context/Markov prefetch predictors generally require large prediction tables (often many megabytes in size). Furthermore, the techniques that utilize these hardware prefetch predictors reserve large portions of on-chip cache for prediction tables. Accordingly, a technique is desired that addresses the latency penalties of missing loads that are not discovered by conventional scouting techniques and that is not burdened by impractical or undesirable prediction tables.
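The mapping performed by a Markov predictor can be sketched in software as follows. This is an illustrative simplification, not a description of any particular hardware design: the class and method names and the addresses are hypothetical, and the sketch keeps only the most recent successor for each missing address, whereas an actual predictor may track several candidates.

```python
class MarkovPrefetchPredictor:
    """Maps each missing address to the missing address that last followed it."""

    def __init__(self):
        self.table = {}        # missing address -> predicted next missing address
        self.last_miss = None  # most recently observed missing address

    def observe_miss(self, address):
        """Record a miss and return the predicted next missing address.

        Returns None when the table has no entry for this address.
        """
        if self.last_miss is not None:
            # Learn that `address` followed the previous missing address.
            self.table[self.last_miss] = address
        self.last_miss = address
        return self.table.get(address)


if __name__ == "__main__":
    predictor = MarkovPrefetchPredictor()
    # Hypothetical miss stream in which a three-address pattern repeats.
    for addr in [0x100, 0x740, 0x2C0, 0x100, 0x740]:
        prediction = predictor.observe_miss(addr)
        shown = hex(prediction) if prediction is not None else "none"
        print(f"miss at {hex(addr)} -> prefetch candidate: {shown}")
    print(f"table entries: {len(predictor.table)}")
```

Because the table in this sketch holds one entry per distinct missing address, its size grows with the number of distinct addresses that miss, which is consistent with the large prediction tables noted above.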