1. Field of the Invention
The present invention relates in general to a method and apparatus for managing processor load instructions, which combines load-value prediction with checkpointed early load retirement to handle long-latency and other performance-reducing loads more efficiently.
2. Description of the Background Art
Modern processors typically retire instructions in program order. In-order retirement enables precise bookkeeping of the architectural state, effectively making out-of-order execution transparent to the user. When, for example, an instruction raises an exception, the processor continues to retire instructions up to the excepting one. At that point, the processor's architectural state reflects all the updates made by preceding instructions, and none of the updates made by the excepting instruction or its successors. Then, the exception handler is invoked.
In-order retirement also means that an unresolved long-latency instruction may remain at the processor's retirement stage for many cycles. This is often the case for loads that miss in the cache hierarchy, whose penalty is already severe in today's processors and is bound to worsen in future systems as the processor-memory speed gap increases. These long-latency loads may hinder processor performance in two main ways. First, because their results are not available for many cycles, potentially long chains of dependent instructions may be blocked for a long time. Second, because instruction retirement is effectively disabled by these long-latency memory operations, executed instructions hold on to critical resources for many cycles. Upon running out of resources, the processor stops fetching new instructions and eventually stalls.
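The second effect described above can be illustrated with a minimal sketch in Python. It is a toy model under stated assumptions, not a description of any real processor: one instruction is fetched per cycle, any number of completed instructions may retire per cycle, and the reorder buffer is the only limited resource. The function name `simulate` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque


def simulate(latencies, rob_size):
    """Toy in-order-retirement model (illustrative assumptions only).

    latencies[i] is the number of cycles instruction i needs, after
    entering the reorder buffer, before its result is ready.
    Returns (total_cycles, fetch_stall_cycles).
    """
    rob = deque()          # entries are (instruction index, ready cycle)
    next_fetch = 0         # index of the next instruction to fetch
    retired = 0
    fetch_stalls = 0
    cycle = 0
    n = len(latencies)

    while retired < n:
        # Retire strictly in program order: only the head may leave,
        # and only once its result is ready.
        while rob and rob[0][1] <= cycle:
            rob.popleft()
            retired += 1

        # Fetch at most one instruction per cycle if an entry is free.
        if next_fetch < n:
            if len(rob) < rob_size:
                rob.append((next_fetch, cycle + latencies[next_fetch]))
                next_fetch += 1
            else:
                # Buffer full behind an unresolved head: fetch stalls.
                fetch_stalls += 1

        cycle += 1

    return cycle, fetch_stalls


if __name__ == "__main__":
    print(simulate([1] * 8, rob_size=4))          # all short instructions
    print(simulate([100] + [1] * 7, rob_size=4))  # one long-latency load first
```

In this model, the younger instructions behind the long-latency load complete within a few cycles, yet none of them can retire, so the buffer fills and fetch is blocked until the load resolves.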
Conventional load-value prediction addresses the first problem by supplying a predicted value for an unresolved load. The prediction can be provided early in the processor pipeline. Dependent instructions may then execute using this prediction. Once the value comes back from memory, it is compared against the predicted value. If they match, the instruction is deemed complete; if they do not, a replay of dependent instructions (and, possibly, all instructions after the load) takes place, this time with the correct value.
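The predict, verify, and replay sequence described above can be sketched as follows. The background does not specify a particular predictor, so this sketch assumes a simple last-value predictor indexed by the load's program counter. The names `LastValuePredictor` and `execute_load` are hypothetical, and memory is modeled as a plain dictionary.

```python
class LastValuePredictor:
    """Assumed predictor: predicts that a load returns the value it
    returned the last time the same load instruction executed."""

    def __init__(self):
        self.table = {}  # load PC -> last value observed for that load

    def predict(self, pc):
        # None means the predictor has no prediction for this load.
        return self.table.get(pc)

    def update(self, pc, actual):
        self.table[pc] = actual


def execute_load(predictor, pc, memory, address):
    """Return (value, outcome) for one dynamic instance of a load.

    outcome is "no_prediction", "verified" (prediction matched, so the
    dependent work done with it stands), or "replay" (prediction was
    wrong, so dependents must re-execute with the correct value).
    """
    predicted = predictor.predict(pc)
    actual = memory[address]  # in hardware this arrives many cycles later
    predictor.update(pc, actual)

    if predicted is None:
        return actual, "no_prediction"
    if predicted == actual:
        return actual, "verified"
    return actual, "replay"


if __name__ == "__main__":
    memory = {0x10: 7}
    lvp = LastValuePredictor()
    print(execute_load(lvp, pc=0x400, memory=memory, address=0x10))
    print(execute_load(lvp, pc=0x400, memory=memory, address=0x10))
    memory[0x10] = 9
    print(execute_load(lvp, pc=0x400, memory=memory, address=0x10))
```

Note that the comparison against the actual value is unavoidable in this scheme, so the load cannot be considered complete until memory responds.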
In practice, however, the effectiveness of conventional load-value prediction is limited by the second problem: because the processor must ultimately compare the loaded and the predicted values, unresolved long-latency loads continue to clog the processor at retirement. In other words, the predicted load and all subsequent instructions in program order remain in the processor, holding precious resources such as physical registers or reorder buffer (ROB) entries until the load value is verified. If, for example, the load misses in all levels of the local cache hierarchy, it frequently blocks retirement, eventually bringing the processor to a stall. As a result, conventional load-value prediction may not be as effective with this type of load.
Runahead execution was first used to improve the data cache performance of an in-order execution core. More recently, a Runahead architecture for out-of-order processors has been proposed. The architecture “nullifies” and retires a memory operation that misses in the L2 cache and remains unresolved when it reaches the head of the ROB. It also takes a checkpoint of the architectural registers, to be used to come out of Runahead mode when the memory operation completes. The instructions that depend on the nullified operation do not execute, but are nullified in turn, and hence retire quickly. Moreover, any long-latency load encountered during Runahead execution (regardless of its position in the ROB) and its dependence chain are also nullified. Other instructions execute normally, but without overwriting data in memory. When the memory operation that triggered Runahead mode completes, the processor systematically rolls back to the checkpoint and resumes conventional execution. Although execution in Runahead mode is always discarded, it effectively warms up caches and various predictors, thereby speeding up the overall execution.
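The checkpoint, nullify, and rollback behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the proposed architecture itself: registers are a dictionary, each instruction is a hypothetical `(dest, srcs, op)` tuple, the string `"miss"` stands for a long-latency load, and cache and predictor warm-up is not modeled.

```python
INV = object()  # marker for a nullified (invalid) register value


def runahead(registers, instructions):
    """Run instructions in a simplified Runahead mode.

    instructions is a list of (dest, srcs, op) tuples, where op is the
    string "miss" for a long-latency load, or a callable applied to the
    source register values. Returns (restored_registers, nullified),
    where nullified lists destination registers whose producing
    instruction was nullified.
    """
    checkpoint = dict(registers)  # architectural checkpoint on entry
    regs = dict(registers)        # speculative Runahead state
    nullified = []

    for dest, srcs, op in instructions:
        if op == "miss" or any(regs[s] is INV for s in srcs):
            # A long-latency load, or an instruction that depends on a
            # nullified value, is nullified in turn and retires at once.
            regs[dest] = INV
            nullified.append(dest)
        else:
            # Independent instructions execute normally.
            regs[dest] = op(*(regs[s] for s in srcs))

    # The triggering miss has resolved: all Runahead results are
    # discarded and execution resumes from the checkpoint.
    return checkpoint, nullified


if __name__ == "__main__":
    regs = {"r1": 1, "r2": 2, "r3": 0, "r4": 0}
    program = [
        ("r3", [], "miss"),                 # long-latency load
        ("r4", ["r3"], lambda x: x + 1),    # depends on the load
        ("r2", ["r1"], lambda x: x * 10),   # independent of the load
    ]
    print(runahead(regs, program))
```

In this sketch the independent instruction does execute during Runahead mode, but its result is dropped at rollback along with everything else, which mirrors the statement that execution in Runahead mode is always discarded.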
Experimental results have shown that, for a considerable number of applications, blocked ROB time accounts for a very significant fraction of the total execution time in a modern out-of-order processor. Worse still, most of this blocked ROB time falls under the Blocked-Stall category; that is, the processor is not able to do any work. Moreover, although the addition of a hardware prefetcher helps noticeably in a few cases, in general the problem remains. As the processor-memory speed gap widens, a solution to alleviate the problem of long-latency misses is needed.