1. Field of the Invention
The present invention relates to techniques for improving the performance of computer systems. More specifically, the present invention relates to a method and apparatus for allocating processor resources, such as cache locations, during speculative execution using a temporal ordering policy.
2. Related Art
Advances in semiconductor fabrication technology have given rise to dramatic increases in microprocessor clock speeds. These increases have not been matched by a corresponding increase in memory access speeds. Hence, the disparity between microprocessor clock speeds and memory access speeds continues to grow, and is beginning to create significant performance problems. Execution profiles for fast microprocessor systems show that a large fraction of execution time is spent not within the microprocessor core, but within memory structures outside of the microprocessor core. In other words, these systems spend a large fraction of time waiting for memory references to complete instead of performing computational operations.
When a memory reference generates a cache miss, the subsequent access to level-two (L2) cache (or main memory) can require dozens or hundreds of clock cycles to complete, during which time the processor is typically idle, performing no useful work.
A number of techniques are presently used (or have been proposed) to hide this cache-miss latency. Some processors support out-of-order execution, in which instructions are kept in an issue queue, and are issued “out-of-order” when operands become available. Unfortunately, existing out-of-order designs have a hardware complexity that grows quadratically with the size of the issue queue. Practically speaking, this constraint limits the number of entries in the issue queue to one or two hundred, which is not sufficient to hide memory latencies as processors continue to get faster. Moreover, constraints on the number of physical registers that can be used for register renaming purposes during out-of-order execution also limit the effective size of the issue queue.
Some processor designers have proposed using speculative execution to hide the cache-miss latency. For example, if the processor encounters a stall condition, such as a cache miss, instead of waiting for the cache miss to be resolved, the processor generates a checkpoint and enters a scout mode. In scout mode, instructions are speculatively executed to prefetch future loads, but results are not committed to the architectural state of the processor. When the stall condition is finally resolved, the system uses the checkpoint to resume execution in normal-execution mode from the instruction that originally encountered the stall condition. By allowing the processor to continue to perform prefetches during stall conditions, scout mode can significantly increase the amount of work the processor completes.
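The checkpoint, scout, and resume sequence described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the design of any particular processor: the three-field instruction tuples, the `scout` function, and the `cached` set are hypothetical names introduced here, and the sketch assumes scout mode runs to the end of the instruction list rather than stopping when the stall resolves.

```python
import copy


def scout(program, launch_pc, regs, cached):
    """Toy model of scout mode (hypothetical ISA and names).

    program   -- list of (opcode, dest, operand) tuples
    launch_pc -- index of the load that encountered the stall condition
    regs      -- architectural register state (dict)
    cached    -- set of addresses assumed to be resident in the L1 cache
    """
    # Generate the checkpoint before entering scout mode.
    checkpoint = (launch_pc, copy.deepcopy(regs))

    # Speculative results go to a shadow copy and are never committed.
    shadow = dict(regs)
    prefetches = []
    for opcode, dest, operand in program[launch_pc + 1:]:
        if opcode == "LD" and operand not in cached:
            prefetches.append(operand)  # issue a prefetch for the missing line
        shadow[dest] = None             # value is speculative / unknown

    # Stall resolved: discard the shadow state and resume from the checkpoint.
    resume_pc, resume_regs = checkpoint
    return resume_pc, resume_regs, prefetches


program = [
    ("LD", "R6", "ADDR_X"),   # launch instruction (assumed to miss)
    ("LD", "R1", "ADDR_Y"),
    ("ADD", "R2", "R1"),
    ("LD", "R3", "ADDR_Z"),
]
regs = {"R1": 0, "R2": 0, "R3": 0, "R6": 0}
pc, restored, prefetched = scout(program, 0, regs, cached={"ADDR_Z"})
```

In this sketch the architectural registers returned after scout mode are identical to those at the checkpoint, and the only lasting effect is the list of prefetched addresses.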
Unfortunately, proposed systems that use scout mode do not always achieve optimal performance. In fact, there are operating conditions during which much of the performance benefit of scout mode is lost. One such condition occurs in processors that generate the checkpoint at the launch instruction (the instruction that caused the processor to enter scout mode) before commencing execution in scout mode. With this type of processor, the launch instruction is re-executed upon returning to normal-execution mode from scout mode. The cache line required by the launch instruction is re-read as the launch instruction is re-executed. A “live-lock” occurs when the required cache line is evicted during scout mode execution before returning to normal-execution mode.
For example, FIG. 1A illustrates a sequence of instructions that causes live-lock in such a processor. Note that the examples in both FIG. 1A and FIG. 1B assume: (1) a 2-way set-associative L1 cache (although both examples scale to any N-way set-associative cache); (2) that all the load instructions in these examples miss in the L1 cache, requiring a request to be sent to the L2 cache; and (3) that the addresses for the load instructions in these examples are associated with the same set within the L1 cache.
In FIG. 1A, the processor first executes LD ADDR_X, which misses in the L1 cache, causing a request for the cache line to be sent to the L2 cache. In order to avoid stalling, the processor generates a checkpoint at the LD ADDR_X instruction and commences execution in scout mode. As the processor executes the subsequent instructions in scout mode, the processor executes LD ADDR_Y, then LD ADDR_Z, and eventually LD ADDR_W. Because these load instructions all miss in the L1 cache, the processor generates a prefetch for each of the instructions. A short time after the processor sends the final prefetch (for LD ADDR_W) to the L2 cache, the cache line requested from the L2 cache for LD ADDR_X returns. The return of this cache line clears the stall condition, so the processor returns to the checkpoint to resume the execution of instructions in normal-execution mode. Unfortunately, before the processor can request the cache line required by LD ADDR_X after returning to normal-execution mode, the cache line is evicted by the return of one of the later scout mode prefetches (such as the prefetch for LD ADDR_Y). The cache line request for LD ADDR_X therefore misses in the L1 cache for a second time, again causing the processor to send a request to the L2 cache and commence execution in scout mode. This cycle can repeat indefinitely, trapping the processor in live-lock.
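One round of the FIG. 1A scenario can be approximated with a small simulation of a single cache set. The sketch below rests on assumptions that the description above does not fix: least-recently-used replacement, an initially empty set, and fills that arrive in issue order, all before the launch instruction is re-executed. The names `CacheSet` and `replay_after_scout` are hypothetical. Under these particular assumptions the line for ADDR_X is displaced by the fill for ADDR_Z; which prefetch actually causes the eviction depends on the replacement policy and on the prior contents of the set.

```python
from collections import OrderedDict


class CacheSet:
    """One set of an N-way set-associative cache (LRU replacement assumed)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # least recently used entry first

    def lookup(self, addr):
        """Return True on a hit and mark the line most recently used."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def fill(self, addr):
        """Install a returning line; return the evicted address, if any."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return None
        evicted = None
        if len(self.lines) >= self.ways:
            evicted, _ = self.lines.popitem(last=False)
        self.lines[addr] = True
        return evicted


def replay_after_scout(ways, scout_addrs, launch_addr="ADDR_X"):
    """Return (replay_hits, evicted_by) for one scout-mode episode."""
    cache = CacheSet(ways)
    evicted_by = None
    # The demand request for the launch load returns first, followed by
    # the scout-mode prefetches (assumed to return in issue order).
    for addr in [launch_addr, *scout_addrs]:
        if cache.fill(addr) == launch_addr:
            evicted_by = addr
    # Back in normal-execution mode, the launch load is re-executed.
    return cache.lookup(launch_addr), evicted_by


scout_loads = ["ADDR_Y", "ADDR_Z", "ADDR_W"]
two_way = replay_after_scout(2, scout_loads)   # line for ADDR_X is lost
four_way = replay_after_scout(4, scout_loads)  # line for ADDR_X survives
```

With two ways, the re-executed load misses again, which is the condition that restarts scout mode. With four ways and only three scout-mode prefetches, the line survives, although a longer run of prefetches to the same set would evict it in the same manner.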
Another sub-optimal scout mode operating condition occurs when a processor runs for an extended time in scout mode, causing the prefetches sent early in scout mode to be evicted from the cache by later prefetches. FIG. 1B illustrates this “early-prefetch eviction” problem in a scout mode processor. Note that in FIG. 1B the processor commences execution in scout mode on the “use” of the result of the launch instruction, instead of on the launch instruction itself, but this example also applies in a processor that generates the checkpoint at the launch instruction.
In FIG. 1B, the processor first executes LD ADDR_X, which misses in the L1 cache, causing a request for the cache line to be sent to the L2 cache. The processor then continues to execute instructions in normal-execution mode until encountering the “SUB R6, R2” instruction, the first “use” of R6. At this instruction, the processor generates a checkpoint and commences execution in scout mode. As the processor executes a number of instructions in scout mode, the processor executes LD ADDR_Y, then LD ADDR_Z, and finally LD ADDR_W. Because all these load instructions miss in the L1 cache, the processor generates a prefetch for each of them. As these prefetches return, the later prefetches begin to overwrite the earlier prefetches. Consequently, when the processor resumes operation in normal-execution mode and attempts to load the prefetched cache lines, the lines which were prefetched early in scout mode have been evicted from the cache. The processor must then request these cache lines again from the L2 cache, repeating the work performed during scout mode.
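The early-prefetch eviction effect can be illustrated by tracking which prefetched lines are still resident in the set when normal-execution mode resumes. The following is a minimal sketch under assumptions not stated in the description: the prefetched addresses are distinct, fills return in issue order, the oldest line in the set is always the one replaced, and the line for ADDR_X is ignored. The function name `surviving_prefetches` is hypothetical.

```python
from collections import deque


def surviving_prefetches(prefetch_order, ways=2):
    """Split scout-mode prefetches into (resident, evicted) at resume time.

    Assumes distinct addresses that all map to the same cache set and
    oldest-first replacement.
    """
    # A bounded deque silently drops its oldest entry when it is full,
    # which models a later fill overwriting an earlier one.
    cache_set = deque(maxlen=ways)
    for addr in prefetch_order:
        cache_set.append(addr)

    resident = [a for a in prefetch_order if a in cache_set]
    evicted = [a for a in prefetch_order if a not in cache_set]
    return resident, evicted


resident, evicted = surviving_prefetches(["ADDR_Y", "ADDR_Z", "ADDR_W"], ways=2)
```

In this model the earliest prefetch (for ADDR_Y) is no longer resident when normal-execution mode resumes, so the corresponding load would have to be requested from the L2 cache a second time.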
Although the above example applies to cache lines, other processor state information can be subject to the same problem. For example, a branch prediction in the branch prediction table can be updated early in scout mode and this update can be overwritten by a later scout mode instruction.
Hence, what is needed is a method and an apparatus for avoiding the above-described problems during operation in speculative-execution mode.