1. Technical Field
The present invention relates to methods for processing load operations, and in particular to methods for processing load operations prior to store operations that may target overlapping memory addresses.
2. Background Art
Currently available processors are capable of executing instructions at very high speeds. These processors typically implement pipelined, superscalar micro-architectures that can execute multiple instructions per clock cycle at clock frequencies approaching one gigahertz or more. In recent years, the instruction-executing capabilities of processors have begun to outstrip computer systems' capacities to provide instructions and/or data for processing.
One bottleneck in supplying the processor with data/instructions is the relatively long latency of the load operations that transfer data from the computer's memory system into the processor's registers. A typical memory system includes a hierarchy of caches and a main memory. The latency of a load depends on where in the hierarchy the targeted data is found, i.e. the cache in which the load operation "hits". For example, a load hit in the primary or first-level cache, i.e. the cache closest to the processor core, may have a latency of 1 to 2 clock cycles. Load hits in higher-level caches, further from the processor core, have larger latencies. For example, the secondary and tertiary caches may have latencies of 4 to 8 clock cycles and 10 or more clock cycles, respectively. If the data is available only from main memory, the load latency can be on the order of 100 to 200 clock cycles.
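To make the cost of the hierarchy concrete, the sketch below computes an average load latency from the latencies quoted above. It is a minimal illustration under stated assumptions: the primary, secondary and main-memory latencies take the upper ends of their quoted ranges (2, 8 and 200 cycles), the tertiary cache uses the quoted lower bound of 10 cycles, and the hit fractions passed in are hypothetical, not measured data. The name `expected_load_latency` is illustrative.

```python
# Hypothetical latency model. Latencies come from the ranges quoted in the text
# (upper ends for L1, L2 and memory; lower bound "10 or more" for L3).
LEVELS = [
    ("L1", 2),
    ("L2", 8),
    ("L3", 10),
    ("memory", 200),
]


def expected_load_latency(hit_fractions):
    """Average load latency in clock cycles.

    hit_fractions[i] is the (assumed) fraction of loads whose data is first
    found at LEVELS[i]; the fractions must sum to 1.
    """
    if len(hit_fractions) != len(LEVELS):
        raise ValueError("need one hit fraction per level")
    if abs(sum(hit_fractions) - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    # Weight each level's latency by the fraction of loads it serves.
    return sum(f * latency for f, (_, latency) in zip(hit_fractions, LEVELS))


# Hypothetical mix: 90% L1, 5% L2, 3% L3, 2% main memory.
print(round(expected_load_latency([0.90, 0.05, 0.03, 0.02]), 2))  # 6.5
```

With this hypothetical mix, the 2% of loads served from main memory contribute 4.0 of the roughly 6.5 cycles, which is why hiding main-memory latency dominates load-scheduling decisions.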
To avoid idling the processor, a compiler typically schedules load operations in a program flow well before the operation that uses the target data. Compiler scheduling occurs before the program is executed and, consequently, before any run-time information is available. As a result, store operations, which transfer data from the processor's registers into the memory system, can limit this load-scheduling strategy. If a compiler moves a load that returns data from a specified memory address ahead of a store that writes data to the same memory address, the load will return stale data. That is, the load will not observe the effects of the store that preceded the load in execution order. As long as the compiler can determine the memory addresses specified by the load and store from available information, it can determine whether it is safe to move the load ahead of the store. The process of identifying memory addresses to determine overlap is referred to as memory disambiguation.
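The hazard described above can be reproduced with a toy interpreter. This is a hypothetical model, not a processor implementation: memory is a Python dict, `run` and `compare_orders` are illustrative names, and the stored value 99 is arbitrary. Hoisting the load above the store is harmless when x and y differ, but returns stale data when they name the same address.

```python
def run(program, memory, regs):
    """Execute a tiny straight-line program against a dict-based memory.

    Each instruction is a 3-tuple:
      ("LOAD", dst_reg, addr)  -> regs[dst_reg] = memory[addr]
      ("STORE", addr, value)   -> memory[addr] = value
    """
    for op, a, b in program:
        if op == "LOAD":
            regs[a] = memory[b]
        elif op == "STORE":
            memory[a] = b
        else:
            raise ValueError(f"unknown op {op}")
    return regs


def compare_orders(x, y):
    """Return (program-order result, hoisted-load result) for STORE y; LOAD x."""
    original = [("STORE", y, 99), ("LOAD", "r1", x)]
    hoisted = [("LOAD", "r1", x), ("STORE", y, 99)]  # load moved above the store
    initial = {0x100: 1, 0x200: 2}
    r_original = run(original, dict(initial), {})["r1"]
    r_hoisted = run(hoisted, dict(initial), {})["r1"]
    return r_original, r_hoisted


print(compare_orders(0x100, 0x200))  # (1, 1): different addresses, same result
print(compare_orders(0x100, 0x100))  # (99, 1): same address, hoisted load is stale
```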
In many instances, it is not possible to disambiguate memory references at the time the corresponding load and store operations are scheduled. For example, the memory address referenced by an operation may depend on variables that are determined at run time, just before the operation is executed. For load/store pairs that cannot be disambiguated at compile time, certain advanced compilers can still reschedule the load ahead of the store using an "advanced load". In an advanced load, the load operation is scheduled ahead of a potentially conflicting store operation, and a check operation is inserted in the instruction flow following the store operation. The load and store memory references are resolved when the corresponding instructions are executed. The check operation determines whether these dynamically resolved memory references overlap and initiates a recovery procedure if they do.
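The following sketch models an advanced load and its check in software, assuming byte-addressed memory held in a `bytearray` and little-endian values; `advanced_load_then_store` and its default sizes and stored value are hypothetical. The check tests whether the two run-time byte ranges overlap rather than whether the addresses are equal, because a store can partially overwrite the bytes that a wider load returned.

```python
def overlaps(a, a_size, b, b_size):
    """True if byte ranges [a, a + a_size) and [b, b + b_size) intersect."""
    return a < b + b_size and b < a + a_size


def read(mem, addr, size):
    return int.from_bytes(mem[addr:addr + size], "little")


def write(mem, addr, size, value):
    mem[addr:addr + size] = value.to_bytes(size, "little")


def advanced_load_then_store(mem, x, y, load_size=4, store_size=4, value=0xDEADBEEF):
    """Hypothetical model of: ALOAD x; STORE y; LOAD CHECK x.

    Returns (final register value, whether recovery was needed).
    """
    r1 = read(mem, x, load_size)                 # ALOAD x, hoisted above the store
    write(mem, y, store_size, value)             # STORE y, address known at run time
    recovered = False
    if overlaps(x, load_size, y, store_size):    # LOAD CHECK x compares resolved ranges
        r1 = read(mem, x, load_size)             # recovery: re-execute the load
        recovered = True
    return r1, recovered


print(advanced_load_then_store(bytearray(64), 0, 8))       # no overlap
print(advanced_load_then_store(bytearray(64), 0, 2))       # partial overlap
```

In the second call the store overwrites only the upper two bytes returned by the load, so a check based on address equality alone would miss the conflict.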
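The instruction movement that accompanies an advanced load operation is illustrated by the following instruction sequences, where LOAD, STORE, ALOAD, and LOAD CHECK represent the load, store, advanced load, and check operations, and x and y represent the undisambiguated memory references. The left column shows the original program order and the right column shows the order after the load is advanced:

    Original          Advanced
    STORE y           ALOAD x
    LOAD x            STORE y
    ADD               LOAD CHECK x
                      ADD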
The advanced load adds a check operation (LOAD CHECK) to the program flow. The check operation takes time to complete, which can delay the time at which the ADD instruction (and any other instructions that depend on the load) retires. To fully realize the benefits of advanced loads, a processor must provide efficient mechanisms to implement the operations needed to support them. These operations include, for example, checking for a load/store conflict and, when a conflict is detected, canceling any instructions that may have used the resulting stale data, retrieving the updated data, and re-executing the canceled instructions. Delays due to inefficiencies in any of these operations can offset the benefits provided by advancing loads.
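Whether advancing a load pays off can be framed as a simple expected-value calculation. The model below is a deliberate simplification introduced for illustration, not a result from this document: with conflict probability p, hidden latency L, check cost c and recovery cost R (all in cycles), the expected gain per execution is (1 - p)L - c - pR, which reaches zero at p = (L - c)/(L + R). The function names and any parameter values are hypothetical.

```python
def expected_gain(p_conflict, hidden_latency, check_cost, recovery_cost):
    """Expected cycles saved per execution by advancing one load.

    Simplified, hypothetical model:
      - no conflict (probability 1 - p): the load latency is hidden;
      - conflict (probability p): nothing is hidden and recovery is paid;
      - the check cost is paid in both cases.
    """
    return ((1 - p_conflict) * hidden_latency
            - check_cost
            - p_conflict * recovery_cost)


def break_even_conflict_rate(hidden_latency, check_cost, recovery_cost):
    """Conflict probability at which the expected gain drops to zero.

    Solves (1 - p) * L - c - p * R = 0 for p.
    """
    return (hidden_latency - check_cost) / (hidden_latency + recovery_cost)


# Hypothetical values: 10 cycles hidden, 1-cycle check, 20-cycle recovery.
print(break_even_conflict_rate(10, 1, 20))
```

Because the break-even conflict rate falls as R grows, an expensive recovery path shrinks the set of load/store pairs for which advancing is worthwhile, consistent with the observation that inefficient recovery can offset the benefits of advanced loads.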
The present invention addresses these and other problems related to processing advanced load operations.
The present invention provides an efficient mechanism for recovering from a failed load check operation.
In accordance with the present invention, a first load operation is executed to a memory address. A subsequent load check operation checks the status of the load operation at a table entry associated with the memory address. If the status indicates that the data returned by the first load operation is stale, the load check operation is converted to a load operation and a recovery operation is implemented.
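The table-based check can be sketched as follows. The names (`AdvancedLoadTable`, `TableEntry`) and the layout are hypothetical: entries are keyed by exact memory address, so partial overlaps are not detected, and a missing entry is conservatively treated as stale, which is an assumption rather than something stated here.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    """Status of one advanced load (hypothetical layout)."""
    stale: bool = False


class AdvancedLoadTable:
    """Hypothetical table of advanced loads, keyed by exact memory address."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, mem, addr):
        # First load: allocate a fresh entry for this address and return the data.
        self.entries[addr] = TableEntry()
        return mem[addr]

    def store(self, mem, addr, value):
        # Store: update memory and mark any tracked load to this address stale.
        mem[addr] = value
        entry = self.entries.get(addr)
        if entry is not None:
            entry.stale = True

    def load_check(self, mem, addr, loaded_value):
        """Return (value, needs_recovery) for a load check at addr."""
        entry = self.entries.get(addr)
        if entry is None or entry.stale:
            # Status says the earlier data is stale (or unknown): the check is
            # converted to a load that reads current memory, and recovery is
            # signalled for dependent instructions.
            return mem[addr], True
        # Status is good: the value from the first load is still valid.
        return loaded_value, False


mem = {0x10: 5}
table = AdvancedLoadTable()
r1 = table.advanced_load(mem, 0x10)
table.store(mem, 0x20, 7)
print(table.load_check(mem, 0x10, r1))  # (5, False): unrelated store
table.store(mem, 0x10, 9)
print(table.load_check(mem, 0x10, r1))  # (9, True): check converted to a load
```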
For one embodiment of the invention, the recovery operation is implemented as a micro-architectural trap. The trap flushes the instruction pipeline and resteers the processor to the instruction following the load check operation.
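The trap's effect on in-flight instructions can be sketched as simple bookkeeping. This is a hypothetical model: instructions are identified by sequential integer PCs, the converted check is assumed to complete with the updated data, and instructions older than the check are assumed not to use the advanced load's result. Under those assumptions only younger instructions are flushed before fetch is resteered to the instruction following the check. The function name `failed_check_trap` is illustrative.

```python
def failed_check_trap(in_flight, check_pc):
    """Hypothetical bookkeeping for the trap raised by a failed load check.

    in_flight: PCs of in-flight instructions, oldest first.
    check_pc:  PC of the load check, already converted to a load that
               completes with the updated data.
    Returns (retained, flushed, resteer_pc). PCs are assumed to be sequential
    instruction indices, so the following instruction is check_pc + 1.
    """
    # Assumption: uses of the advanced load are scheduled after the check,
    # so the check and older instructions are kept.
    retained = [pc for pc in in_flight if pc <= check_pc]
    # Younger instructions may have consumed stale data: flush them.
    flushed = [pc for pc in in_flight if pc > check_pc]
    # Resume fetching at the instruction that follows the check.
    return retained, flushed, check_pc + 1


retained, flushed, resteer = failed_check_trap([12, 13, 14, 15], check_pc=13)
print(retained, flushed, resteer)  # [12, 13] [14, 15] 14
```

The flushed instructions are then refetched from the resteer point and re-execute against the corrected register value.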