1. Technical Field
The present invention relates to methods for processing load operations, and in particular to methods for processing load operations prior to store operations that may target overlapping memory addresses.
2. Background Art
Currently available processors are capable of executing instructions at very high speeds. These processors typically implement pipelined, superscalar micro-architectures that can execute multiple instructions per clock cycle at clock frequencies approaching or exceeding one gigahertz. In recent years, the instruction-executing capabilities of processors have begun to outstrip computer systems' capacities to provide instructions and/or data for processing.
One bottleneck in supplying the processor with data/instructions is the relatively long latency of the load operations that transfer data from the computer's memory system into the processor's registers. A typical memory system includes a hierarchy of caches, e.g., L0, L1, L2, ..., and a main memory. The latency of the load depends on where in the hierarchy the targeted data is found, i.e., the cache in which the load operation "hits". For example, a load hit in the L0 cache may have a latency of 1 to 2 clock cycles. Load hits in the L1 or L2 caches may have latencies of 4 to 8 clock cycles or 10 or more clock cycles, respectively. If the data is only available from main memory, the load latency can be on the order of 100-200 clock cycles.
To avoid idling the processor, a compiler typically schedules load operations in a program flow well before the operation that uses the target data. Compiler scheduling occurs before the program is executed and, consequently, before any run-time information is available. As a result, store operations, which transfer data from the processor's registers into the memory system, can limit this load-scheduling strategy. If a compiler moves a load that returns data from a specified memory address ahead of a store that writes data to the same memory address, the load will return stale data. As long as the compiler can determine the memory addresses specified by the load and store from available information, it can determine whether it is safe to move the load ahead of the store. The process of identifying memory addresses to determine overlap is referred to as memory disambiguation.
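The stale-data hazard described above can be sketched with a small model. This is an illustration only: memory is modeled as a Python dictionary, and the function names, addresses, and values are hypothetical and do not come from the specification.

```python
# Hypothetical model: memory is a dict mapping an address to a value.

def run_in_program_order(memory, store_addr, load_addr, store_value):
    """STORE executes first, then LOAD (the original program order)."""
    memory = dict(memory)              # work on a copy
    memory[store_addr] = store_value   # STORE
    return memory[load_addr]           # LOAD sees the store if addresses match


def run_with_hoisted_load(memory, store_addr, load_addr, store_value):
    """The LOAD has been moved ahead of the STORE by the scheduler."""
    memory = dict(memory)              # work on a copy
    loaded = memory[load_addr]         # LOAD (now executes first)
    memory[store_addr] = store_value   # STORE
    return loaded                      # stale if the addresses overlap


initial = {0x100: 1, 0x200: 2}

# Different addresses: hoisting the load is safe, both orders agree.
safe_a = run_in_program_order(initial, 0x100, 0x200, 9)
safe_b = run_with_hoisted_load(initial, 0x100, 0x200, 9)

# Same address: the hoisted load returns the old value instead of 9.
fresh = run_in_program_order(initial, 0x100, 0x100, 9)
stale = run_with_hoisted_load(initial, 0x100, 0x100, 9)
```

In this model the two orderings differ only when the store and load addresses coincide, which is the case memory disambiguation is meant to rule out.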
In many instances, it is not possible to disambiguate memory references at the time the corresponding load and store operations are scheduled. For example, the memory address referenced by an operation may depend on variables that are determined at run-time, just before the operation is executed. For load/store pairs that cannot be disambiguated at compile time, certain advanced compilers can still reschedule the load ahead of the store using an "advanced load". In an advanced load, the load operation is scheduled ahead of a potentially conflicting store operation, and a check operation is inserted in the instruction flow, following the store operation. The load and store memory references are resolved when the corresponding instructions are executed. The check operation determines whether these dynamically resolved memory references overlap and, if they do, initiates a recovery procedure.
The instruction movement that accompanies an advanced load operation is illustrated by the following instruction sequence, where LOAD, STORE, ALOAD, and CHECK represent the load, store, advanced load, and check operations, and x and y represent the undisambiguated memory references.
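A minimal sketch of this reordering, written as a Python model rather than as an instruction listing. It assumes the conventional address-based check described above, and it assumes that recovery simply re-executes the load. The function name, the `addend` parameter standing in for the dependent ADD, and the dictionary memory model are all hypothetical.

```python
def run_with_advanced_load(memory, x, y, store_value, addend):
    """Model of: ALOAD y ... STORE x, CHECK y, ADD.

    x and y are the memory references that could not be disambiguated
    when the code was scheduled. Returns (add_result, recovered).
    """
    memory = dict(memory)            # work on a copy

    reg = memory[y]                  # ALOAD y: load hoisted above the store
    advanced_addr = y                # address remembered for the check

    memory[x] = store_value          # STORE x: potentially conflicting store

    # CHECK y: compare the addresses resolved at run time.
    recovered = False
    if x == advanced_addr:
        reg = memory[y]              # assumed recovery: redo the load
        recovered = True

    return reg + addend, recovered   # ADD: consumer of the loaded value


initial = {0x100: 1, 0x200: 2}
no_conflict = run_with_advanced_load(initial, 0x100, 0x200, 9, 10)
conflict = run_with_advanced_load(initial, 0x100, 0x100, 9, 10)
```

When the references differ, the check passes and the ADD uses the early-loaded value. When they coincide, the check triggers recovery and the ADD uses the value written by the store.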
The advanced load adds a check operation to the program flow. The check operation takes time to complete, which can delay the time at which the ADD instruction (and any other instructions that depend on the load) is retired. Typically, operations that need to be executed quickly are implemented in hardware, since operations implemented on specially designed hardware tend to be faster than those implemented by software on a general purpose processor. In the above example, a fast check operation is necessary to avoid offsetting any latency advantage provided by the advanced load. However, hardware solutions place additional burdens on the already limited die area available on modern processors.
The present invention addresses these and other problems related to processing advanced load operations.
The present invention provides a mechanism for implementing advanced load operations without the need for significant additional hardware support.
In accordance with the present invention, an advanced load is implemented by processing a first load operation to a memory address. The first load operation is subsequently checked by comparing data in a register targeted by the first load operation with data currently at the memory address.
For one embodiment of the invention, a second load operation targets data currently at the memory address, and the data returned by the second load is compared with the data provided by the first load. The load and check operations may be scheduled by a compiler, or they may be micro-operations that are scheduled on the fly by a processor.
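The value-comparison check of this embodiment can be sketched as follows. This is a simplified model under stated assumptions, not the claimed implementation: memory is a Python dictionary, the function name is hypothetical, and the recovery step (adopting the value returned by the second load) is an assumption made for illustration.

```python
def advanced_load_with_value_check(memory, x, y, store_value):
    """First load, intervening store, then a second load used as the check.

    Returns (value_in_register, mismatch_detected).
    """
    memory = dict(memory)        # work on a copy

    first = memory[y]            # first load, scheduled ahead of the store
    memory[x] = store_value      # store that may target the same address
    second = memory[y]           # second load: data currently at the address

    mismatch = first != second   # check: compare the data, not the addresses
    if mismatch:
        first = second           # assumed recovery: adopt the current data
    return first, mismatch


initial = {0x100: 1, 0x200: 2}
distinct = advanced_load_with_value_check(initial, 0x100, 0x200, 9)
overlap = advanced_load_with_value_check(initial, 0x100, 0x100, 9)
same_value = advanced_load_with_value_check(initial, 0x100, 0x100, 1)
```

In this model the check needs no record of the first load's address, since it relies only on an ordinary second load and a compare. The last call shows that a store that rewrites the same value to the same address does not register as a mismatch in this sketch.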