1. Field of the Invention
The present invention generally relates to reordering memory operations in a processor in order to exploit instruction-level parallelism in programs and, more particularly, to an apparatus for the detection of incorrect execution of a memory load operation performed earlier than preceding (in program order) memory store operations. The invention is applicable to operations reordered when the program is generated (static reordering) as well as to operations reordered at execution time (dynamic reordering).
2. Background Description
Contemporary high-performance processors rely on superscalar, superpipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs; that is, for executing more than one instruction at a time. In general, these processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch from memory more than one instruction per cycle, and are able to dispatch for execution more than one instruction per cycle subject to dependencies and availability of resources.
The pool of instructions from which the processor selects those that are dispatched at a given point in time is enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier if the resources required by the operation are free, thus reducing the overall execution time of a program. Out-of-order execution exploits the availability of the multiple functional units by using resources otherwise idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in the original sequential order.
In the case of memory-related operations, a memory load operation reads a datum from memory, loads it in a processor register, and frequently starts a sequence of operations that depend on the datum loaded. Thus, in addition to using idle resources, the early (out-of-order) initiation of memory load operations may hide delays in accessing memory, including potential cache misses.
There are two basic approaches towards implementing out-of-order execution and reordering of results: dynamic reordering and static reordering. In dynamic reordering, the instructions are analyzed at execution time, and the instructions and results are reordered in hardware. In static reordering, a compiler/programmer analyzes and reorders the instructions and the results produced by those instructions when the program is generated, thus the reordering tasks are done in software. These two approaches can also be used jointly.
One factor that limits the ability to reorder operations is ambiguous memory references; this is the case when a memory load operation appears after a memory store operation in a sequential instruction stream, and it is not possible to determine ahead of time whether the memory locations accessed by the load and the store are different. For example, consider the following code fragment:

    *X = (a+b+2) << 4
    r = ((*Y)+c) ^ d
wherein:
*X indicates the memory location whose address is contained in X;
<< indicates a left-shift operation; and
^ indicates an exclusive-or (XOR) operation.
Assuming that a, b, c, and d are values stored in registers r1 through r4 of a processor, and that X and Y are in registers r8 and r9, then this code fragment can be represented by the following instruction sequence (wherein the first register after the name of the instruction is the target register, and the remaining registers are the operands):
______________________________________
add         r10,r1,r2    ; r10 = a+b
add         r11,r10,2    ; r11 = a+b+2
shift_left  r12,r11,4    ; r12 = (a+b+2)<<4
store       r12,(r8)     ; *X = (a+b+2)<<4
load        r20,(r9)     ; r20 = *Y
add         r21,r20,r3   ; r21 = *Y+c
xor         r22,r21,r4   ; r22 = (*Y+c) ^ d
______________________________________
If it can be determined that X and Y are different, then the two expressions can be scheduled for parallel execution, yielding a sequence like the following (wherein the symbol || denotes parallel execution):
______________________________________
add         r10,r1,r2    ||  load  r20,(r9)
add         r11,r10,2
shift_left  r12,r11,4    ||  add   r21,r20,r3
store       r12,(r8)     ||  xor   r22,r21,r4
______________________________________
In a machine with two execution units, the sequence above would take 4 cycles to complete (assuming that a load takes two cycles, and other operations take a single cycle).
On the other hand, if it cannot be determined that X and Y are always different, i.e. the addresses are ambiguous, then the two expressions should be scheduled in the original order, taking 8 cycles (assuming again that a load takes two cycles).
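The two cycle counts above can be checked with a small model. The following Python sketch is illustrative only and is not part of the apparatus described; the function name `finish_cycle`, the tuple encoding of instructions, and the timing rule (an instruction issued in cycle c with latency n makes its result usable in cycle c+n) are assumptions introduced for this example.

```python
# Latencies assumed in the text: a load takes two cycles, all else one cycle.
LATENCY = {"load": 2}

def finish_cycle(schedule):
    """Return the cycle in which the last instruction of `schedule` completes.

    `schedule` is a list of issue groups. Each group lists the instructions
    meant to issue together, as (opcode, target, source_registers) tuples.
    A group issues no earlier than the cycle after the previous group, and
    only once every source register it reads is available.
    """
    ready = {}  # register -> first cycle in which its value can be used
    cycle = 0   # issue cycle of the previous group
    last = 0    # last cycle in which any instruction is still executing
    for group in schedule:
        cycle += 1
        for _opcode, _target, sources in group:
            for reg in sources:
                # Registers never written here hold inputs, ready from cycle 1.
                cycle = max(cycle, ready.get(reg, 1))
        for opcode, target, _sources in group:
            latency = LATENCY.get(opcode, 1)
            if target is not None:
                ready[target] = cycle + latency
            last = max(last, cycle + latency - 1)
    return last

# Original program order, one instruction per cycle.
SEQUENTIAL = [
    [("add", "r10", ("r1", "r2"))],
    [("add", "r11", ("r10",))],
    [("shift_left", "r12", ("r11",))],
    [("store", None, ("r12", "r8"))],
    [("load", "r20", ("r9",))],
    [("add", "r21", ("r20", "r3"))],
    [("xor", "r22", ("r21", "r4"))],
]

# Reordered schedule for two execution units, load hoisted above the store.
PARALLEL = [
    [("add", "r10", ("r1", "r2")), ("load", "r20", ("r9",))],
    [("add", "r11", ("r10",))],
    [("shift_left", "r12", ("r11",)), ("add", "r21", ("r20", "r3"))],
    [("store", None, ("r12", "r8")), ("xor", "r22", ("r21", "r4"))],
]
```

Under these assumptions, `finish_cycle(SEQUENTIAL)` evaluates to 8 and `finish_cycle(PARALLEL)` evaluates to 4, matching the counts given in the text. The model deliberately ignores memory dependences, which is exactly the information that is unavailable when the addresses are ambiguous.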
The example above is not atypical; ambiguity in memory references limits performance fairly severely by forcing the sequential execution of operations that could otherwise be executed in parallel. However, such a serialization can be avoided (that is, the load operation can be performed earlier than the store operation) as long as the store operation does not interfere with the load operation. The operations interfere whenever the memory locations accessed by the store operation and the out-of-order load operation overlap. Moreover, if the store operation and the out-of-order load operation do not interfere, any operation that depends on the datum loaded out-of-order can also be performed out-of-order. On the other hand, if the operations interfere, the datum loaded out-of-order and any results derived from it are invalid, making it necessary to re-execute the load operation after the store operation, as well as the associated dependent operations.
Various attempts have been made towards solving the problem of reordering memory operations with ambiguous references by processors. These schemes detect interference by comparing the address of the memory location accessed by an out-of-order load with the addresses of the memory locations accessed by succeeding store operations, within a window of execution determined by the extent of the reordering of the load operation. If the addresses overlap, then it is assumed that the operations interfere, so the load operation (and those operations that depend on the load which have already been executed, if applicable) must be re-executed. That is, the mechanisms monitor whether there have been any modifications to the memory location containing a datum loaded out-of-order by keeping track of memory addresses. The detection is performed either by extra instructions (software-based schemes), or by dedicated hardware resources (hardware-based schemes), sometimes with software assistance.
For example, in the case of software-based detection of interference, the code fragment given earlier could be modified as follows:
______________________________________
r = ((*Y)+c) ^ d
*X = (a+b+2) << 4
if (X == Y)              /* compare the addresses */
    r = ((*Y)+c) ^ d
endif
______________________________________
That is, the program statements could be reordered so that the load operation implied by *Y is performed earlier than the store operation implied by *X; additional statements are introduced for comparing the addresses of the memory locations referenced by the load and the store operations, and for re-executing the statement containing the load operation whenever the addresses match.
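The equivalence of the reordered fragment and the original fragment can be illustrated as follows. This is a minimal sketch in Python, assuming memory is modeled as a dictionary from address to value and that all accesses have the same size, so that an equality test on the addresses suffices; the function names `original_fragment` and `reordered_fragment` are hypothetical and introduced only for this example.

```python
def original_fragment(mem, X, Y, a, b, c, d):
    """Program order: the store to *X precedes the load from *Y."""
    mem[X] = (a + b + 2) << 4
    return (mem[Y] + c) ^ d

def reordered_fragment(mem, X, Y, a, b, c, d):
    """Load hoisted above the store, with a software interference check."""
    r = (mem[Y] + c) ^ d           # out-of-order load and dependent operations
    mem[X] = (a + b + 2) << 4      # the store the load was moved over
    if X == Y:                     # compare the addresses
        r = (mem[Y] + c) ^ d       # interference: re-execute the load
    return r

def fragments_agree(X, Y):
    """Run both versions on identical initial memory and compare results."""
    initial = {0: 5, 4: 7}
    mem_a, mem_b = dict(initial), dict(initial)
    r_a = original_fragment(mem_a, X, Y, 1, 2, 3, 4)
    r_b = reordered_fragment(mem_b, X, Y, 1, 2, 3, 4)
    return r_a == r_b and mem_a == mem_b
```

Both `fragments_agree(0, 4)` (distinct addresses) and `fragments_agree(0, 0)` (interfering addresses) hold in this model. The cost of the scheme is visible in the extra comparison and the duplicated statement.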
In the case of static reordering, the sequence of instructions generated by the compiler/programmer differs among the various schemes proposed for dealing with ambiguous memory references. Usually, a load instruction which has been moved above a store instruction is replaced by some new instruction or instruction sequence which performs the load operation and sets up a mechanism for monitoring the addresses used by store instructions; another instruction, or an instruction field in the out-of-order load instruction, is used to indicate the place where the load instruction was originally located, which determines the end of the range of monitoring for interfering store operations.
In the case of dynamic reordering, load and store instructions are presented to the processor in program order, that is, the store instruction followed by the load instruction. The processor reorders the instructions, marks the load instruction as an out-of-order operation, sets up a mechanism for detecting interference from store operations (which includes the identification of the range of monitoring), and recovers the correct state of the processor when interference is detected.
This invention follows the approach of hardware-based detection of interference among out-of-order load and store operations, with a mechanism for recovering from the case of incorrectly reordered memory operations. A summary of relevant related art in the field is now set forth.
A method and apparatus for improving the performance of out-of-order operations is described by M. Kumar, M. Ebcioglu, and E. Kronstadt in their patent application entitled "A method and apparatus for improving performance of out-of-sequence load operations in a computer system," Ser. No. 08/320,111 filed Oct. 7, 1994, as a continuation of application Ser. No. 07/880,102 filed May 6, 1992, and assigned to the assignee of this application. This method and apparatus uses compiler techniques, four new instructions, and an address-compare unit. The compiler statically moves memory load operations over memory store operations, marking them as out-of-order instructions. The addresses of operands loaded out-of-order are saved in an associative memory. On request, the address-compare unit compares the addresses saved in the associative memory with the address generated by store operations. If a conflict is detected, recovery code is executed to correct the problem. The system clears addresses saved in the associative memory when there is no longer a need to compare those addresses. This approach is hardware-intensive, and also requires special instructions to trigger the checking for conflicts in addresses as well as to clear the address of an operand no longer needed.
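U.S. patent application Ser. No. 08/435,411, filed May 10, 1995, in the name of Ebcioglu et al., assigned to the assignee of the application, combines reordering of memory operations with speculative execution of memory operations. The reordering of the memory operations relies on:

static reordering of code by the compiler;
special hardware support to detect conflicts in memory references and manipulate data loaded out-of-order; and
compiler-generated code for operating on the data loaded out-of-order and for recovering from the detection of conflicts.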
The special hardware support consists of an address register for each register which can be the destination for the result of a load operation executed out-of-order, a comparator associated with each such address register, and special instructions to load a datum out-of-order and to "commit" such datum as well as any other values derived from it at in-order points in the program. Each out-of-order load records in the corresponding address register the memory address and size of the datum loaded; each store operation triggers the comparison of its (address, size) tuple against the contents of all address registers. If any such comparison detects an overlap, then the corresponding address register is marked as invalid. A special commit instruction is executed at the in-order point of the load instruction, which checks whether the associated address register is valid; if so, the datum loaded out-of-order and the datum in memory are coherent. On the other hand, if the address register is invalid, then the datum loaded out-of-order and the memory contents are not coherent, so that the load operation as well as any operation that depends on it must be re-executed. A trap is invoked at that time, transferring execution control to recovery code produced by the compiler which re-executes the load operation as well as the dependent operations.
U.S. Pat. No. 5,421,022 entitled "Apparatus and method for speculatively executing instructions in a computer system" issued on May 30, 1995 in the name of F. McKeen et al. describes an apparatus usable in the case of statically reordered ambiguous memory operations, which relies on content-addressable memories (CAM) to compare the address of every executed store operation with the address of every outstanding out-of-order load instruction. If an overlap is detected, the apparatus treats the out-of-order load as if it caused an exception, effectively causing the re-execution of the load operation at its in-order point, in its in-order (or precise) state. Similarly, U.S. Pat. No. 5,420,990 entitled "Mechanism for enforcing the correct order of instruction execution," also issued on May 30, 1995 in the name of F. McKeen et al., describes an apparatus closely related to the one proposed in U.S. Pat. No. 5,421,022 but usable in the case of memory operations reordered dynamically by the processor; this apparatus also relies on content-addressable memories.
A method and apparatus for reordering load instructions is described in the patent application entitled "Memory processor that permits aggressive execution of load instructions" by F. Amerson, R. Gupta, V. Kathal and M. Schlansker (UK Patent Application GB 2265481A, No. 9302148.3, filed on Apr. 2, 1993). This patent application describes a memory processor for a computer system in which a compiler moves long-latency load instructions earlier in the instruction sequence, to reduce the loss of efficiency resulting from the latency of the load. The memory processor saves load instructions in a special register file for a period of time sufficient to determine if any subsequent store instruction that would have been executed prior to the load references the same address as that specified by the load instruction. If so, the memory processor reinserts the original load in the instruction stream so that it gets executed in-order. Thus, this system permits moving loads ahead of stores under compiler control, and relies on hardware to insert code to recover from a conflict. However, this system does not permit reordering other instructions that depend on the load (the hardware resources are able to reinsert only the load instruction). In other words, the method and apparatus is limited to hiding the latency of load instructions, whose maximum value must be known at compile time.
The article by K. Diefendorff and M. Allen entitled "Organization of the Motorola 88110 superscalar RISC microprocessor," IEEE Micro, April 1992, pp. 40-63, describes the dynamic scheduler in the Motorola 88110 processor which dispatches store instructions to a store queue where the store operations might stall if the operand to be stored has not yet been produced by another operation. Subsequent load instructions can bypass the store operations and immediately access the memory, achieving dynamic reordering of memory accesses. An address comparator detects address hazards and prevents load operations from going ahead of store operations to the same address. The queue holds three outstanding store operations. The structure does not really move a load earlier in the sequential execution stream; instead, it only allows for a load operation not to be delayed as a result of a stalled store operation.
3. Problems with State of the Art
Software-based techniques to detect interference among reordered ambiguous memory operations suffer from large overhead, in the form of additional instructions that must be executed. Specifically, a load instruction needs to be checked against every ambiguous store instruction over which it is moved. For example, consider the case of moving a load instruction over several store instructions as in the following sequence:
______________________________________
store  r7,(r21)
store  r8,(r22)
store  r9,(r23)
load   r15,(r25)
______________________________________
In this case, the interference test requires comparing the address in register r25 with the addresses in registers r21, r22 and r23. Thus, the interference test requires at least five instructions, and may require many more (depending on the primitives for performing the comparisons and for combining several comparisons).
Moreover, if load and store instructions are byte-aligned (i.e. load and store instructions access data at any byte boundary in memory, and the data accessed is more than one byte), or if load and store instructions access entities of different size (different number of memory bytes), then the test is more complicated. Instead of checking just for equality in the addresses, the interference test must check for address overlap. Thus, assuming for example that rY contains the address used in an out-of-order load instruction and rX contains the address used in a succeeding store instruction, the test consists of checking that rY-rX is less than the number of bytes stored by the store instruction, and that rX-rY is less than the number of bytes accessed by the load instruction (both differences taken as signed values). The accesses overlap only when both conditions hold.
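The overlap test can be written out explicitly. Two byte ranges overlap exactly when each one begins before the other ends, which for a load at rY and a store at rX reduces to two signed comparisons. The sketch below is a Python illustration only; the function names `interferes` and `load_conflicts` and the (address, size) encoding of stores are assumptions made for this example.

```python
def interferes(load_addr, load_size, store_addr, store_size):
    """True if the bytes read by the load overlap the bytes written by the store.

    Equivalent to: load_addr < store_addr + store_size and
                   store_addr < load_addr + load_size.
    """
    return (load_addr - store_addr < store_size
            and store_addr - load_addr < load_size)

def load_conflicts(load_addr, load_size, stores):
    """Check one hoisted load against every store it was moved over.

    `stores` is an iterable of (address, size) pairs, one per store.
    """
    return any(interferes(load_addr, load_size, addr, size)
               for addr, size in stores)
```

For example, a 4-byte load at address 8 interferes with a 2-byte store at address 10 but not with a 4-byte store at address 12 or at address 4. Each store the load is moved over adds another pair of comparisons, which is the overhead that motivates hardware support.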
Hardware-assisted or hardware-only options for detecting interference among reordered ambiguous memory references avoid the overhead arising from executing extra instructions by saving the memory address used by out-of-order load instructions in special hardware resources (comparator registers), and continually checking the contents of those registers for overlap against the addresses of store instructions.
The resources required for hardware monitoring are complex and expensive. In every cycle, such resources must compare the address of each store operation issued in that cycle (assuming that one or more operations can be issued simultaneously) with all outstanding out-of-order load operations (i.e., those that have not yet reached their in-order point). This functionality can be achieved by using content-addressable memories, special register files, or multiple comparators, as illustrated by the examples of prior art given earlier. However, such hardware resources can only save (and compare against) a fixed number of out-of-order load addresses at any one point. Usually, this is a small number, so that only a limited (fixed) number of load operations can be executed out-of-order at any point in time. Such a fixed bound implies that an out-of-order load instruction cannot be issued as soon as a load unit becomes available to execute it; instead, the address checking hardware must also have resources available to save the address generated. This limitation adds complexity to the dispatch mechanism in the case of dynamic reordering, or restricts the number of ambiguous load instructions that can be moved out-of-order in the case of static reordering (i.e., the compiler must ensure that, at any given time, no more ambiguous load instructions have been moved over store instructions than the number of monitors available).
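The effect of the fixed bound can be illustrated with a toy model of a monitoring structure that has a limited number of entries. This Python sketch is not a description of any of the cited designs; the class name `MonitorTable`, its methods, and the helper `issue_three_loads` are hypothetical and stand in for a content-addressable memory or comparator bank of fixed capacity.

```python
class MonitorTable:
    """Fixed-capacity set of out-of-order load addresses being monitored."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.monitored = {}  # load tag -> (address, size)

    def try_issue_load(self, tag, address, size):
        # An out-of-order load may issue only if a monitor entry is free,
        # regardless of whether a load unit is available.
        if len(self.monitored) >= self.capacity:
            return False
        self.monitored[tag] = (address, size)
        return True

    def observe_store(self, address, size):
        # Compare one store against all outstanding out-of-order loads and
        # return the tags of the loads it overlaps.
        return [tag for tag, (a, s) in self.monitored.items()
                if address < a + s and a < address + size]

    def reach_in_order_point(self, tag):
        # The load is no longer out-of-order; free its monitor entry.
        del self.monitored[tag]

def issue_three_loads(capacity):
    """Try to issue three out-of-order loads; return which ones succeeded."""
    table = MonitorTable(capacity)
    return [table.try_issue_load(tag, address, 4)
            for tag, address in (("A", 0), ("B", 8), ("C", 16))]
```

With two monitors, `issue_three_loads(2)` returns `[True, True, False]`: the third load must wait until an earlier load reaches its in-order point and releases its entry, even if a load unit is idle.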