1. Technical Field
The present invention generally relates to computer processing systems and, in particular, to a method and apparatus for reordering load operations in a computer program. The invention is applicable to operations reordered when the program is generated (static reordering) as well as to operations reordered at execution time (dynamic reordering).
2. Background Description
Contemporary high-performance processors rely on superscalar, superpipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs (i.e., for executing more than one instruction at a time). In general, these processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch from memory more than one instruction per cycle, and are able to dispatch for execution more than one instruction per cycle subject to dependencies and availability of resources.
The pool of instructions from which the processor selects those that are dispatched at a given point in time is enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier, if the resources required by the later appearing operations are free. Thus, out-of-order execution reduces the overall execution time of a program by exploiting the availability of the multiple functional units and using resources that would otherwise be idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in their original sequential order.
In the case of memory-related operations, a memory load operation reads a datum from memory, loads it in a processor register, and frequently starts a sequence of operations that depend on the datum loaded. Thus, in addition to using idle resources, the early (out-of-order) initiation of memory load operations may hide delays in accessing memory, including potential cache misses.
In general, there are two basic approaches to implementing out-of-order execution and reordering of results: dynamic reordering and static reordering. In dynamic reordering, the instructions are analyzed at execution time, and the instructions and results are reordered in hardware. In static reordering, a compiler/programmer analyzes and reorders the instructions and the results produced by those instructions when the program is generated; thus, the reordering tasks are accomplished through software. These two approaches can be jointly implemented.
While significant research has been performed to support out-of-order execution in general, and between memory operations in particular, such research has primarily concentrated on uniprocessor execution. This in turn has focused on the ordering between load and (synchronous) store operations in a single instruction stream designed to execute on a single processor. This invention deals with the problem of asynchronous memory references typically found in multiprocessing environments, and their impact on reordering between multiple read operations from a single memory cell. While such transformations (i.e., reorderings) are safe in a strict uniprocessor environment, multiprocessor environments pose additional considerations such as, for example, the possibility of writes being performed by another processor with a distinct, unknown instruction stream.
To achieve predictable and repeatable computation of programs, a requirement of "sequential consistency" is described in the article by L. Lamport, "How to Make a Multiprocessor that Correctly Executes Multiprocess Programs", IEEE Transactions on Computers, C-28(9), pp. 690-91 (September 1979). The article by Lamport defines a multiprocessor system as sequentially consistent if "the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program". For static speculative execution, the order of the original logical program text is authoritative, not the reordered program text, and the compiler and hardware implementation must collaborate to generate an execution equivalent to that original order.
To achieve proper performance while simplifying coherence protocols between multiple processors in a system, several relaxations of the above described strictly sequential consistent order are possible. The types of re-ordering which are allowable depend on the memory consistency model guaranteed by a particular implementation. An overview of currently used and proposed consistency models and their characteristics is described in the article by S. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial", Technical Report 9512, Dept. of Electrical and Computer Engineering, Rice University, Houston, Tex. (September 1995).
A basic requirement in all these relaxed models is write serialization to achieve memory coherence, i.e., all writes to the same location are serialized in some order and are performed in that order with respect to any processor. This is equivalent to sequential consistency described by Lamport wherein each memory cell is considered a memory module. We will refer to sequential consistency with respect to a single memory cell as write serial, so as to differentiate it from sequential consistency for a larger memory module. Memory coherence can be achieved by ensuring that successive load operations to the same memory location preserve a weakly ascending order of data items presented in that memory location. Thus, in a sequence of load operations, any load may present only the same or a later data item as its predecessors.
For example, consider a sequence of data items d1, d2, d3, d4, d5, d6 and so forth, written into a given memory location by a second processor. Successive load operations from that memory location by a first processor may present the same data item as returned by the first load operation, or a later item present in that memory cell. Thus, if the first load operation returned data item d2, then the second load operation may return data items d2, d3, d4 and so forth, but not data item d1. Alternatively, if the first load operation returned data item d4, then the second load operation may return data items d4, d5, d6 and so forth, but not data items d1, d2, or d3.
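The weakly ascending order requirement illustrated above can be expressed as a short check. The following is a minimal sketch, not part of the disclosed apparatus; the function name `is_write_serialized` is hypothetical, and the sketch assumes that the data items written to the memory location carry distinct labels so that each item has a well-defined position in the write sequence.

```python
def is_write_serialized(write_order, load_results):
    """Return True if successive loads from one memory location observe a
    weakly ascending sequence of the data items written to that location.

    write_order  -- items in the order they were written (assumed distinct)
    load_results -- items returned by the loads, in program order
    """
    position = {item: i for i, item in enumerate(write_order)}
    indices = [position[item] for item in load_results]
    # Each load may return the same item as its predecessor, or a later one.
    return all(a <= b for a, b in zip(indices, indices[1:]))


writes = ["d1", "d2", "d3", "d4", "d5", "d6"]
print(is_write_serialized(writes, ["d2", "d2"]))  # same item twice
print(is_write_serialized(writes, ["d2", "d4"]))  # later item second
print(is_write_serialized(writes, ["d2", "d1"]))  # earlier item second
```

The first two calls describe permitted load sequences, while the third describes the case in which the second load returns d1 after the first load has returned d2.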
It is evident that in serial execution, this problem is resolved automatically by the nature of time moving forward. However, in out-of-order processors, load operations which access the same memory location may be executed out of order, such that a statically later load instruction reads a data item that is earlier in the sequence of data items than the item read by its statically preceding load instruction, which is executed at a later time.
One factor that limits the ability to reorder operations is ambiguous memory references. This is the case when a memory load operation appears after another memory load operation in a sequential instruction stream, and it is not possible to determine ahead of time whether the memory locations accessed by an out-of-order and an in-order load operation are different. For example, consider the following code fragment:
    s=*(X+a*4+5)
    u=s+4
    t=*Y
    v=t+8

wherein * denotes a memory access to the specified address, such that:

    *Y indicates the memory location whose address is contained in Y; and
    *(X+a*4+5) indicates the memory location whose address is specified by the expression X+a*4+5.
Assuming that a is a value stored in register r1 of a processor, X and Y are in registers r2 and r9, and s, t, u and v are assigned to registers r4, r5, r6 and r7, respectively, then the above code fragment can be represented by the following instruction sequence (wherein the first register after the name of the instruction is the target register, and the remaining registers are the operands):
    mul   r10, r1, 4      ; r10 = a*4
    add   r11, r10, 5     ; r11 = a*4+5
    add   r12, r11, r2    ; r12 = X+a*4+5
    load  r4, (r12)       ; s = *(X+a*4+5)
    add   r6, r4, 4       ; u = s + 4
    load  r5, (r9)        ; t = *Y
    add   r7, r5, 8       ; v = t + 8
If it can be determined that X+a*4+5 and Y refer to different addresses, then the four expressions can be scheduled for parallel execution, yielding, for example, the following sequence (wherein the symbol || denotes parallel execution):

    mul   r10, r1, 4      ||  load  r5, (r9)
    add   r11, r10, 5     ||  ...
    add   r12, r11, r2    ||  add   r7, r5, 8
    load  r4, (r12)       ||  ...
    ...                   ||  ...
    add   r6, r4, 4       ||  ...
In a machine with two execution units, the sequence above would take 6 cycles to complete (assuming that a load takes two cycles, and other operations take a single cycle).
On the other hand, if it cannot be determined whether X+a*4+5 and Y are always different (i.e., the addresses are ambiguous), then the two expressions would have to be scheduled in the original order, taking 9 cycles to complete (again, assuming that a load takes two cycles, and other operations take a single cycle).
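The cycle counts of 9 and 6 quoted above can be reproduced with a small calculation. This is a simplified sketch under stated assumptions: a load takes two cycles and every other operation takes one (as in the text), each of the two dependence chains is strictly serial, and in the parallel case each chain has an execution unit to itself. The names `LATENCY` and `chain_cycles` are hypothetical.

```python
# Latencies from the text: a load takes two cycles, anything else one cycle.
LATENCY = {"load": 2}


def chain_cycles(ops):
    """Cycles needed to run a strictly dependent chain of operations."""
    return sum(LATENCY.get(op, 1) for op in ops)


s_chain = ["mul", "add", "add", "load", "add"]  # computes s, then u
t_chain = ["load", "add"]                        # computes t, then v

# Ambiguous addresses: the chains run one after the other in program order.
in_order_cycles = chain_cycles(s_chain) + chain_cycles(t_chain)

# Disambiguated addresses: the chains run side by side on two units.
parallel_cycles = max(chain_cycles(s_chain), chain_cycles(t_chain))

print(in_order_cycles, parallel_cycles)
```

Under these assumptions the in-order schedule takes 6 + 3 = 9 cycles, and the parallel schedule is bounded by the longer chain at 6 cycles.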
If both load addresses reference the same memory location, and that memory location receives the data item sequence of d1 followed by d2, a total of four combinations of read-accesses for variables s and t are possible. Of these, the first three combinations shown below are write serialized, whereas the fourth combination does not satisfy the requirement of write serialization.
                                 write    write    write    not write
                                 serial   serial   serial   serial
    first load operation (s)     d1       d1       d2       d2
    second load operation (t)    d1       d2       d2       d1
Note how in the following example the reordered scheme of instructions will cause the actual user program which reads variable s followed by variable t to see a sequence of a later data item d2 preceding an earlier data item d1 if both load operations reference the same memory location. This is a significant problem with respect to synchronizing multiprocessors, or communicating with DMA devices:
    -- second processor writes datum d1 to memory location referenced by load into s and t --
    mul   r10, r1, 4      ||  load  r5, (r9)
    add   r11, r10, 5     ||  ...
    -- second processor modifies datum to d2 --
    add   r12, r11, r2    ||  add   r7, r5, 8
    load  r4, (r12)       ||  ...
    ...                   ||  ...
    add   r6, r4, 4       ||  ...
This has the net effect of loading d2 into s and d1 into t, which is not consistent with the write serialized sequence of values which was actually stored by the second processor in the memory location accessed by both load operations.
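The violation described above can be replayed in a few lines. The following is an illustrative sketch only: a single shared memory cell is modeled as a Python variable, the event list encodes the interleaving of the reordered schedule with the writes of the second processor, and the helper names `replay` and `weakly_ascending` are hypothetical.

```python
def replay(events):
    """Apply time-ordered events to one shared memory cell.

    Each event is ("write", value) or ("load", register_name).
    Returns the register contents after all events have run.
    """
    cell = None
    regs = {}
    for kind, arg in events:
        if kind == "write":
            cell = arg          # store by the second processor
        else:
            regs[arg] = cell    # load by the first processor
    return regs


def weakly_ascending(write_order, values):
    """True if values never step backwards in the write order."""
    indices = [write_order.index(v) for v in values]
    return indices == sorted(indices)


# Reordered schedule: the load into t executes before the load into s.
reordered = [("write", "d1"), ("load", "t"), ("write", "d2"), ("load", "s")]
regs = replay(reordered)

# The user program reads s first and t second.
program_order_values = [regs["s"], regs["t"]]
print(program_order_values, weakly_ascending(["d1", "d2"], program_order_values))
```

In program order the values observed are d2 followed by d1, which is the fourth, non-write-serial combination.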
The example above is not atypical. Ambiguity in memory references severely degrades system performance by forcing the sequential execution of operations that could otherwise be executed in parallel. However, such serialization of instructions can be avoided (that is, a logical successor load operation can be performed earlier than a logically preceding load operation) as long as the sequence of load result values perceived by the user program is write serialized. Thus, the out-of-order load operation performed earlier than the in-order load operation is valid as long as the data sequence of the load operations in their original program order is consistent with the corresponding data sequence in memory (i.e., each successor load returns the same value or a value occurring later on the time line with respect to all logically preceding read operations). Moreover, if these values are consistent, then any operation that depends on the datum loaded out-of-order can also be performed out-of-order. On the other hand, if the values are not consistent, then the datum loaded out-of-order and any results derived from it are invalid, making it necessary to re-execute the load operation at the in-order point as well as the associated dependent operations.
Various attempts have been made towards solving the problem of reordering memory operations with ambiguous references by processors. Most of these schemes assume that instructions are reordered statically (i.e., when the programs are generated). All these schemes rely on detecting interference through either address comparison or load result value comparison. If interference is detected, then the out-of-order load operation scheduled to execute before the in-order load operation (and those operations that depend on the load which has already been executed, if applicable) must be re-executed at its original in-order point. That is, the mechanisms enforce write serialization by re-executing all interfering load operations in-order. Interference detection and re-execution are performed either by extra instructions (software-based schemes), or by dedicated hardware resources (hardware-based schemes) sometimes with software assistance.
To ensure correctness when addresses overlap, the existing mechanisms: recognize that a load instruction previously executed (i.e., an out-of-order load instruction) interfered with another load instruction (i.e., an in-order load instruction); and re-execute the previously executed out-of-order load instruction, and any instructions that depend on the load instruction which has already been executed (i.e., the out-of-order load instruction).
For example, the code fragment given earlier could be modified as follows:

    t=*Y
    v=t+8
    s=*(X+a*4+5)
    u=s+4
    if (Y==(X+a*4+5)) /*compare addresses*/
        t=*Y
        v=t+8
    endif
In the case of static reordering, the sequence of instructions generated by the compiler/programmer differs among the various schemes proposed. Usually, a load instruction which has been moved over another load instruction is replaced by some new instruction (or instruction sequence) which performs the load operation and starts monitoring the addresses used by other load instructions. Another instruction (or an instruction field in the out-of-order load instruction) is used to indicate the place where the moved load instruction was originally located, which determines the end of the range of monitoring for interfering store operations.
In the case of dynamic reordering, the various load instructions are presented to the processor in program order, that is, the first load instruction is followed by the second load instruction. The processor reorders the instructions and, as in the case of static reordering, the processor must be able to detect if the first load instruction loads a memory location read by the second, out-of-order load operation which has already been executed. Thus, the processor must mark the load instruction as an out-of-order operation, set up a mechanism for detecting interference between out-of-order load operations and other load operations, recover the state of the processor when interference is detected, and re-execute the out-of-order load instruction as well as any other instructions which are dependent on the out-of-order load operation.
A summary of related art dealing with asynchronous memory operations in a multiprocessor environment when reordering memory load operations is now set forth.
A support mechanism for out-of-order load operations based on the detection of interference and the re-issuance of previously executed out-of-order load instructions is disclosed in U.S. Ser. No. 08/829,669, filed Mar. 31, 1997, entitled "Support for Out-Of-Order Execution of Loads and Stores in a Processor", and assigned to the assignee herein. The mechanism enters the address of each out-of-order load operation in a queue ("load-hit-load queue") until the original program location of the out-of-order load is reached. Other (in-order) load operations verify their addresses against the entries in the load-hit-load queue and, if interference is detected, then the interfering out-of-order load (and all depending operations) are re-issued by the processor.
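The behavior of such a queue can be modeled in software as follows. This is a rough sketch and not the hardware design of the referenced application: the class `LoadHitLoadQueue` and its method names are hypothetical, entries are keyed by an arbitrary tag for each out-of-order load, and interference is assumed to mean exact address equality.

```python
class LoadHitLoadQueue:
    """Toy model of a queue tracking addresses of out-of-order loads."""

    def __init__(self):
        self._entries = {}  # load tag -> address read out of order

    def enter(self, tag, address):
        """Record an out-of-order load when it executes early."""
        self._entries[tag] = address

    def retire(self, tag):
        """Drop the entry once the load's original program location is reached."""
        self._entries.pop(tag, None)

    def check(self, address):
        """Called by an in-order load; returns tags of interfering loads,
        which the processor would then have to re-issue."""
        return [tag for tag, a in self._entries.items() if a == address]


queue = LoadHitLoadQueue()
queue.enter("load_t", 0x1000)      # load into t was moved ahead
print(queue.check(0x2000))         # in-order load to another address
print(queue.check(0x1000))         # in-order load to the same address
queue.retire("load_t")
print(queue.check(0x1000))         # entry is gone after retirement
```

A non-empty result from `check` corresponds to the case in which the interfering out-of-order load and all operations that depend on it are re-issued.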
An alternative detection mechanism is described in the context of load/store interference detection based on load data verification in U.S. Pat. No. 5,758,051, issued May 26, 1998, entitled "Method and Apparatus for Reordering Memory Operations in a Processor", and assigned to the assignee herein. In this approach, data items accessed by an out-of-order load operation are read in-order, and the result of the in-order load operation is compared to the out-of-order result. If the two values are identical, then no detectable interference has occurred and the program continues execution. However, if the two values are not identical, then the value returned by the in-order load operation is used to re-execute all dependent instructions. Note that for interference detection based on load verification, if interference cannot be detected, then it is presumed that no interference exists. This approach reduces the amount of hardware necessary to monitor interference and the number of re-executions, but requires additional bandwidth to perform a second in-order load for every load operation moved out-of-order.
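The value-comparison approach can be sketched as follows. The sketch is illustrative and makes several assumptions: memory is modeled as a Python dictionary, the asynchronous write by another processor is modeled as a plain assignment between the early load and the in-order load, and the function name `verify_load` is hypothetical.

```python
def verify_load(memory, address, early_value):
    """Perform the in-order load and compare it with the early result.

    Returns (value_to_use, mismatch). A mismatch means the dependent
    instructions must be re-executed with the in-order value.
    """
    in_order_value = memory[address]
    return in_order_value, in_order_value != early_value


memory = {0x1000: 10}

t_early = memory[0x1000]   # load hoisted above its original position
v_early = t_early + 8      # dependent operation, also executed early

memory[0x1000] = 20        # asynchronous write by another processor

t, mismatch = verify_load(memory, 0x1000, t_early)
v = t + 8 if mismatch else v_early   # re-execute dependents only on mismatch
print(t, v, mismatch)
```

If the location is changed and then restored to its earlier value before the in-order load, the comparison succeeds, which is the case in which undetectable interference is presumed not to exist.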
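U.S. Pat. No. 5,625,835, issued on Apr. 29, 1997, entitled "Method and Apparatus for Reordering Memory Operations in a Superscalar or Very Long Instruction Word Processor", and assigned to the assignee herein, combines reordering of memory operations with speculative execution of memory operations. The reordering of memory operations relies on:

    static reordering of code by the compiler to exploit instruction-level parallelism;
    special hardware support to detect conflicts in memory references and manipulate data loaded out-of-order; and
    compiler-generated code for operating on the data loaded out-of-order and for recovering from the detection of conflicts.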
The special hardware support consists of an address register for each register which can be the destination for the result of a load operation executed out-of-order, and a comparator associated with each such address register. "Special instructions" are used to load a datum out-of-order and to "commit" such datum as well as any other values derived from it at in-order points in the program. Each out-of-order load records in the corresponding address register the memory address and size of the datum loaded; each store operation triggers the comparison of its (address, size) tuple against the contents of all address registers. In a multiprocessor embodiment, the processing unit responsible for memory disambiguation receives all asynchronous store requests issued by other processors, and invalidates all out-of-order load operations which interfere with such store requests. Thus, if the (address, size) tuple of a store operation matches the (address, size) tuple held in an address register, then that address register is marked as invalid. A special commit instruction is executed at the in-order point, which checks whether the associated address register is valid; if so, then the datum loaded out-of-order and the datum in memory are coherent. On the other hand, if the address register is invalid, then the datum loaded out-of-order and the memory contents are not coherent. Thus, the load operation as well as any other operation dependent thereon must be re-executed. A trap is invoked at that time, transferring execution control to recovery code produced by the compiler which re-executes the load operation as well as the dependent operations.
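The interplay of address registers, store comparison, and the commit check can be modeled in software as follows. This is a simplified sketch rather than the hardware of the referenced patent: the class `Disambiguator`, its method names, and the helper `overlaps` are hypothetical, and a "match" between (address, size) tuples is assumed here to mean that the two byte ranges overlap.

```python
def overlaps(a_addr, a_size, b_addr, b_size):
    """True if byte ranges [a_addr, a_addr+a_size) and [b_addr, b_addr+b_size) intersect."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size


class Disambiguator:
    """Toy model: one address register per out-of-order load destination."""

    def __init__(self):
        self._regs = {}  # destination register -> [address, size, valid]

    def load_out_of_order(self, dest, address, size):
        """Record the address and size of a datum loaded out-of-order."""
        self._regs[dest] = [address, size, True]

    def observe_store(self, address, size):
        """Compare a store (local or from another processor) against all
        address registers and invalidate every register it interferes with."""
        for entry in self._regs.values():
            if overlaps(entry[0], entry[1], address, size):
                entry[2] = False

    def commit(self, dest):
        """Commit at the in-order point. True means the early datum is
        coherent; False means a trap to recovery code would be required."""
        return self._regs.pop(dest)[2]


unit = Disambiguator()
unit.load_out_of_order("r5", 0x1000, 4)   # early load of a 4-byte datum
unit.observe_store(0x1002, 2)             # store touching bytes of that datum
print(unit.commit("r5"))                  # register was invalidated
```

A result of False from `commit` stands for the case in which the trap transfers control to the compiler-generated recovery code.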