1. Field of the Invention
The present invention relates generally to computer processing systems and, in particular, to a method and apparatus for reordering load operations along multiple execution paths in a computer program. The invention is applicable to operations reordered when the program is generated (static reordering) as well as to operations reordered at execution time (dynamic reordering).
2. Background Description
Contemporary high-performance processors rely on superscalar, superpipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs (i.e., for executing more than one instruction at a time). In general, these processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch from memory more than one instruction per cycle, and are able to dispatch for execution more than one instruction per cycle subject to dependencies and availability of resources.
The pool of instructions from which the processor selects those that are dispatched at a given point in time is enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier, if the resources required by the later appearing operations are free. Thus, out-of-order execution reduces the overall execution time of a program by exploiting the availability of the multiple functional units and using resources that would otherwise be idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in their original sequential order.
In the case of memory-related operations, a memory load operation reads a datum from memory, loads it in a processor register, and frequently starts a sequence of operations that depend on the datum loaded. Thus, in addition to using idle resources, the early (out-of-order) initiation of memory load operations may hide delays in accessing memory, including potential cache misses.
In general, there are two basic approaches to implementing out-of-order execution and reordering of results: dynamic reordering and static reordering. In dynamic reordering, the instructions are analyzed at execution time, and the instructions and results are reordered in hardware. In static reordering, a compiler/programmer analyzes and reorders the instructions and the results produced by those instructions when the program is generated, thus the reordering tasks are accomplished through software. These two approaches can be jointly implemented.
In current processor designs, instructions which can be executed out-of-order typically come from a single instruction stream which is usually the most likely path. The likelihood of a path can be determined by a number of methods, such as static or dynamic branch prediction, runtime profiling, or synthetically generated probabilities. Instructions from the most likely path are executed out-of-order and, if this path is taken, then the execution time of the path is reduced. However, if another, less likely path is taken, then no benefit is obtained by the out-of-order execution. In fact, the out-of-order execution of instructions on the predicted path may actually increase the runtime if the prediction is incorrect. This execution approach is referred to as a “single path” execution approach, since out-of-order execution occurs along a single predicted path which is deemed most likely.
Thus, effective out-of-order execution using the single path approach requires that the predictability of the most likely path be sufficiently high to generate any gains. Note that even though the predictability of a single branch may be deemed high, following the execution of several branches progressively degrades the probability that a given point on the path may be reached.
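The degradation can be made concrete with a small calculation. As a rough sketch, assume that every branch on the path is predicted correctly with the same probability p and that the branches are independent; both are simplifying assumptions introduced here for illustration. The probability of reaching a point n branches ahead is then p raised to the power n. The function name path_probability is hypothetical.

```c
#include <assert.h>

/* Probability that execution reaches a point n branches down the predicted
 * path, ASSUMING every branch is predicted correctly with the same
 * probability p and that the branches are independent of one another. */
double path_probability(double p, unsigned n)
{
    assert(p >= 0.0 && p <= 1.0);
    double prob = 1.0;
    for (unsigned i = 0; i < n; ++i)
        prob *= p;          /* one more branch that must go the predicted way */
    return prob;
}
```

Under these assumptions, a per-branch accuracy of 0.9 leaves only about a 0.59 chance of reaching a point five branches ahead.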
To reduce the risk of mispredicting the path and losing all of the gains from out-of-order execution, it is advisable to make instructions from multiple paths available for execution. Various schemes and designs have been described to either statically or dynamically select instructions from multiple paths for out-of-order execution. These approaches are referred to as “multi path” approaches, since out-of-order execution occurs along multiple likely paths of execution.
Consider the following code fragment:
a[16]=0;
if (i < j)
t=a[i]+1;
else
t=a[j]+1;
In a single path approach, it has to be determined whether the condition i < j is more likely to be true or false over the whole execution of the program. When i < j is expected to be true, out-of-order execution and speculation may generate the following code:
t′=a[i]+1;
a[16]=0;
if (i < j)
t=t′;
else
t=a[j]+1;
If a sufficient number of execution units are available, execution of this code may be improved for the expected path. However, if the alternate path is taken, no code improvements are achieved.
In a multipath scheme, instructions from both branches of the if statement can be moved out-of-order. This allows more aggressive code to be generated which will have good performance regardless of the branch direction (provided that enough execution units are available to execute the code from both branch paths without penalty):
t′=a[i]+1;
t″=a[j]+1;
a[16]=0;
if (i < j)
t=t′;
else
t=t″;
One factor that limits the ability to reorder operations is ambiguous memory references; this is the case when a memory load operation appears after a memory store operation in a sequential instruction stream, and it is not possible to determine ahead of time whether the memory locations accessed by the load and the store operations are different. For example, consider the following code fragment:
*X=(a+b+2)<<4
r=((*Y)+c)^d
wherein:
*X indicates the memory location whose address is contained in X;
<< indicates a left-shift operation; and
^ indicates an exclusive-or (XOR) operation.
Assuming that a, b, c, and d are values stored in registers r1 through r4 of a processor, respectively, and that X and Y are in registers r8 and r9, respectively, then this code fragment can be represented by the following instruction sequence (wherein the first register after the name of the instruction is the target register, and the remaining registers are the input operands):
If it can be determined that X and Y are different, then the two expressions can be scheduled for parallel execution, yielding a sequence as follows (wherein the symbol || denotes parallel execution):
In a machine with two execution units, the above sequence would take 4 cycles to complete (assuming that a load takes two cycles, and other operations take a single cycle).
On the other hand, if it cannot be determined whether X and Y are always different, i.e., the addresses are ambiguous, then the two expressions would have to be scheduled in the original order, taking 8 cycles (assuming again that a load takes two cycles).
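The two cycle counts can be checked with simple arithmetic. The sketch below assumes a particular operation mix that is not spelled out here: the store expression is taken to be two additions, a shift, and the store itself, and the load expression is taken to be the load, an addition, and an exclusive-or. All function names are hypothetical.

```c
#include <assert.h>

enum { LOAD_CYCLES = 2, OTHER_CYCLES = 1 };   /* latencies stated in the text */

/* Latency of a chain of dependent operations executed one after another. */
int chain_cycles(int loads, int others)
{
    assert(loads >= 0 && others >= 0);
    return loads * LOAD_CYCLES + others * OTHER_CYCLES;
}

/* ASSUMED operation mix for the two expressions. */
int store_chain(void) { return chain_cycles(0, 4); }  /* add, add, shift, store */
int load_chain(void)  { return chain_cycles(1, 2); }  /* load, add, xor */

/* Ambiguous addresses: the load chain must wait for the store chain. */
int serial_cycles(void) { return store_chain() + load_chain(); }

/* Disambiguated addresses: each chain runs on its own execution unit. */
int parallel_cycles(void)
{
    int s = store_chain(), l = load_chain();
    return s > l ? s : l;
}
```

With this assumed mix, serial_cycles() yields 8 and parallel_cycles() yields 4, matching the figures given above.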
The above example is not atypical. Ambiguity in memory references degrades performance fairly severely by forcing the sequential execution of operations that could otherwise be executed in parallel. However, such a serialization can be avoided (that is, the load operation can be performed earlier than the store operation) as long as the datum loaded out-of-order is the same as the value that would have been loaded after the store operation. Thus, the load operation performed earlier than the store operation is valid as long as the datum loaded out-of-order is coherent with the corresponding datum in memory (i.e., has the same value) at the original point of the load operation in the instruction stream (the in-order point). Moreover, if these values are coherent, any operation that depends on the datum loaded out-of-order can also be performed out-of-order. On the other hand, if the values are not coherent, then the datum loaded out-of-order and any results derived from it are invalid, making it necessary to re-execute the load operation at the in-order point, as well as the associated dependent operations.
Various attempts have been made towards solving the problem of reordering memory operations with ambiguous references by processors. These schemes assume that instructions are reordered from a single path only and are generally not applicable to out-of-order execution along multiple paths.
While the prior art has dealt with detecting interference between memory references along a single path, these inventions are generally not applicable for execution along multiple paths. To allow efficient out-of-order execution along multiple paths, only the interference along the eventually taken path should be reported and resolved.
Consider the following example, with values of i=16 and j=15.
The sequence of executed instructions for the values of i=16 and j=15 is then 1, 2, 3, 4, 7. Although interference exists between instructions 1 and 3, no corrective action is required: the value of t′ is computed speculatively for the case wherein execution reaches line 5, and execution never reaches line 5.
The situation is different for values of i=16 and j=17, when the set of executed instructions is then 1, 2, 3, 4, 5. The interference between instructions 1 and 3 has to be corrected, since the speculatively computed result by instruction 1 does contribute to the actual computation of the program.
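The two cases can be reproduced with a short C sketch. It assumes that the numbered example is the multipath fragment shown earlier, with its lines numbered 1 through 7, and it models interference detection as a plain address comparison. The function names, the array size, and the repairs counter are hypothetical and serve only to make the behavior observable.

```c
#include <assert.h>

enum { N = 32 };   /* hypothetical array size */

/* Original, in-order semantics of the fragment. */
int in_order(int a[N], int i, int j)
{
    assert(i >= 0 && i < N && j >= 0 && j < N);
    a[16] = 0;
    return (i < j) ? a[i] + 1 : a[j] + 1;
}

/* Reordered version: both loads are hoisted above the store. Only the load
 * belonging to the path actually taken is repaired. */
int reordered(int a[N], int i, int j, int *repairs)
{
    assert(i >= 0 && i < N && j >= 0 && j < N);
    int t1 = a[i] + 1;            /* line 1: speculative load for the "if" path   */
    int t2 = a[j] + 1;            /* line 2: speculative load for the "else" path */
    a[16] = 0;                    /* line 3: store that may interfere             */
    int hit1 = (i == 16);         /* store interfered with the load on line 1     */
    int hit2 = (j == 16);         /* store interfered with the load on line 2     */
    *repairs = 0;
    if (i < j) {                  /* line 4 */
        if (hit1) { t1 = a[i] + 1; ++*repairs; }   /* re-execute on taken path */
        return t1;                /* line 5 */
    } else {
        if (hit2) { t2 = a[j] + 1; ++*repairs; }   /* re-execute on taken path */
        return t2;                /* line 7 */
    }
}
```

For i=16 and j=15 the interference on line 1 is ignored because line 5 is never reached; for i=16 and j=17 the same interference triggers one re-execution, and both versions return the same value.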
A summary of relevant art dealing with asynchronous memory operations in a multiprocessor environment that implements reordering of memory load operations along multiple paths of execution will now be given.
Current architectures which speculate along a single path of execution maintain a buffer of stores waiting to be written into the memory system. When a subsequent load occurs, its address is first compared to the entries in the store buffer (from newest to oldest) and, if a match occurs, then the value for the load is forwarded to the appropriate destination (instead of being fetched from cache or main memory). This method is inadequate when loads may be speculatively executed prior to stores, even if only a single path of execution is followed. To handle such speculative loads, a mechanism is needed when a store is reached to detect when a previous speculative load has (incorrectly) read data from the same location now being updated by the store. This check is typically accomplished by maintaining a content addressable memory (CAM) of speculative load addresses, and comparing all addresses therein when a store is reached. However, this scheme is still inadequate when loads from multiple paths may be speculated.
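The two single-path structures described above can be sketched in C as follows. This is an illustrative software model, not a description of any particular hardware: the type and function names, the capacities, and the use of exact address equality (rather than overlapping byte ranges) are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { STORE_BUF_CAP = 8, LOAD_CAM_CAP = 8 };   /* hypothetical capacities */

typedef struct { uintptr_t addr; int value; } PendingStore;
typedef struct { PendingStore entries[STORE_BUF_CAP]; size_t count; } StoreBuffer;
typedef struct { uintptr_t addr[LOAD_CAM_CAP]; size_t count; } LoadCam;

/* Append a store that is waiting to be written to the memory system. */
void store_buffer_push(StoreBuffer *sb, uintptr_t addr, int value)
{
    assert(sb->count < STORE_BUF_CAP);
    sb->entries[sb->count++] = (PendingStore){ addr, value };
}

/* Search from newest to oldest; on a match, forward the buffered value. */
bool store_buffer_forward(const StoreBuffer *sb, uintptr_t addr, int *out)
{
    for (size_t k = sb->count; k-- > 0; ) {
        if (sb->entries[k].addr == addr) {
            *out = sb->entries[k].value;
            return true;
        }
    }
    return false;   /* caller falls back to cache or main memory */
}

/* When a store is reached, compare its address against every recorded
 * speculative load address, as a CAM would do in parallel. */
bool load_cam_conflict(const LoadCam *cam, uintptr_t store_addr)
{
    for (size_t k = 0; k < cam->count; ++k)
        if (cam->addr[k] == store_addr)
            return true;
    return false;
}
```

Note that load_cam_conflict has no notion of which path a speculative load belongs to, which is the limitation discussed next.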
For example, consider a load operation from a path A that is speculatively executed prior to the conditional branch which precedes the load. Then assume that during execution of the program, path A is not executed, i.e., the conditional branch goes the other way to a path B. Further, assume that path B has a store operation which writes to the same address as the speculative load from path A. The machine does not know that there is no real conflict between the speculative load and the store. It also does not know that the load CAM can discard the entry for this speculative load. The present invention described hereinbelow overcomes both of these problems.
In U.S. Ser. No. 08/829,669, now U.S. Pat. No. 5,931,957, entitled “Support for Out-of-order Execution of Loads and Stores in a Processor”, filed on Mar. 31, 1997, assigned to the assignee herein, and incorporated herein by reference, a mechanism is disclosed for executing along a single predicted path. The mechanism is based on reorder buffers for out-of-order execution of load operations with respect to other load operations and store operations. As soon as interference is detected, the out-of-order instruction and all subsequent operations are flushed and out-of-order execution restarts. This approach is overly aggressive for execution along multiple paths, where interference along one, potentially untaken path, could result in the flushing of operations along another, possibly taken, path.
Several designs have been proposed for executing along multiple execution paths, but none of the designs address the issue of memory consistency or efficiently detect interference between reordered instructions when speculating along multiple paths. An architecture for executing instructions along multiple paths is described by Klauser et al., in “Selective Eager Execution on the Polypath Architecture”, 25th Annual International Symposium on Computer Architecture, pp. 250-59 (1998). This architecture uses “context tags” to identify paths and tag data in store buffers. This information is used to selectively forward data from store buffers to load instructions. However, this architecture does not deal with detecting interference when load instructions are moved speculatively over store operations.
A VLIW processor for implementing the PowerPC architecture using binary translation is described by K. Ebcioglu, J. Fritts, S. Kosonocky, M. Gschwind, E. Altman, K. Kailas, and T. Bright, in “An Eight-Issue Tree-VLIW Processor for Dynamic Binary Translation”, International Conference on Computer Design (October 1998). The VLIW processor uses load data verification to speculate load operations along multiple paths, and detect interference. Whereas the load operation is performed out-of-order, the load verify instruction is executed in-order (and only if control reaches a given path). As a result, interference will only be detected and repaired for the taken path, i.e., those load operations which influence the correct execution of the program.
This interference detection mechanism based on load data verification is described in U.S. Pat. No. 5,758,051, entitled “Method and Apparatus for Reordering Memory Operations in a Processor”, issued on May 26, 1998, assigned to the assignee herein, and incorporated herein by reference. In this approach, data items accessed by an out-of-order load operation are read in-order, and the result of the in-order load operation is compared to the out-of-order result. If the two values are identical, then no detectable interference has occurred and the program continues execution. On the other hand, if the items are dissimilar, then the value returned by the in-order load operation is used to re-execute all dependent instructions. Note that for interference detection based on load verification, if interference cannot be detected, then it is presumed that no interference exists. This approach reduces the amount of hardware necessary to monitor interference and the number of re-executions, but requires additional bandwidth to perform a second in-order load for every load operation moved out-of-order.
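A minimal software sketch of load data verification, applied to the earlier fragment involving *X and *Y, is given below. It only imitates the idea in C and is not the mechanism of the referenced patent; the function name, the reexecuted flag, and the use of unsigned operands (chosen so that the shift is well defined) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Computes *X = (a+b+2) << 4 and r = ((*Y)+c) ^ d with the load of *Y
 * hoisted above the store to *X, then verifies the load in-order. */
unsigned load_verify_example(unsigned *X, const unsigned *Y,
                             unsigned a, unsigned b, unsigned c, unsigned d,
                             int *reexecuted)
{
    assert(X != NULL && Y != NULL && reexecuted != NULL);

    unsigned early = *Y;              /* out-of-order load                  */
    unsigned r = (early + c) ^ d;     /* dependent operations, speculative  */

    *X = (a + b + 2u) << 4;           /* store, possibly to the same address */

    unsigned verify = *Y;             /* second load at the in-order point  */
    *reexecuted = (verify != early);  /* differing values reveal interference */
    if (*reexecuted)
        r = (verify + c) ^ d;         /* re-execute the dependent operations */
    return r;
}
```

When X and Y point to the same location and the stored value differs from the old one, the comparison fails and the dependent operations are re-executed. If the store happens to write the value already present, no interference is detected, which is harmless because the speculative result is still correct. The second load of *Y is the extra memory access that this approach costs.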
3. Problems with State of the Art
The invention disclosed in the above referenced patent application Ser. No. 08/829,669 successfully addresses speculation along a single path by using re-order buffers, without incurring a cost in the form of additional memory subsystem accesses. However, the approach described therein is too zealous in restarting when performing out-of-order execution along multiple paths. As a result, interference may be detected and reported erroneously in two cases: (1) real interference exists along an execution path which is not taken, or (2) interference is reported between instructions on two disjoint paths, if the processor supports speculative execution of store operations along multiple paths.
While interference testing based on load-verification allows accurate detection of interference in the presence of speculation along multiple paths, it does so at the cost of significantly increased memory bandwidth requirements.
Thus, it would be desirable and highly advantageous to have a method and apparatus for keeping track of ambiguities along each path and initiating re-execution of an ambiguous load operation along the actually executed path when that path has been determined.
The present invention is directed to a method and apparatus for reordering memory operations along multiple execution paths in a processor. The present invention utilizes path information to perform out-of-order execution and speculation along multiple probable execution paths. Interference information is maintained to ensure correct operation of the memory system without reporting spurious interference conflicts (of the type described above with respect to the prior art).
According to one aspect of the present invention, there is provided a method for scheduling instructions for execution along multiple paths in a computer processing system implementing out-of-order execution. The method includes the step of selecting and moving a next instruction from its current position in a sequence of instructions to an earlier position. It is determined whether the selected instruction may reference a memory location for read-access. It is determined whether the selected instruction was previously moved over a non-selected instruction which may ambiguously reference the memory location, when the selected instruction may reference the memory location for read-access. It is determined whether the selected instruction was previously moved over a branch instruction, when the selected instruction was previously moved over the non-selected instruction. A record of the selected instruction is stored for future reference, when the selected instruction was previously moved over the branch instruction. The record includes a path specifier for indicating a path from a current locus of execution to a basic block corresponding to an in-order position of the selected instruction.
According to another aspect of the present invention, there is provided a method for checking for interference between a current memory operation and previously executed load instructions in a computer processing system implementing out-of-order execution along multiple paths. The method includes the step of storing in a table a plurality of entries, wherein each given entry corresponds to a given speculative operation and comprises an address field for storing a memory address corresponding to the given speculative load operation, and an interference field for indicating whether the given speculative operation has been interfered with. It is determined whether a current entry in the table refers to a speculative load operation in a current basic block. It is determined whether a first interference exists between the memory address stored in the address field of the current entry and the current memory operation, when the current entry refers to the speculative load operation in the current basic block. Re-execution of the speculative load operation is initiated, when the first interference exists. It is determined whether a second interference exists between the current memory operation and the speculative load operation, when the current entry does not refer to the load operation, the speculative load operation having been moved over at least one branch and the current memory operation. The second interference is recorded in the interference field of the current entry, when the second interference exists.
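The checking method of this aspect can be outlined for a single table entry as follows. This is a sketch under stated assumptions rather than the claimed implementation: only the address field and the interference field come from the description above, while the home_block field (used here to decide whether an entry refers to a load in the current basic block), the Action enumeration, the function name, and the use of exact address equality as the interference test are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t addr;        /* address field: address of the speculative load */
    bool      interfered;  /* interference field                              */
    int       home_block;  /* ASSUMED: basic block of the load's in-order position */
} SpecLoadEntry;

typedef enum { ACTION_NONE, ACTION_REEXECUTE, ACTION_RECORDED } Action;

/* Check one table entry against the current memory operation. */
Action check_entry(SpecLoadEntry *e, int current_block, uintptr_t op_addr)
{
    assert(e != NULL);
    bool same_addr = (e->addr == op_addr);

    if (e->home_block == current_block) {
        /* Load belongs to the current basic block: first interference,
         * so re-execution is initiated immediately. */
        return same_addr ? ACTION_REEXECUTE : ACTION_NONE;
    }
    if (same_addr) {
        /* Load was moved over at least one branch: second interference,
         * only recorded until the path is resolved. */
        e->interfered = true;
        return ACTION_RECORDED;
    }
    return ACTION_NONE;
}
```

Deferring the second kind of interference to the interference field is what keeps a conflict on an untaken path from forcing a re-execution.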