1. Field of the Invention
The present invention generally relates to reordering memory operations in a superscalar or very long instruction word (VLIW) processor in order to exploit instruction-level parallelism in programs and, more particularly, to a method and apparatus for reordering memory operations in spite of arbitrarily separated or ambiguous memory references, thereby achieving a significant improvement in the performance of the computer system. The method and apparatus are applicable to uniprocessor and multiprocessor systems.
2. Background Description
High performance contemporary processors rely on superscalar and/or very long instruction word (VLIW) techniques for exploiting instruction level parallelism in programs; that is, for executing more than one instruction at a time. These processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch two or more instructions per cycle from memory, and are able to dispatch two or more instructions per cycle subject to dependencies and availability of resources. These capabilities are exploited by compilers which generate code that is optimized for superscalar and/or VLIW features.
In sequential programs, a memory load operation reads a datum from memory, loads it into a processor register, and frequently starts a sequence of operations that depend on the datum loaded. In a superscalar or VLIW processor in which there are resources available, it is advantageous to initiate memory load operations as early as possible because that may lead to the use of otherwise idle resources and may hide delays in accessing memory (including potential cache misses), thus reducing the execution time of programs. The load, as well as the operations that depend on the load, are executed earlier than they would have been in a strictly sequential program, achieving a shorter execution time. This requires the ability to perform non-blocking loads (i.e., continue issuing instructions beyond a load which produces a cache miss), the ability to issue loads ahead of preceding stores (i.e., out-of-order loads), the ability to move loads ahead of preceding branches (i.e., speculation), and the ability to move operations that depend on a load ahead of other operations. In other words, what is needed is the ability to reorder the operations of the program.
Several factors limit the ability to perform reordering of memory operations, in particular factors arising from run-time dependencies in the execution of a program. These include moving operations ahead of conditional branch instructions and ambiguous memory references.
Moving an operation ahead of a preceding conditional branch instruction introduces speculation in the execution of a program, because the operation is executed before it is known whether it will be really required. The code motion is performed under the expectation that the operation will be needed. Register-to-register operations with no side-effects can be executed speculatively, as long as the results are saved in unused ("dead") registers. If an operation was not required, the result is just ignored. On the other hand, register-to-register operations with side-effects and memory load operations can be executed speculatively only if there exist mechanisms to recover from side effects which should not have been produced, such as exceptions (errors), protection violations, or accesses to volatile memory locations.
Moving a memory load operation ahead of a preceding memory store operation faces the problem of ambiguous references in the execution of the program if it is not possible to determine at compile time that the memory locations accessed by the load and store are different. Unambiguous memory references can be executed out-of-order because they do not conflict. On the other hand, ambiguous memory operations can be executed out-of-order only if there exist mechanisms to detect a potential conflict, ignore the data loaded ahead of time, and reload the correct value after the store operation has been performed. The conflict may be in a single byte of a multiple byte operand, so the store operation must be completed before the load operation can be performed.
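The detect, ignore, and reload requirement described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the mechanism of any particular processor: memory is modeled as a Python bytearray, operands are assumed to be four bytes wide, and the names ranges_overlap, load_word, store_word, in_order, and hoisted_load are hypothetical.

```python
WORD = 4  # assumed operand size in bytes


def ranges_overlap(addr_a, addr_b, size=WORD):
    """True if [addr_a, addr_a+size) and [addr_b, addr_b+size) share a byte."""
    return addr_a < addr_b + size and addr_b < addr_a + size


def load_word(mem, addr):
    return int.from_bytes(mem[addr:addr + WORD], "little")


def store_word(mem, addr, value):
    mem[addr:addr + WORD] = value.to_bytes(WORD, "little")


def in_order(mem, store_addr, value, load_addr):
    """Original program order: the store, then the load."""
    store_word(mem, store_addr, value)
    return load_word(mem, load_addr)


def hoisted_load(mem, store_addr, value, load_addr):
    """Load moved ahead of the store, with conflict detection and reload."""
    early = load_word(mem, load_addr)          # executed ahead of the store
    store_word(mem, store_addr, value)
    if ranges_overlap(store_addr, load_addr):  # conflict in at least one byte
        return load_word(mem, load_addr)       # ignore early value and reload
    return early
```

In this sketch a store at address 0 and a load at address 3 conflict in a single byte, so the early value is discarded and the load is repeated after the store; a load at address 4 does not conflict and keeps the early value.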
Although the two problems described above are different, their effects and requirements are the same. Namely, there must exist mechanisms to detect and recover from the side effects or ambiguities. In the following discussion, both of these problems are referred to as "reordered memory accesses problems".
Contemporary compilation techniques include static memory disambiguation algorithms for reordering memory operations. These algorithms determine if two memory references, a memory store operation followed by a memory load operation, access the same location. If the references do not conflict (i.e., they address different memory locations), then it is possible to reorder the operations so that the load can be executed ahead of the store. Static disambiguation works well only if the memory access pattern is predictable. Frequently, that is not the case, and the compiler/programmer must make the conservative assumption that their references actually conflict so they must be executed sequentially (in their original order), which reduces the potential instruction-level parallelism in the program.
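The conservative decision made by a static disambiguator can be sketched as follows. This is an illustrative simplification, not any specific compiler's algorithm: each reference is assumed to be summarized as a named base object plus a constant offset, either of which may be unknown at compile time, and distinct named objects are assumed never to overlap. The names Ref, may_conflict, and can_hoist_load are hypothetical.

```python
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    base: Optional[str]    # name of a distinct object, or None if unknown
    offset: Optional[int]  # constant offset, or None if unknown at compile time


def may_conflict(store: Ref, load: Ref) -> bool:
    """Conservative answer: True unless the references provably differ."""
    if store.base is None or load.base is None:
        return True        # unknown pointer: must assume a conflict
    if store.base != load.base:
        return False       # assumed: distinct objects never overlap
    if store.offset is None or load.offset is None:
        return True        # same object, unpredictable index
    return store.offset == load.offset


def can_hoist_load(store: Ref, load: Ref) -> bool:
    """The load may be moved ahead of the store only if no conflict is possible."""
    return not may_conflict(store, load)
```

Whenever either reference is unpredictable, the sketch answers that a conflict is possible, which forces the original order and gives up the potential parallelism.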
Reordering of memory operations has been a subject of active interest. See, for example, the article by K. Diefendorff and M. Allen entitled "Organization of the Motorola 88110 superscalar RISC microprocessor", IEEE Micro, April 1992, pp. 40-63. The dynamic scheduler in the Motorola 88110 processor dispatches store instructions to a store queue where the store operations might stall if the operand to be stored has not yet been produced by another operation. Subsequent load instructions can bypass the store and immediately access the memory, achieving dynamic reordering of memory accesses. An address comparator detects address hazards and prevents loads from going ahead of stores to the same address. The queue holds three outstanding store operations, so that this structure allows runtime overlapping of tight loops. The structure does not really move a load earlier in the sequential execution stream; instead, it only allows for a load operation not to be delayed as a result of a stalled store operation.
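The behavior of such a store queue can be modeled in a few lines. The following toy model is only a sketch of the behavior described above, not of the 88110 implementation: addresses are assumed to be word-granular indices into a Python list, and the class and method names (StoreQueue, dispatch_store, complete_oldest, try_load) are hypothetical.

```python
from collections import deque


class StoreQueue:
    """Toy model: stalled stores wait for data while loads may bypass them."""

    def __init__(self, memory, capacity=3):  # three entries, as in the text
        self.memory = memory
        self.capacity = capacity
        self.pending = deque()  # addresses of stores whose data is not ready

    def dispatch_store(self, addr):
        """Queue a store; return False if the queue is full (dispatch stalls)."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(addr)
        return True

    def complete_oldest(self, value):
        """The operand of the oldest store is now available; write it."""
        addr = self.pending.popleft()
        self.memory[addr] = value

    def try_load(self, addr):
        """Return (True, value) if the load may bypass, else (False, None)."""
        if addr in self.pending:   # address comparator detects a hazard
            return False, None
        return True, self.memory[addr]
```

A load to an address with no pending store proceeds immediately, whereas a load to the address of a stalled store is held back until that store completes. The load is never moved earlier than its original position; it is only spared the delay of the stalled store.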
The static motion of load/store operations out of loops, under certain conditions, was described by K. Ebcioglu, R. Groves, K. Kim, G. Silberman, and I. Ziv in "VLIW compilation techniques in a superscalar environment", SIGPLAN Conference on Programming Language Design and Implementation (PLDI '94), 1994. This approach is basically a generalization of the static movement of loop-invariant instructions out of loops, with the additional capability of moving loads and stores which are executed conditionally if they are considered safe. The conditions required for this optimization include guaranteeing that there is no possibility of conflicting memory references (ambiguous memory references), which is not always possible.
A compilation technique which allows scheduling of speculative loads without modifying the architecture of the processor is described by D. Bernstein, M. Rodeh and M. Hopkins in their patent application entitled "Instruction scheduler for a computer", Ser. No. 08/364,833 filed Dec. 27, 1994, as a continuation of application Ser. No. 07/882,739 filed May 14, 1992, and assigned to the assignee of this application, now U.S. Pat. No. 5,526,499. In this approach, the suitability of a load operation for speculative execution is determined by classifying it into a number of categories depending on conditions applied to the base register used by the operation and/or the contents of such a base register. Thus, as in the techniques described by K. Ebcioglu et al., supra, this approach is restricted to those cases that can be detected at compile time.
A hybrid memory disambiguation technique called "speculative disambiguation" was proposed by A. Huang, G. Slavenburg, and J. Shen in "Speculative disambiguation: a compilation technique for dynamic memory disambiguation", 21st Intl. Symposium on Computer Architecture, Chicago, pp. 200-210, 1994. This approach uses a combination of hardware and compiler techniques to achieve its objective. It performs transformations on the code to anticipate either outcome of an ambiguous memory reference, requiring guarded execution capabilities in the hardware. For each pair of ambiguous memory references, the compiler creates two versions of the code that depends on the memory reference. One version assumes that the addresses overlap, whereas the other version assumes they do not overlap. In both versions, operations that do not have side effects are executed, while operations that have side effects are guarded by the result of comparing the two addresses. This approach requires more operations and resources than the original program, in addition to capabilities for guarded execution. Moreover, it deals only with disambiguation and does not provide a capability for moving load operations ahead of branches.
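The two-version transformation can be sketched for a short sequence consisting of a store, a possibly conflicting load, and a dependent store. This is an illustrative reading of the technique as summarized above, not code from the cited paper: memory is modeled as a Python list, guarded execution is imitated with an ordinary conditional, and the overlapping version is assumed to use the stored value directly. The function names are hypothetical.

```python
def in_order_version(mem, p, q, r, x):
    """Original sequence: store, dependent load, store with a side effect."""
    mem[p] = x
    t = mem[q]
    mem[r] = t + 1


def speculatively_disambiguated(mem, p, q, r, x):
    """Both versions of the side-effect-free work run; stores are guarded."""
    # Version assuming no overlap: the load is hoisted above the store.
    t_no_overlap = mem[q]
    v_no_overlap = t_no_overlap + 1
    # Version assuming overlap: the load would see the stored value
    # (assumption of this sketch).
    v_overlap = x + 1
    # The address compare produces the guard.
    overlap = (p == q)
    mem[p] = x
    # The operation with a side effect is guarded by the comparison result.
    if overlap:
        mem[r] = v_overlap
    else:
        mem[r] = v_no_overlap
```

The sketch makes the cost visible: the addition is performed twice and an address compare is added, even though only one of the two results is ever committed.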
Another alternative to perform compiler optimization of program execution by allowing load operations to be executed ahead of store operations is described by A. Nicolau in "Run-time disambiguation: coping with statically unpredictable dependencies", IEEE Trans. on Computers, vol. 38, May 1989. This approach relies on compiler identification of a load, which can be moved ahead of a store operation, and compiler insertion of the necessary code, so that the processor can check at run-time if there is a match between the addresses of the load and store operations, as described by A. Huang et al., supra, but without guarded-execution capabilities. If there is no match, the processor executes a sequence of instructions in which the load has been moved ahead of the store. On the other hand, if there is a match, the processor executes a sequence of instructions in which the load operation is performed after the store operation. Since the check for the address match is performed by the processor, this approach leads to potential performance degradation due to the execution of more instructions and their associated dependencies (e.g., the explicit generation of the memory addresses and the address compare). Moreover, the reordered load operation cannot be performed until the memory addresses for both load and store operations have been resolved.
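The code inserted by the compiler under this scheme can be sketched as an explicit address compare that selects between two instruction sequences. The following is a minimal illustration under the assumption that memory is a Python list indexed by word address; the function names are hypothetical and the dependent operation (a doubling) is chosen arbitrarily.

```python
def in_order_sequence(mem, p, q, x):
    """Original order: the store, then the load and its dependent operation."""
    mem[p] = x
    return mem[q] * 2


def run_time_disambiguated(mem, p, q, x):
    """Compiler-inserted compare selects one of two code sequences."""
    if p != q:             # both addresses must be resolved before this point
        t = mem[q]         # no match: load moved ahead of the store
        result = t * 2
        mem[p] = x
        return result
    mem[p] = x             # match: load performed after the store
    return mem[q] * 2
```

The compare and branch are ordinary instructions executed on every pass, which is the source of the overhead noted above, and the hoisted load cannot issue before the compare has both addresses available.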
A method and apparatus for improving the performance of out-of-order operations is described by M. Kumar, M. Ebcioglu, and E. Kronstadt in their patent application entitled "A method and apparatus for improving performance of out-of-sequence load operations in a computer system", Ser. No. 08/320,111 filed Oct. 7, 1994, as a continuation of application Ser. No. 07/880,102 filed May 6, 1992, and assigned to the assignee of this application, now U.S. Pat. No. 5,542,075. This method and apparatus uses compiler techniques, four new instructions, and an address compare unit. The compiler statically moves memory load operations ahead of memory store operations, marking all of them as out-of-order instructions. The addresses of operands loaded out-of-order are saved to an associative memory. On request, the address compare unit compares the addresses saved in the associative memory with the address generated by store operations. If a conflict is detected, recovery code is executed to correct the problem. The system clears addresses saved in the associative memory when there is no longer a need to compare those addresses. This approach only addresses the problem of reordering memory operations. It does not include the ability to speculatively execute memory load operations. Moreover, this approach requires special instructions to trigger the checking for conflicts in addresses, as well as to clear the address of an operand no longer needed, and imposes a burden on the compiler, which has to detect and pair all potential conflicts. As a consequence, this approach cannot cope with conflicts that occur as a result of an unexpected combination of store/load instructions (perhaps produced by error), nor can it be used in a coherent multiprocessor context.
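The interplay of the associative memory, the explicit check, and the recovery code can be sketched as follows. This is a simplified model built only from the description above: the four new instructions are not reproduced, the method names (load_out_of_order, store_and_check, clear) are hypothetical stand-ins, addresses are word-granular list indices, and the recovery code is assumed to reload the operand and repeat the dependent operation.

```python
class AddressCompareUnit:
    """Toy model of an associative memory holding out-of-order load addresses."""

    def __init__(self):
        self.saved = set()  # addresses of operands loaded out of order

    def load_out_of_order(self, mem, addr):
        self.saved.add(addr)       # remember the address for later checks
        return mem[addr]

    def store_and_check(self, mem, addr, value):
        """Perform the store; return True if it conflicts with a saved load."""
        mem[addr] = value
        return addr in self.saved

    def clear(self, addr):
        self.saved.discard(addr)   # address no longer needs to be compared


def run_with_compare_unit(mem, p, q, x):
    acu = AddressCompareUnit()
    t = acu.load_out_of_order(mem, q)   # load moved ahead of the store
    result = t + 1                      # dependent operation also moved
    if acu.store_and_check(mem, p, x):  # check requested by the compiled code
        t = mem[q]                      # recovery code: reload and recompute
        result = t + 1
    acu.clear(q)
    return result
```

Each check and each clear in the sketch corresponds to an operation that the compiler must place explicitly, which illustrates why every potential conflict has to be detected and paired at compile time.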
As a related subject, a hardware mechanism coupled with compiler support is described by G. Silberman and M. Ebcioglu in their patent application entitled "Handling of exceptions in speculative instructions", Ser. No. 08/377,563 filed on Jan. 24, 1995, and assigned to the assignee of this application. This mechanism reduces the overhead from exceptions originated by instructions executed speculatively. The mechanism relies on hardware resources such as an additional bit per register to indicate an exception generated during the speculative execution of an instruction, two additional register files to save the register operands so that speculative instructions invalidated by an exception can be re-executed, as well as information that allows tracing back to the source of the exception. This mechanism is applicable only to speculative instructions, not to reordered memory operations.
A method and apparatus for reordering load instructions is described in the patent application entitled "Memory processor that permits aggressive execution of load instructions" by F. Amerson, R. Gupta, V. Kathal and M. Schlansker (UK Patent Application GB 2265481A, No. 9302148.3, filed on Apr. 2, 1993). This patent application describes a memory processor for a computer system in which a compiler moves long-latency load instructions earlier in the instruction sequence, to reduce the loss of efficiency resulting from the latency of the load. The memory processor saves load instructions in a special register file for a period of time sufficient to determine if any subsequent store instruction that would have been executed prior to the load references the same address as that specified by the load instruction. If so, the memory processor reinserts the original load in the instruction stream so that it gets executed in-order. Thus, this system permits moving loads ahead of stores under compiler control, and relies on hardware to insert code to recover from a conflict. However, this system does not permit reordering other instructions that depend on the load (the hardware resources are able to reinsert only the load instruction), nor does it allow for speculative execution of loads or other instructions. In other words, the method and apparatus is limited to hiding the latency of load instructions, whose maximum value must be known at compile time.