1. Field of the Invention
This invention relates in general to computer systems and, in particular, to a method and apparatus for improving the efficiency and speed of operation of digital computer systems. More specifically, the invention relates to methods and apparatus to reorder the sequence of instructions in a computer program for faster operation.
2. Background Art
Speed is an important criterion by which a computer is judged. The general long-term trend in the industry is the development of increasingly efficient, higher-speed computers.
One way of increasing the speed of execution of computers is to reduce memory latencies. A memory latency is the time delay between the moment a request is made for an operand from memory at the beginning of an instruction cycle and the time the operand is delivered to the appropriate register in the processor. Given the speed and efficiency with which processors and CPUs can process information at present, memory latencies can cause relatively long delays between the time a processor becomes idle after initiating a memory request and the time it can actually start a work cycle, when the requested data is returned from memory. In fact, a processor can easily spend more of its time waiting for operands or information to work on than it spends processing data.
The ideal situation would be to bring the operand into the appropriate register in the processor as soon as both the operand and the register are available. In many instances this can be done well before the processor has a need for the operand. The operand would then be readily available to the processor when needed, eliminating lost time waiting for the operand to be delivered from memory. However, a number of impediments exist which prevent such early loading of an operand into the appropriate register in the processor.
One such impediment is that a computer must perform its operations in a fairly rigid sequential pattern to maintain the integrity of its output. Performing operations in sequence entails loading the necessary instruction into the processor and loading into the appropriate registers the particular operands necessary for that instruction. The processor then performs the called-for operation on the operand or operands. The computer then stores the results. Once the storing of the results is completed, the processor commences the next instruction cycle by fetching the next instruction, then the operand or operands it needs, and so on, continuing with roughly the same process. Generally, before execution of the next instruction can be commenced, the prior instruction cycle must be completed and its results stored.
Recent developments have resolved some of the aforementioned impediments. One such development involves processors designed with an instruction pipeline or queue feature, which allows the processor to fetch several instructions at a time and have them available before they are needed for execution. Upon completion of each instruction cycle, the computer has available in the processor the next instruction to be executed, so there is no delay caused by waiting for the next instruction from memory. This is a feature common to uniprocessors, as well as more sophisticated processors. Also, almost all high performance computers can overlap the execution of several instructions. To make optimal use of their hardware resources, high performance computers often execute their instructions (or the primitive operations comprising them) in an order different from the one specified in the original program.
There are a number of ways to overlap instructions or operations in the operation of a computer. One alternative is to implement it through hardware alone, but this is a complicated and not too promising method. Another method is to have the compiler reorder the execution of the program through an optimization process; this is by far the more efficient and effective method.
The compiler, during compilation of a program, will reorder the sequence of instructions and form sets of instructions whose execution can be overlapped. The execution of two instructions can overlap and violate the order specified in the program only if the compiler can guarantee, through program analysis, that the storage locations (memory or register locations) used to store the results produced by either of these instructions are different from those used by the other instruction to fetch its input operands and to store its results. In the literature on this subject, these are known as anti, output, and true data-dependence constraints.
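The constraint stated above can be sketched as a small check. The following C fragment is an illustration only and is not taken from any particular compiler: the type Instr, the constant MAX_LOCS, and the function names are hypothetical, and each storage location is assumed to be identified by a small integer. Two instructions are treated as overlappable only if no location written by one is read or written by the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOCS 4  /* assumed upper bound on locations per instruction */

/* Hypothetical model of one instruction: the storage locations it reads
   and the storage locations it writes, each named by an integer id. */
typedef struct {
    int reads[MAX_LOCS];
    size_t n_reads;
    int writes[MAX_LOCS];
    size_t n_writes;
} Instr;

/* Returns true if loc appears among the first n entries of locs. */
static bool contains(const int *locs, size_t n, int loc) {
    for (size_t i = 0; i < n; i++) {
        if (locs[i] == loc) {
            return true;
        }
    }
    return false;
}

/* Returns true if any location written by w is read or written by other. */
static bool writes_conflict(const Instr *w, const Instr *other) {
    for (size_t i = 0; i < w->n_writes; i++) {
        int loc = w->writes[i];
        if (contains(other->reads, other->n_reads, loc) ||
            contains(other->writes, other->n_writes, loc)) {
            return true;
        }
    }
    return false;
}

/* first precedes second in program order.
   first writes, second reads   -> true dependence
   second writes, first reads   -> anti dependence
   both write the same location -> output dependence
   Overlap is allowed only when none of the three is present. */
bool can_overlap(const Instr *first, const Instr *second) {
    assert(first != NULL && second != NULL);
    assert(first->n_reads <= MAX_LOCS && first->n_writes <= MAX_LOCS);
    assert(second->n_reads <= MAX_LOCS && second->n_writes <= MAX_LOCS);
    return !writes_conflict(first, second) && !writes_conflict(second, first);
}
```

For register operands the location ids are known at compile time, so a check of this kind is straightforward. The sketch does not address the harder case in which the locations are memory addresses computed only at run time.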
Enforcing the above mentioned constraints is easy when operands are fetched from registers and results are stored into registers. However, when the operand is fetched from memory or the results are stored into memory, the task is much more difficult. The problem is particularly acute when the addresses of memory locations referenced by the instructions are to be computed when the program is executed by the computer. In this situation, determining whether two address calculations will yield the same address at the time the program is executed is theoretically impossible or requires complex program analysis capabilities not expected from compilers in the foreseeable future. Also, compile time enforcement of data dependencies becomes exceedingly conservative when indirect addressing or pointers are used to access data.
Experience with program analysis techniques indicates that in most situations where a compiler cannot determine whether two addresses to be computed and used in the program will be identical or different at the time of program execution, there is an extremely high probability that the addresses will turn out to be different. However, the compiler is forced to assume, for the sake of program integrity and correctness, that they will be the same, resulting in a significant loss in available parallelism, operating efficiency and speed. It would be an unsafe compiler optimization for the compiler to do otherwise and assume there will be no address conflict.
The problem, simply stated, is how to allow a compiler to fully exploit the potential for out-of-sequence operand fetching. An out-of-sequence operand fetch is generally a load of an operand that is placed ahead of one or more store operations in the compiled form of the program, where the load came after those store operations in the unoptimized form of the program.
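A minimal C sketch may make the hazard concrete. The function names below are hypothetical and chosen only for illustration. The first function performs a store followed by a load in program order; the second performs the same load out of sequence, ahead of the store. The two return the same value when the pointers refer to different locations, and different values when they refer to the same location, which is why a compiler that cannot distinguish the two cases must keep the original order.

```c
#include <assert.h>
#include <stddef.h>

/* Program order: store through p, then load through q. */
int store_then_load(int *p, const int *q, int value) {
    assert(p != NULL && q != NULL);
    *p = value;      /* store */
    return *q;       /* load sees the stored value if q aliases p */
}

/* Reordered: the load through q is moved ahead of the store through p.
   This is equivalent to the function above only when p and q refer to
   different locations. */
int load_then_store(int *p, const int *q, int value) {
    assert(p != NULL && q != NULL);
    int loaded = *q; /* out-of-sequence load */
    *p = value;      /* store */
    return loaded;   /* stale if q aliases p */
}
```

With distinct locations both functions return the old contents of the location addressed by q. When both pointers address the same integer, store_then_load returns the newly stored value while load_then_store returns the value that was there before the store.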
One attempt to have the compiler optimize the execution of a program by allowing load operations to be executed out of sequence ahead of store operations is described in "Run-Time Disambiguation: Coping with Statically Unpredictable Dependencies", by Alexandru Nicolau; IEEE Transactions On Computers, Vol. 38, No. 5, May 1989. Nicolau's article describes a process in which a compiler identifies when a load can possibly be moved ahead of a store operation, then inserts the necessary code so that the processor can check, at the time the program is executed, whether the address of the store matches the address of the load. If there is no match, the processor executes a branch to an optimized sequence of instructions in which the load has been moved ahead of the store. If there is a match, the processor branches to safe code which does not allow the load operation to be moved ahead of the store operation. Having the processor do the checking to determine if there is a match between the address from which the load originated and the address assigned to the store during program execution is a hindrance to increasing speed of execution. In fact, a program sometimes appears to take longer to execute in Nicolau's arrangement than when it is run in its original unoptimized sequence. Nicolau notes this fact in his article. One of the difficulties with the process of Run-Time Disambiguation is that the processor or CPU must do all of the work, including the comparison of the addresses, a process for which the CPU is not suited. Each Arithmetic and Logic Unit (ALU) in the CPU can compare only one address of a load operation with one address of a store operation at a time. The CPU must generate the address of the store operation and compare it to the address of the load operation before the load operation is moved out of sequence and executed.
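The general shape of the code that such a scheme inserts can be sketched in C as follows. This is a simplified illustration of the idea described above, not code from Nicolau's article: the function name is hypothetical, and the sketch assumes that both accesses are to aligned objects of the same size, so that unequal addresses imply no overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Original program order: *store_addr = value; result = *load_addr + 1;
   The address comparison below is the run-time check inserted by the
   compiler in this hypothetical sketch. */
int rtd_store_load(int *store_addr, const int *load_addr, int value) {
    assert(store_addr != NULL && load_addr != NULL);
    if (store_addr != load_addr) {
        /* No match: optimized sequence, load moved ahead of the store. */
        int loaded = *load_addr;
        *store_addr = value;
        return loaded + 1;
    } else {
        /* Match: safe sequence, original order preserved. */
        *store_addr = value;
        return *load_addr + 1;
    }
}
```

Both branches produce the result that the original program order would produce. The cost noted above is visible in the sketch: the comparison and the branch are executed by the processor on every pass through the code, whether or not the addresses ever match.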