1. Field of the Invention
This invention relates to a loop optimization system in an electronic computer capable of executing plural sequences of instructions in parallel.
2. Description of the Related Art
Conventionally, plural sequences of instructions are designed for serial processing by the electronic computer, which executes instructions, one at a time, according to the instruction sequence.
Pipelining techniques have been introduced for the high-speed processing of an instruction sequence. In this technique, the execution of each instruction is divided into a plurality of stages, and instructions that are in different stages are carried out simultaneously. This approach can shorten the cycle time, thereby shortening the processing time of the entire instruction sequence. Such a pipeline system, however, still executes only one instruction per cycle, which falls short of parallel execution of two or more instructions per cycle.
In recent years, systems that perform parallel execution at the instruction level have been introduced, which enable execution of more than one instruction per cycle. The VLIW (Very Long Instruction Word) system and the superscalar system are two typical examples of this type.
In the VLIW system, a predetermined number of instructions are defined as a single execution unit, and the computer always executes that set of instructions at the same time. Here, the computer does not have to judge whether a plurality of instructions to be executed can be carried out simultaneously, which simplifies control, thus making it possible to construct the system using a small amount of hardware and to attain shorter cycle times. However, it is necessary for a compiler or an expert to judge whether the processing is suitable for parallel execution and, based on the result, to allocate appropriate instructions in advance.
The superscalar system has hardware that interprets a plurality of instructions designed for serial processing to determine whether they can be executed in parallel and that, when it finds parallel execution possible, carries out more than one instruction in parallel. In this system, the judgment of whether instructions are suitable for parallel execution is left to special hardware, which assures that even an ordinary program can be executed, keeping compatibility with serial execution. To improve the performance of the superscalar system, it is essential to schedule instructions, as with the VLIW system, so as to keep the operation units in an operating state as long as possible, depending on information such as hardware resources or data dependence.
For the parallel execution system, and especially as an instruction rearranging system to speed up the execution of loop portions in a program, a software pipelining system is available. This system allocates one iteration of a loop to a unique hardware resource and simultaneously executes virtual iterations of the loop using separate processing units in a pipeline manner.
For example, it is assumed that software pipelining is executed by a VLIW computer composed of a floating point unit, a memory (load, store) unit, an integer unit, and a branch unit. It is also assumed that a floating point add instruction and a multiply instruction have a delay of 2 cycles.
A loop written in the C language in FIG. 1A will be explained.
If scalar variables b and c have already been loaded into registers, the following four processing stages will complete one iteration of the loop:
(1) Loading A[i]
(2) Multiplying A[i]*b
(3) Adding A[i]*b+c
(4) Storing the result
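As a rough illustration, the four stages above can be written out in C. FIG. 1A is not reproduced in this text, so the loop body A[i] = A[i]*b + c, the function name scale_and_offset, and the element count n are assumptions inferred from the stage list, not the figure itself.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed loop body (FIG. 1A is not available here): A[i] = A[i] * b + c.
   The name scale_and_offset and the parameter n are hypothetical. */
void scale_and_offset(double *A, size_t n, double b, double c)
{
    assert(A != NULL || n == 0);   /* precondition for this sketch */

    for (size_t i = 0; i < n; i++) {
        double t = A[i];   /* (1) load A[i]         */
        t = t * b;         /* (2) multiply A[i] * b */
        t = t + c;         /* (3) add A[i] * b + c  */
        A[i] = t;          /* (4) store the result  */
    }
}
```

Each stage consumes the value produced by the stage before it, so within one iteration the four stages cannot overlap; any overlap has to come from different iterations.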
Therefore, a set of instructions for a conventional serial computer may be obtained as shown in FIG. 1B. Since software pipelining handles each stage with a separate unit, the executing state at each clock is as shown in FIG. 1C. In the steady state of loop execution at clocks 7 and 8, load and mul of the (k+3)th iteration, add of the (k+1)th iteration, and store of the kth iteration are processed in a multiplexed manner. This software pipelining allows efficient loop processing without idling the units (for addition, multiplication, loading, and storing) that the hardware has. The VLIW and superscalar systems require the above rearrangement to be carried out by the compiler without modifying the source program.
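The overlap of stages from different iterations can be sketched at the source level in C. This is a simplified model, not the schedule of FIG. 1C: it assumes the loop body A[i] = A[i]*b + c, treats every stage as taking one step (ignoring the 2-cycle add and multiply delays), and uses hypothetical names, so the iteration offsets between stages differ from those in the figure.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified software-pipelined form of the assumed loop A[i] = A[i]*b + c.
   In each step, four stages belonging to four different iterations are
   processed: store (iteration step-3), add (step-2), multiply (step-1),
   and load (step). Function and variable names are hypothetical. */
void scale_and_offset_pipelined(double *A, size_t n, double b, double c)
{
    assert(A != NULL || n == 0);   /* precondition for this sketch */

    double loaded = 0.0, product = 0.0, sum = 0.0;

    /* n + 3 steps in total: the first 3 fill the pipeline (prologue),
       the last 3 drain it (epilogue). */
    for (size_t step = 0; step < n + 3; step++) {
        /* Stages run in reverse order so that each one consumes the
           value its predecessor produced in the previous step. */
        if (step >= 3)
            A[step - 3] = sum;          /* store, iteration step-3 */
        if (step >= 2 && step - 2 < n)
            sum = product + c;          /* add, iteration step-2 */
        if (step >= 1 && step - 1 < n)
            product = loaded * b;       /* multiply, iteration step-1 */
        if (step < n)
            loaded = A[step];           /* load, iteration step */
    }
}
```

On a serial machine this form gains nothing; it only makes visible that, once the pipeline is full, the four stages handled in one step are independent of each other and could be issued to separate units in the same cycle.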
In the loops in FIGS. 1A through 1C, use of software pipelining enables efficient rearrangement, but does not always provide the maximum parallel processing capabilities the processor has. The reason will be explained below.
The portions indicated by `-` in FIG. 1C are delay cycles in calculation. Generally, in the VLIW and superscalar systems, each processing unit often undergoes pipeline control. If an independent operation instruction is available for such a delay cycle, it is possible to supply that instruction to each processing unit to insert it in a delay cycle for higher parallel performance. In this example, however, there is no appropriate operation for this purpose. Thus, even if the original loop has empty instruction slots, because of very simple processing, those slots cannot be filled with suitable instructions, sometimes failing to achieve the maximum parallel performance.
For example, the loop in FIG. 2A is a simple loop with only one addition. Instructions corresponding to the loop of FIG. 2A are shown in FIG. 2B. Three-stage software pipelining of these instructions provides only the executing state in FIG. 2C. In this case, although another independent addition operation could be inserted into the two clocks of the delay cycle of the add operation, those 2 clocks are wasted because no appropriate instructions are available for them. Therefore, for such a loop, direct use of software pipelining in the VLIW or superscalar system cannot provide the maximum parallel processing capabilities, because it does not make full use of the operation units the system has.
As noted above, when the number of operation instructions in one loop iteration, the number of memory access instructions, and the number of operation units the processor has harmonize with each other, loop optimization by software pipelining provides efficient parallel execution. When the loop has a small number of operations, however, some of the operation units in the processor lie idle, resulting in less effective parallel operation.
As described above, conventional software pipelining for the VLIW or superscalar system has a problem: when a loop in a program has a small number of operations compared with the number of operation units the processor has, it is impossible to perform parallel execution using all the operation units in the processor, so that satisfactory parallel processing performance cannot be achieved.
The related literature of this invention includes S. Weiss and J. E. Smith, "A Study of Scalar Compilation Techniques for Pipelined Supercomputers," Proc. of 2nd ASPLOS, 1987, pp. 105-109, and M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, Jun. 22-24, 1988, pp. 318-328.