1. Field of the Invention
The present invention relates to a data processing system, and more specifically to a pipelined instruction processing system having a plurality of instruction processing units located in parallel for executing a plurality of instructions in parallel and each having a pipelined mechanism for high speed data processing.
2. Description of Related Art
In order to improve the performance of a computer system, there is known a parallel pipelined instruction processing system which combines instruction pipelined processing with VLIW (very long instruction word) type parallel instruction processing.
The VLIW type parallel instruction processing is such that a relatively long instruction including a plurality of instruction fields (called an "instruction block" hereinafter) is processed as one instruction. A VLIW type parallel computer system treats each instruction block by dividing the instruction block into a plurality of fields and by processing the plurality of fields in parallel to each other by independently controlling a number of operation units, registers, interconnecting networks, memories and others.
In brief, at compile time, a plurality of instructions which can be processed in parallel are extracted from a source program and combined to form one instruction block. Therefore, if a degree of parallelism close to the number of parallel processing units can be obtained, high speed processing can be attained. However, if the degree of parallelism is low, one or more empty instruction fields occur, with the result that the processing performance is decreased. In practice, the extent to which the instruction fields can be filled depends upon the capability of the compiler and upon the source program.
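By way of illustration only, the effect of empty instruction fields can be sketched as follows. The sketch assumes that the compiler has already grouped the instructions into sets that are mutually independent; the names `pack_blocks`, `utilization` and `NOP`, and the instruction mnemonics, are hypothetical and do not appear in the specification.

```python
from typing import List

NOP = "nop"  # hypothetical marker for an empty instruction field


def pack_blocks(groups: List[List[str]], width: int) -> List[List[str]]:
    """Pack each group of mutually independent instructions into
    instruction blocks of `width` fields, padding unused fields with NOP."""
    if width <= 0:
        raise ValueError("width must be positive")
    blocks = []
    for group in groups:
        for i in range(0, len(group), width):
            chunk = group[i:i + width]
            blocks.append(chunk + [NOP] * (width - len(chunk)))
    return blocks


def utilization(blocks: List[List[str]]) -> float:
    """Fraction of instruction fields that hold a real instruction."""
    total = sum(len(block) for block in blocks)
    used = sum(1 for block in blocks for field in block if field != NOP)
    return used / total if total else 0.0


# Three groups with 4, 1 and 2 parallel instructions, four fields per block:
# 7 of the 12 fields are filled, and the rest are empty.
example = pack_blocks([["add", "mul", "ld", "st"], ["sub"], ["ld", "add"]], 4)
```

When the available parallelism matches the block width, as in the first group, no field is wasted; the second and third groups show how low parallelism leaves fields empty.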
In the VLIW system, however, since parallelization of instructions is performed at compile time, it is not necessary for the hardware to carry out complicated processing such as detection of mutual dependence between items of data. Therefore, the hardware can be simplified.
This VLIW system can be said to be based on an idea originating from a horizontal microinstruction system, and to be suitable for an elaborate parallel processing using a plurality of processing units having a low degree of function (low level parallel processing).
In general, a process of executing a machine instruction in the computer system is achieved by sequentially performing an instruction fetching (abbreviated as "IF" in the specification and in the accompanying drawings), an instruction decoding (abbreviated as "ID"), an operand address generation (abbreviated as "AG"), an operand fetching (abbreviated as "OF"), execution of operation (abbreviated as "EX"), and a writing back of the result of the execution of operation (abbreviated as "WB") in the named order. The instruction pipelined processing is realized by dividing the above mentioned processing into a plurality of processing stages, providing dedicated hardware for each of the processing stages, and causing the hardware of each processing stage to execute its assigned processing in parallel with the hardware of the other processing stages.
As seen from the above, in the instruction pipelined system, the respective processing stages operate so as to overlap with each other. Therefore, if the execution time of each processing stage is the same and if the machine cycle of each processing stage is the same, the instruction pipelined system can exhibit its maximum performance, and a result of operation can be obtained in every machine cycle.
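The benefit of overlapped stages can be illustrated with a simple cycle count. The following is a minimal sketch which assumes an ideal pipeline of six equal-length stages (IF, ID, AG, OF, EX, WB) with no disturbance; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "AG", "OF", "EX", "WB"]


def pipelined_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Machine cycles for an ideal pipeline: the first instruction needs
    n_stages cycles, and each later instruction completes one cycle after
    the one before it."""
    if n_instructions <= 0:
        return 0
    return n_stages + n_instructions - 1


def sequential_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Machine cycles when each instruction passes through every stage
    before the next instruction starts."""
    return max(n_instructions, 0) * n_stages


# For 100 instructions: 105 cycles pipelined versus 600 cycles sequential.
```

For a long instruction stream the pipelined count approaches one machine cycle per instruction, which is the maximum performance referred to above.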
At present, it has been considered that a flow of the instruction pipelined processing is disturbed:
(a) when a succeeding instruction requires the result of execution of a preceding instruction;
(b) when a preceding instruction determines an operand address for a succeeding instruction;
(c) when a branch is generated;
(d) when memory accesses conflict with each other;
(e) when a preceding instruction rewrites a content of a succeeding instruction;
(f) when an interrupt or a branch occurs;
(g) when an instruction is so complicated as to need a plurality of machine cycles for execution of a required operation.
In order to suppress the above mentioned factors disturbing the instruction pipelined processing, various improvements have been attempted. For example, in order to suppress disturbance of the pipelined processing caused by a conditional branch, there have been proposed a loop buffer system using a large instruction buffer capable of storing a program loop, a plural instruction flow system processing an instruction stream when a condition for the conditional branch is satisfied and another instruction stream when the condition for the conditional branch is not satisfied, and a branch estimation system estimating a branch on the basis of a history of branch instructions.
In any case, in the pipelined processing, whether or not a branch condition for a conditional branch instruction is satisfied cannot be known until the processing reaches a later stage of the pipelined processing. If the branch condition is satisfied, it is necessary to invalidate the instructions already fetched into the instruction pipelined system, and to fetch a flow of instructions starting from a branch destination instruction. In other words, at least one empty machine cycle is generated. Accordingly, when the branch condition is satisfied and the branch is executed, the instruction pipelined processing is delayed and the overall processing capacity is decreased.
In order to minimize this decrease of performance, a delayed branch mechanism has been used. In this delayed branch mechanism, a branch instruction is treated as a delayed type instruction which takes effect one machine cycle later than issuance of the branch instruction, and the instruction slot immediately after the branch instruction is filled with a valid instruction by means of instruction scheduling performed by a compiler, so that disturbance of the pipelined processing is avoided and the decrease of the performance is prevented.
However, if the instruction slot immediately after the branch instruction cannot be filled with a valid instruction, it is necessary to place a NOP (no-operation) instruction in that instruction slot. In this case, of course, the performance is decreased.
The degree to which the delayed instruction slots can be filled with valid instructions depends upon the performance of the compiler. At present, it has become possible to effectively utilize about 80% to 90% of the delayed instruction slots by using recent compiler techniques.
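The cost of unfilled delayed instruction slots can be estimated with a simple model. The sketch below assumes that each unfilled slot costs exactly one machine cycle and that the fill rate is an average over all branches; the function name and parameters are hypothetical.

```python
def wasted_cycles_per_branch(fill_rate: float, delay_slots: int = 1) -> float:
    """Expected number of delayed instruction slots per branch that hold
    a NOP, given the fraction of slots the compiler fills usefully."""
    if not 0.0 <= fill_rate <= 1.0:
        raise ValueError("fill_rate must lie between 0 and 1")
    if delay_slots < 0:
        raise ValueError("delay_slots must not be negative")
    return (1.0 - fill_rate) * delay_slots


# With one delayed slot and a fill rate of 80% to 90%, roughly 0.1 to 0.2
# machine cycles are lost per branch on average.
low = wasted_cycles_per_branch(0.9)
high = wasted_cycles_per_branch(0.8)
```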
The above mentioned VLIW type parallel instruction processing and the above mentioned instruction pipelined processing are combined to form a parallel pipelined instruction processing system.
In this parallel pipelined instruction processing system, a field exclusively used for a branch instruction is provided in an instruction format, and a branch instruction processing unit is provided separately from other instruction processing units, so that the processing for the branch instruction is speeded up. Accordingly, the branch instruction and other instructions are processed in parallel.
Conventionally, since an operation instruction and a load/store instruction have been processed in parallel, the parallel pipelined instruction processing system has been such that a processing unit for an operation instruction, a processing unit for a load/store instruction and a processing unit for a branch instruction can operate in parallel to each other.
Because the processing unit used for exclusively processing the branch instruction has been added, a conditional branch instruction can be executed at a high speed, and when a branch condition is satisfied, no delay is generated by the branching. Therefore, the processing flow of the pipelined processing is not disturbed. In addition, since the independent instruction processing units are controlled in parallel to each other, a VLIW type instruction composed of a plurality of instruction fields controlling the respective instruction processing units has been adopted as an instruction format.
The performance of the parallel pipelined instruction processing system is dependent upon how many instructions are filled into each instruction block. For optimizing a program, a local optimization and a wide area optimization have been known. Now, consider a flow of basic operations which has no branch operation except at an exit of the flow, and which receives no branch from an external source except at an inlet of the flow. This is called a "basic block" in the specification. The local optimization is to check dependency between items of data in each basic block, to detect basic operations which can be executed in parallel, and to combine the detected basic operations into a VLIW type instruction. On the other hand, the wide area optimization is an optimization accompanied by movement of basic operations between basic blocks.
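The notion of a basic block can be made concrete with the conventional "leader" method of partitioning a program, which is offered here only as an illustration and is not taken from the specification. The sketch assumes a hypothetical program representation in which each instruction is a pair of a mnemonic and an optional branch target index, and in which every branch target is a valid index into the program.

```python
from typing import List, Optional, Tuple

# (mnemonic, index of branch target, or None for a non-branch instruction)
Instr = Tuple[str, Optional[int]]


def basic_blocks(program: List[Instr]) -> List[List[int]]:
    """Split a program into basic blocks, returned as lists of
    instruction indices. A block starts at the first instruction, at
    every branch target, and immediately after every branch."""
    if not program:
        return []
    leaders = {0}
    for i, (_, target) in enumerate(program):
        if target is not None:
            leaders.add(target)          # inlet: a branch arrives here
            if i + 1 < len(program):
                leaders.add(i + 1)       # exit: the block ends at the branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]


program = [
    ("ld", None), ("add", None), ("beq", 5),
    ("sub", None), ("st", None),
    ("mul", None), ("jmp", 0),
]
# basic_blocks(program) yields [[0, 1, 2], [3, 4], [5, 6]]
```

Each resulting block has a single inlet and a single exit, so a local optimization may reorder or combine the basic operations inside it without regard to control flow.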
However, in the parallel pipelined instruction processing system, conditional branch instructions are numerous, and the length of each basic block is short. As a result, a large effect cannot be obtained by these optimization procedures.
In view of the above, the parallel pipelined instruction processing system has conventionally used an optimization procedure called a "trace scheduling method", which is very effective in the case that the outcome of a conditional branch is strongly biased toward one direction (for example, in application to chemical calculation and the like).
In the above mentioned parallel pipelined instruction processing system, the instructions other than a conditional branch instruction in an instruction block including the conditional branch instruction are executed regardless of whether or not the branch condition is satisfied. Therefore, the remaining fields of an instruction block including a conditional branch instruction must be filled with instructions that can be executed independently of satisfaction/failure of the branch condition.
Now, consider a parallel pipelined instruction processing system in which four instruction pipelined processing units are arranged in parallel so as to execute a VLIW type instruction having four fields. One of the four parallel processing units is used exclusively for processing branch instructions, so that a parallel pipelined instruction processing system having no delay caused by the conditional branch is established.
In this case, the instruction block including the conditional branch instruction has three instruction fields other than the conditional branch instruction. As mentioned hereinbefore, it is necessary to fill the three fields with instructions that can be executed regardless of satisfaction/failure of the branch condition. If such instructions cannot be found, up to three empty fields occur.
Now, consider the delayed branch mechanism for the parallel pipelined instruction processing system having one machine cycle of branch delay. This parallel pipelined instruction processing system has a delayed instruction slot corresponding to one slot. Since one instruction is composed of four instruction fields, the delayed instruction slot equivalently corresponds to four instructions. In addition, three instruction fields of the instruction itself having the branch instruction need to be treated similarly to the delayed instruction slot in view of the instruction dependency. Therefore, the four-parallel pipelined instruction processing system can be considered to be equivalent to a serial pipelined instruction processing system having seven delayed instruction slots.
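The count of seven equivalent delayed instruction slots follows from simple arithmetic, which can be written out as below. The function name and parameters are hypothetical and serve only to restate the reasoning of the preceding paragraph.

```python
def equivalent_delay_slots(fields_per_block: int, delay_cycles: int) -> int:
    """Number of instruction fields that must be filled around a branch:
    every field of each delayed instruction block, plus the fields that
    share an instruction block with the branch instruction itself."""
    if fields_per_block < 1 or delay_cycles < 0:
        raise ValueError("invalid block width or branch delay")
    delayed_fields = fields_per_block * delay_cycles
    fields_beside_branch = fields_per_block - 1
    return delayed_fields + fields_beside_branch


# Four fields and one machine cycle of branch delay: 4 * 1 + 3 = 7 slots.
# A serial pipeline (one field) with the same delay has a single slot.
```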
As mentioned hereinbefore, in the delayed branch mechanism for non-parallel instruction pipelined processing, the probability of filling even one valid instruction into the instruction slot immediately after the branch instruction is about 80% to 90%, even if current compiler technology is used. In view of this circumstance, it is extremely difficult to effectively utilize the seven delayed instruction slots. Accordingly, the parallel pipelined instruction processing system utilizing the conventional pipelined mechanism having the branch delay of one machine cycle decreases its processing performance when a branch instruction is executed.
Even in the parallel pipelined instruction processing system which does not utilize the delayed branch, it is necessary to fill executable instructions into the three instruction fields of an instruction block having a conditional branch instruction by means of the instruction scheduling. In practice, NOP instructions have to be filled into most of these instruction fields. Consequently, since the number of empty instruction fields in the instruction block including the conditional branch instruction increases, the processing performance is remarkably decreased when the instruction block including the conditional branch instruction is executed.
As mentioned above, the trace scheduling is effective in optimizing the basic operations on a long trace. On the other hand, since blocks are frequently copied, the code size of a program becomes very large.