1. Field of the Invention
The present invention relates to parallel processing apparatuses, and particularly to a superscalar-type processor. More particularly, the present invention relates to a control scheme for executing and invalidating instructions which are supplied to pipelines (functional units) after a branch instruction in a superscalar-type processor.
2. Description of the Background Art
With the progress of semiconductor technology in recent years, the performance and operating speed of microprocessors have increased. Although the operating speed of semiconductor memories has also increased, their speed improvements have not kept pace with those of microprocessors, and access to semiconductor memory has become a bottleneck in speeding up a processor. Therefore, the performance of microprocessors has been further enhanced by performing parallel processing. A superscalar is one of the systems for realizing such parallel processing. As illustrated in FIG. 1, a processor of the superscalar type (hereinafter simply referred to as a superscalar) is constructed such that a scheduler (normally provided in an instruction decoder) 200 in the processor detects parallelism in an instruction stream and supplies instructions which can be processed in parallel to pipelines (functional units) P1, P2, and P3 provided in parallel. A superscalar may be said to be a computer (or a processor) having the following characteristics.
(1) It simultaneously fetches a plurality of instructions.
(2) It includes a plurality of functional units (pipelines) and is capable of executing simultaneously a plurality of instructions.
(3) It detects, among the fetched plurality of instructions, those which can be executed simultaneously, and supplies them to corresponding functional units.
FIG. 2 is a diagram illustrating the general structure of a superscalar. Referring to FIG. 2, a superscalar comprises a plurality of functional units 4, 5, 6, and 7 each executing a predetermined function; an instruction fetch (IF) stage 2 simultaneously fetching a plurality of instructions from an instruction memory 1; an instruction decode stage 3 simultaneously receiving the instructions fetched by instruction fetch stage 2, detecting instructions which can be executed simultaneously, and supplying the detected instructions to corresponding functional units; and a data memory 8 for storing operational processing results and the like.
Instruction memory 1 generally includes a cache memory and an external main memory and stores instructions necessary for program execution.
Instruction fetch stage 2 provides an instruction pointer IP to instruction memory 1 and fetches simultaneously a plurality of instructions corresponding to the instruction pointer IP from instruction memory 1.
Instruction decode stage 3 includes an instruction decoder and a pipeline sequencer. The instruction decoder receives and decodes the plurality of instructions fetched by instruction fetch stage 2. The pipeline sequencer (an instruction scheduler) identifies the machine types of the decoded instructions and simultaneously issues instructions of different machine types to the corresponding functional units. The machine type indicates in which functional unit an instruction should be processed.
Functional units 4-7 are pipelined and execute the received instructions in response to a clock signal. Referring to FIG. 2, four functional units are illustrated as an example, and four instructions at a maximum can be processed in parallel.
Functional units 4 and 5 are integer arithmetic operation units performing integer addition and the like, and each includes an execution stage (EX) and a write stage (a machine state changing stage; WB). The write stage (WB) writes the processing result of an instruction executed in the execution stage into a data register (not shown).
Functional unit 6 is a unit executing access (loading or storing of data) to data memory 8 and includes an address generating stage (ADR), a memory access stage (MEM), and a write stage (WB) for writing data into a data register (not shown). In the write stage (WB) of functional unit 6, data loaded from data memory 8 is written into the register, or data to be stored in data memory 8 is read from the register.
Functional unit 7 is a unit executing floating point arithmetic operations and includes three execution stages (EX1, EX2, and EX3) and a write stage (WB) for writing execution results into a data register (not shown). A floating point number is a number represented using an exponent and a mantissa, in which the position of the decimal point is not fixed. A floating point arithmetic operation is an operation on floating point numbers; it enables arithmetic on numbers of a wider range than an integer arithmetic operation, while it requires more cycles for operational processing than an integer arithmetic operation.
In the superscalar, instruction fetch stage 2, instruction decode stage 3, and the functional units (4-7) are also pipelined, and these stages operate overlapping with each other. Accordingly, when there is no blank in the pipelines, each stage is supplied with the data or instructions processed in the preceding cycle. For example, the instruction decoded by instruction decode stage 3 is the instruction fetched in the preceding cycle.
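This overlap can be pictured with a minimal, hypothetical model (not part of the described processor) in which each instruction enters the IF, ID, EX, and WB stages one cycle behind its predecessor, assuming an issue width of one and no blanks in the pipeline:

```python
# Simplified stage-occupancy model of an overlapped pipeline.
# Assumption (illustrative only): four stages, one instruction
# enters the pipeline per cycle, no stalls.

def stage_of(insn_index, cycle):
    """Stage occupied by instruction `insn_index` in `cycle`, or '-' if none."""
    stages = ["IF", "ID", "EX", "WB"]
    k = cycle - insn_index          # each instruction starts one cycle later
    return stages[k] if 0 <= k < len(stages) else "-"

# In each cycle, the stage decoding an instruction holds the one
# fetched in the preceding cycle, as described in the text.
for c in range(4):
    print([stage_of(i, c) for i in range(3)])
```

The printout shows, for each cycle, that while one instruction is being decoded, the next is already being fetched.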
FIG. 3 is a schematic diagram illustrating the structure of the instruction decode stage. Referring to FIG. 3, instruction decode stage 3 comprises four decode circuits D1-D4 provided in parallel and a pipeline sequencer SC responsive to the decoding results of decode circuits D1-D4 for detecting instructions which can be processed in parallel for issuing the instructions to related functional units.
Decode circuits D1-D4 are provided corresponding to instructions M1-M4 simultaneously fetched from instruction memory 1 to decode corresponding instructions for transmitting the decoding results to the pipeline sequencer SC. The instruction FM supplied to instruction decode stage 3 also includes addresses (logical addresses in instruction memory 1 supplied from the instruction fetch stage) A1-A4 corresponding to respective instructions.
When a branch instruction is included in the fetched instruction FM, the pipeline sequencer SC controls generation of a branch according to the branch instruction, setting of a branch target address in the instruction fetch stage, and supply of instructions subsequent to the branch instruction to functional units. Now, operation will be simply described with reference to FIGS. 2 and 3.
Instruction decode stage 3 supplies an instruction fetch request to instruction fetch stage 2. Instruction fetch stage 2 supplies an instruction pointer IP to instruction memory 1 in response to the instruction fetch request, to fetch a plurality of instructions corresponding to the instruction pointer IP from instruction memory 1. The fetched instructions M1-M4 are simultaneously supplied to the decode circuits D1-D4 included in instruction decode stage 3. The decode circuits D1-D4 decode simultaneously the supplied plurality of instructions.
The pipeline sequencer SC detects, among the instructions decoded in decode circuits D1-D4, instructions which can be processed in parallel, that is, instructions whose calculation resources and data registers do not conflict with each other, and issues those instructions to the corresponding functional units.
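The conflict check performed by the sequencer can be sketched as follows. This is a minimal illustration only, assuming a simplified instruction encoding of (destination register, source registers); the actual hardware detection logic is not specified here:

```python
# Hypothetical sketch of in-order parallel-issue detection: instructions
# join the issue group only while their destination and source registers
# do not conflict with a register written earlier in the same group.

def issuable_group(decoded):
    """Return the leading subset of `decoded` that can issue in parallel."""
    group, written = [], set()
    for dest, srcs in decoded:
        # Read-after-write: a source was written by an earlier group member.
        # Write-after-write: the destination is written twice in one group.
        if written & (set(srcs) | {dest}):
            break                  # in-order issue: stop at the first conflict
        group.append((dest, srcs))
        written.add(dest)
    return group

# Example: the second instruction reads R1, which the first one writes,
# so only the first instruction can issue in this cycle.
insns = [("R1", ["R2", "R3"]), ("R4", ["R1", "R5"])]
print(len(issuable_group(insns)))  # 1
```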
The functional units to which the instructions are issued execute processing in parallel in accordance with the issued instructions. Functional units 4-7 are pipelined, and processing is executed through each of the execution stages and write stages illustrated in FIG. 2.
Operations of instruction fetch stage 2, instruction decode stage 3, and the instruction executing stage (functional units 4-7) are also pipelined, so that they execute predetermined operations overlapping with each other.
As described above, it is possible to execute instructions at a higher speed by pipelining the operation of each of the stages and the functional units and by executing processing in parallel in a plurality of functional units.
Examples of a superscalar-type processor are shown in (1) S. McGeady, "The i960CA Superscalar Implementation of the 80960 Architecture", Proceedings of 35th COMPCON, IEEE, 1990, pp. 232-240 and (2) R. D. Groves et al., "An IBM second generation RISC Processor Architecture", Proceedings of 35th COMPCON, IEEE, 1990, pp. 166-172.
The prior art (1) discloses a processor having three functional units, REG, MEM, and CTRL, which is capable of executing in parallel three of four simultaneously fetched instructions.
The prior art (2) discloses a processor comprising a fixed-point processor, a floating-point processor, a branch processor, and a control unit, in which four instructions are simultaneously fetched and four instructions can be simultaneously executed.
As described above, in a superscalar, a plurality of instructions are fetched and a plurality of instructions are simultaneously executed, so that it is possible to attain a processing speed higher than in a normal processor.
Referring to the structure illustrated in FIG. 2, in the case where four simultaneously fetched instructions (M1-M4; see FIG. 3) are executed in parallel in the four functional units 4-7, for example, it is possible to process those four instructions in four clock cycles (with the pipelines of functional units 4, 5, and 6 in a waiting state until processing by functional unit 7 is completed).
While the instruction scheduler (or the pipeline sequencer SC included in the instruction decode stage) executes scheduling of instructions so that parallel processing is efficiently executed, simultaneously fetched instructions are not always simultaneously processed in functional units.
FIG. 4 is a diagram illustrating an example of instructions issued from the instruction decode stage. Issuance of instructions from the instruction decode stage will be described in the following with reference to FIG. 4.
First, in a cycle 1, the fetched four instructions are decoded. Instructions 2-4 can not be processed in parallel with an instruction 1, so that only instruction 1 is issued to a functional unit.
Instructions 2 and 3 can be simultaneously processed, while instruction 4 can not be processed in parallel with instruction 2 and/or instruction 3 because of a dependence, for example, such that it utilizes a processing result of instruction 2 or instruction 3. In a cycle 2, only instruction 2 and instruction 3 are issued.
In a cycle 3, the remaining instruction 4 is issued. In a cycle 4, an instruction 5 and an instruction 6 of four instructions newly fetched are issued as instructions which can be simultaneously processed.
The order of issuing instructions is such that, when instructions can not be processed in parallel, an instruction whose address is smaller is issued with priority.
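The issue pattern of FIG. 4 can be reproduced with a small, hypothetical scheduling sketch. Here each instruction is described only by the set of earlier instructions it depends on, which is a simplification standing in for the register-conflict detection described above:

```python
# Sketch of in-order, dependence-respecting issue (illustrative only).
# deps[i] = set of instruction indices that instruction i must wait for;
# an instruction may issue only in a cycle after all its dependences issued.

def issue_schedule(deps, width=4):
    """Return, per cycle, the list of instruction indices issued."""
    issued, schedule = set(), []
    while len(issued) < len(deps):
        group = []
        for i, d in enumerate(deps):
            if i in issued:
                continue
            if d <= issued and len(group) < width:
                group.append(i)
            else:
                break              # in-order: later instructions cannot overtake
        if not group:              # no progress: remaining instructions blocked
            break
        schedule.append(group)
        issued |= set(group)
    return schedule

# FIG. 4 pattern: instructions 2 and 3 depend on instruction 1,
# and instruction 4 depends on instructions 2 and 3 (0-based indices).
deps = [set(), {0}, {0}, {1, 2}]
print(issue_schedule(deps))  # [[0], [1, 2], [3]]
```

The output matches the figure: one instruction in the first cycle, two in the second, and the remaining dependent instruction in the third.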
There is also a case where instructions can not be simultaneously issued even when no such data dependence exists: the case where a branch instruction is included in the fetched instructions. An instruction succeeding the branch instruction has its validity determined depending on whether a branch is generated according to the branch instruction, so that it can not be issued until the state is determined according to the branch instruction. Whether a branch is generated according to the branch instruction is determined in the instruction decode stage. Now, consider the cases where it can not be determined, in the cycle in which the branch instruction is supplied to the instruction decode stage, whether a branch is generated. The case of a conditional branch instruction is one such case. A specific case will be described in the following.
The branch instruction is an instruction such that "a branch is generated in the case where the content of a register is 0, and no branch is generated in other cases." However, the register does not have a correct value unless writing according to another, preceding instruction is ended.
In such a case, it is necessary that the branch instruction has its execution delayed until writing into the register according to the preceding instruction is ended.
Specific instructions described in the following are considered as such instructions.
(1) load R1, 50 (R2)
(2) brz R1, label
(3) add R4, R5, R6
(4) sub R7, R8, R9
The instruction (1) is an instruction that the data at the address in data memory 8 obtained by adding 50 to the content of a register R2 is loaded into a register R1.
The instruction (2) is an instruction that a branch to "label" is generated if the content of register R1 is 0.
The instruction (3) is an instruction that the content of a register R6 is added to the content of a register R5, and the result of the addition is written into a register R4.
The instruction (4) is an instruction that the content of a register R9 is subtracted from the content of a register R8, and the result of the operation is written into a register R7.
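The register-transfer behavior of these four instructions can be made concrete with a hypothetical interpreter. The mnemonics follow the text, but the encoding, register file, and memory below are illustrative assumptions, not the processor's actual format:

```python
# Toy interpreter for the four example instructions (illustrative only).
# Registers and memory are plain dictionaries; `execute` returns the
# branch target label when a branch is generated, otherwise None.

def execute(insn, regs, mem):
    op, *args = insn
    if op == "load":          # load Rd, ofs(Rs): Rd <- mem[Rs + ofs]
        rd, ofs, rs = args
        regs[rd] = mem[regs[rs] + ofs]
    elif op == "brz":         # brz Rs, label: branch generated iff Rs == 0
        rs, label = args
        return label if regs[rs] == 0 else None
    elif op == "add":         # add Rd, Ra, Rb: Rd <- Ra + Rb
        rd, ra, rb = args
        regs[rd] = regs[ra] + regs[rb]
    elif op == "sub":         # sub Rd, Ra, Rb: Rd <- Ra - Rb
        rd, ra, rb = args
        regs[rd] = regs[ra] - regs[rb]
    return None

regs = {"R2": 100, "R5": 7, "R6": 3, "R8": 9, "R9": 4}
mem = {150: 0}                # mem[100 + 50] = 0, so the branch is generated
execute(("load", "R1", 50, "R2"), regs, mem)
print(execute(("brz", "R1", "label"), regs, mem))  # label
```

The example also makes the dependence visible: the outcome of "brz" can only be evaluated after the "load" has written R1.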
In the case where such four instructions are provided to the instruction decode stage, whether a branch is generated according to the branch instruction "brz" of instruction (2) can not be determined until data is written into register R1 in accordance with the instruction "load" of instruction (1). Instructions (3) and (4) have their validity determined depending on whether a branch is generated according to the branch instruction "brz" of instruction (2). Specifically, if it is determined that a branch is generated according to instruction (2), instructions (3) and (4) are not to be executed but are to be invalidated. On the other hand, if it is determined that no branch is generated according to instruction (2), instructions (3) and (4) are valid and should be supplied to functional units to be executed. A procedure for issuing instructions which can be considered in this case is illustrated in FIG. 5.
FIG. 5 is a diagram illustrating instruction issuing conditions in the case where a branch instruction is included in the fetched instructions. Description will be given in the following on issuance of instructions in the case where the above-described branch instruction "brz" is included, with reference to FIG. 5.
In a cycle 0, instructions (1)-(4) are fetched to be provided to the instruction decode stage 3.
In a cycle 1, the instructions (1)-(4) are decoded.
In a cycle 2, instruction (1) is issued to a functional unit (functional unit 6 shown in FIG. 2) and executed. Specifically, an address of the content of register R2 with 50 added thereto is generated in cycle 2. At this time, instructions (2)-(4) are not issued to functional units and held in instruction decode stage 3. Referring to FIG. 5, (ID) indicates a held (waiting) state of each instruction in the instruction decode stage.
In a cycle 3, access to data memory 8 is performed in accordance with instruction (1). The content of register R1 is not yet determined at this time, so that instructions (2)-(4) are held in instruction decode stage 3.
In a cycle 4, writing into data register R1 is performed in accordance with instruction (1), and the content of register R1 is determined.
Whether a branch is generated according to instruction (2) is determined in accordance with that data writing, and it is determined that no branch should be generated.
In a cycle 5, instructions (3) and (4) are issued to functional units (4, 5) and executed.
In a cycle 6, the execution results of instructions (3) and (4) are written into the data register.
In the case where the content of data register R1 is determined in cycle 4 and it is determined that a branch should be generated in accordance with instruction (2), instructions (3) and (4) are not issued but are invalidated, and an operation of fetching the branch target instruction is executed in cycle 5.
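The cycle counts above follow directly from the stage latencies in the text, as a back-of-the-envelope sketch shows. The function below is purely illustrative, assuming the load occupies ADR, MEM, and WB in consecutive cycles and that the branch condition is resolved only in the load's WB cycle:

```python
# Rough timing model of the FIG. 5 stall (illustrative assumption:
# three load stages ADR/MEM/WB, branch resolved in the WB cycle,
# successors of the branch issued in the following cycle).

def branch_stall(load_issue_cycle, load_stages=3):
    resolve = load_issue_cycle + load_stages - 1   # WB cycle of the load
    successors_issue = resolve + 1                 # instructions (3) and (4)
    return resolve, successors_issue

resolve, issue = branch_stall(2)                   # the load issues in cycle 2
print(resolve, issue)  # 4 5
```

This matches the text: the branch is resolved in cycle 4, and instructions (3) and (4) issue in cycle 5.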
In the instruction issuing method described above, issuance of instructions to the functional units is stopped until it is determined whether a branch is generated, even in the case where no branch is ultimately generated. Consequently, blanks are generated in the pipelines, and instructions can not be executed at a high speed.