1. Field of the Invention
The present invention relates generally to superscalar processors and, more particularly, to a superscalar processor capable of directly transferring data used in a plurality of instructions executed in parallel between pipelines.
2. Description of the Background Art
The superscalar architecture is known as one of the architectures for increasing the processing speed of a microprocessor. In a microprocessor using a superscalar architecture, instructions which can be executed simultaneously are detected out of a given plurality of instructions, and the detected instructions are processed simultaneously or in parallel by a plurality of pipelines.
FIG. 7 is a block diagram of a superscalar processor illustrating the background of the present invention. Referring to FIG. 7, a superscalar processor 20 includes an instruction fetching stage 2 for fetching a plurality of instructions stored in an instruction memory 1, an instruction decoding stage 3 for decoding the instructions fetched in instruction fetching stage 2, function units 14 to 17 each having a pipeline structure, and a register file 9 for temporarily holding data used for executing the instructions. Function units 14 to 17 can access an external data memory 8 through a data bus 11. Register file 9 is implemented with a RAM and is accessed from function units 14 to 17.
Instruction fetching stage 2 includes a program counter (not shown) and gives an address signal generated by the program counter to instruction memory 1. The plurality of instructions designated by the given address signal are fetched and held in instruction fetching stage 2.
Instruction decoding stage 3 receives the plurality of instructions from instruction fetching stage 2 and decodes them. Simultaneously executable instructions are detected out of the given plurality of instructions by decoding the instructions. In addition, instruction decoding stage 3 relays data between function units 14 to 17 and register file 9. Specifically, instruction decoding stage 3 reads data to be used by function units 14 to 17 for executing the given instructions from register file 9 and gives the read data to function units 14 to 17.
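The detection step described above can be pictured with a minimal sketch. The text does not state the criterion used by instruction decoding stage 3, so the sketch assumes only one rule: an instruction cannot be issued together with an earlier instruction whose result register it reads. The `Instr` encoding, the register names, and the function `issue_group` are hypothetical and introduced purely for illustration; other hazards, such as two instructions writing the same register or competing for the same function unit, are ignored.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dest: str               # register written (hypothetical encoding)
    srcs: Tuple[str, ...]   # registers read


def issue_group(window: List[Instr]) -> List[Instr]:
    """Return the leading instructions of `window` that can be issued together.

    Assumption: only read-after-write register dependencies are checked.
    """
    group: List[Instr] = []
    written = set()
    for ins in window:
        # Stop at the first instruction that reads a result produced
        # by an instruction already placed in this group.
        if any(src in written for src in ins.srcs):
            break
        group.append(ins)
        written.add(ins.dest)
    return group


window = [
    Instr("i1", "r1", ("r2", "r3")),
    Instr("i2", "r4", ("r5",)),
    Instr("i3", "r6", ("r1",)),  # reads r1, which i1 produces
]
print([ins.name for ins in issue_group(window)])  # ['i1', 'i2']
```

In this sketch the third instruction is held back because it depends on the first, which is the kind of situation examined with reference to FIG. 8B.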
Each of function units 14 to 17 has a pipeline structure. Specifically, superscalar processor 20 has four pipelines implemented with four function units 14 to 17.
The four function units 14 to 17 perform predetermined operations, for example, as follows. Function units 14 and 15 perform integer arithmetic operations. Function unit 16 carries out loading of data from and storing of data into data memory 8. Function unit 17 performs floating-point arithmetic operations. Each of function units 14 and 15 includes an execution stage (EXC) and a write back stage (WB) to register file 9. Function unit 16 includes an address processing stage (ADR), a memory access stage (MEM), and a write back stage (WB). Function unit 17 includes three execution stages (EX1, EX2, EX3) and a write back stage (WB). Generally, the execution stages perform arithmetic operations and address calculations, whereas the memory access stage performs reading from and writing into data memory 8.
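The stage composition of the four function units can be summarized as a small table. The following is only an illustrative sketch: the dictionary `FUNCTION_UNITS` and the helper `pipeline_depth` are hypothetical names, and the keys simply reuse the reference numerals of FIG. 7.

```python
# Stage lists taken from the description of function units 14 to 17.
# The data layout itself is an assumption made for illustration.
FUNCTION_UNITS = {
    14: ("integer", ("EXC", "WB")),
    15: ("integer", ("EXC", "WB")),
    16: ("load/store", ("ADR", "MEM", "WB")),
    17: ("floating-point", ("EX1", "EX2", "EX3", "WB")),
}


def pipeline_depth(unit: int) -> int:
    """Number of stages a function unit runs after instruction decoding."""
    return len(FUNCTION_UNITS[unit][1])


for unit, (kind, stages) in FUNCTION_UNITS.items():
    print(unit, kind, " -> ".join(stages))
```

Every function unit ends with a write back stage (WB), which is the point at which a result reaches register file 9.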
Superscalar processor 20 operates in response to externally applied two-phase non-overlap clock signals φ1 and φ2. Specifically, instruction fetching stage 2, instruction decoding stage 3, and various stages in function units 14 to 17 are operated in response to clock signals φ1 and φ2 under the control of pipelines. An example of two-phase non-overlap clock signals is illustrated in FIG. 6.
In operation, instruction decoding stage 3 detects simultaneously executable instructions out of the given plurality of instructions and gives the detected instructions to function units 14 to 17 (or, depending on the circumstances, to only some of function units 14 to 17). Because each of function units 14 to 17 has a pipeline structure, they can execute the given instructions simultaneously or in parallel.
Now, it is assumed that a superscalar processor has three function units (pipelines), and each function unit has an execution stage (EXC), a memory access stage (MEM), and a write back stage (WB). An example of the progress of pipeline processing in this case is illustrated in FIG. 8A. Referring to FIG. 8A, it is assumed that three pipelines PL1, PL2, and PL3 execute instructions 1, 2, and 3, respectively. In pipeline PL1, processing in instruction fetching stage 2 is performed in a period T1, and processing in instruction decoding stage 3 is performed in a period T2. Processing in the execution stage, the memory access stage, and the write back stage is executed in periods T3, T4, and T5, respectively. On the other hand, in pipeline PL2, processing in instruction fetching stage 2 is started in period T2. The subsequent stages (ID, EXC, MEM, WB) are performed in periods T3 to T6, respectively, as in pipeline PL1. In pipeline PL3, after processing in instruction fetching stage 2 is started in period T3, processing in the respective stages is performed in periods T4 to T7. As seen from FIG. 8A, each of pipelines PL1 to PL3 executes a corresponding one of the given instructions 1 to 3, so that the respective stages proceed simultaneously and in parallel. However, a problem arises from the viewpoint of the time required for processing in the following case.
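Before turning to that case, the staggered timing of FIG. 8A described above can be reproduced with a short sketch. It assumes that each stage occupies exactly one period and that no pipeline is ever held up; the names `STAGES` and `schedule` are hypothetical and used only for illustration.

```python
# One period per stage: fetch, decode, execute, memory access, write back.
STAGES = ("IF", "ID", "EXC", "MEM", "WB")


def schedule(start_period: int) -> dict:
    """Map each stage to the period it occupies when nothing stalls."""
    return {stage: start_period + i for i, stage in enumerate(STAGES)}


# PL1 starts fetching in T1, PL2 in T2, PL3 in T3, as in FIG. 8A.
pipelines = {"PL1": schedule(1), "PL2": schedule(2), "PL3": schedule(3)}
for name, sched in pipelines.items():
    print(name, " ".join(f"{stage}=T{period}" for stage, period in sched.items()))
```

Under these assumptions the three write back stages fall in periods T5, T6, and T7, so the three instructions complete in consecutive periods.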
Referring to FIG. 8B, it is assumed that two instructions 11 and 12 are given, and they are processed by pipelines PL1 and PL2, respectively. In addition, it is assumed that the data obtained as a result of executing instruction 11 is used in the processing of instruction 12. In other words, instruction 12 performs its own processing using the data obtained by executing instruction 11.
Conventionally, instruction 11 is executed and terminated first in such a case. Specifically, in pipeline PL1, instruction fetching stage 2 is executed in period T1, and instruction decoding stage 3 is executed in period T2. The execution stage, the memory access stage, and the write back stage are executed in periods T3, T4, and T5, respectively. Data obtained by executing instruction 11 is first stored in register file 9 illustrated in FIG. 7 by execution of the write back stage. On the other hand, in pipeline PL2, instruction fetching stage 2 is executed in period T2, and instruction decoding stage 3 is executed in period T3. However, execution of instruction 12 is stopped in periods T4 and T5. The reason for this is that instruction 12 uses the data obtained by executing instruction 11 as described above, so that it must wait for termination of execution of instruction 11. Accordingly, processing in pipeline PL2 is stopped until the write back stage in pipeline PL1 is terminated in period T5. In other words, pipeline PL2 is brought to a standby state (pipeline interlock) in periods T4 and T5.
After period T5, the data obtained by executing instruction 11 is stored in register file 9. Therefore, execution of instruction 12 is restarted in pipeline PL2 in period T6. Specifically, after instruction decoding stage 3 is executed in period T6, the execution stage, the memory access stage, and the write back stage are executed in periods T7 to T9, respectively.
As described above, the data obtained by executing instruction 11 is first written into register file 9, and register file 9 is then accessed in the processing of the other instruction 12. In other words, the data obtained by processing in pipeline PL1 is given to the other pipeline PL2 through register file 9. However, as illustrated in FIG. 8B, although the data obtained by executing instruction 11 is already available from processing in the execution stage in period T3, transmission of data between the two pipelines PL1 and PL2 is performed through register file 9, so that pipeline PL2 must wait for termination of execution of the write back stage in pipeline PL1. As a result, a long time is required for completing execution of the instructions. In other words, the processing speed of the superscalar processor is reduced.
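The cost of the interlock in FIG. 8B can be worked out with a small sketch. It assumes one period per stage and models the dependent instruction as repeating its decoding stage in the period immediately after the write back stage of the producing instruction, which matches the periods given above. The function names `schedule` and `schedule_dependent` are hypothetical and introduced only for illustration.

```python
STAGES = ("IF", "ID", "EXC", "MEM", "WB")


def schedule(start_period):
    """Stage-to-period map for an instruction that never stalls."""
    return {stage: start_period + i for i, stage in enumerate(STAGES)}


def schedule_dependent(start_period, producer):
    """Schedule an instruction that needs the producer's result from the register file.

    Assumption: decoding is repeated one period after the producer's WB,
    and the pipeline stands by in every period in between.
    """
    sched = {"IF": start_period, "ID": start_period + 1}
    redo_id = producer["WB"] + 1
    stalled = list(range(sched["ID"] + 1, redo_id))  # standby periods
    sched["ID (repeated)"] = redo_id
    for i, stage in enumerate(("EXC", "MEM", "WB"), start=1):
        sched[stage] = redo_id + i
    return sched, stalled


instr11 = schedule(1)                              # pipeline PL1
instr12, stalled = schedule_dependent(2, instr11)  # pipeline PL2
lost = instr12["WB"] - schedule(2)["WB"]           # versus an independent instruction

print("instruction 12:", instr12)
print("standby periods:", [f"T{t}" for t in stalled])  # ['T4', 'T5']
print("periods lost:", lost)                            # 3
```

Under these assumptions instruction 12 completes in period T9 instead of period T6, even though the result of instruction 11 already exists at the end of period T3.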