1. Field of the Invention
The present invention relates generally to operational processing units and, particularly, to operational processing units for executing instructions in a pipeline scheme. More particularly, the present invention relates to a reduced instruction set computer (RISC) employing the pipeline scheme.
2. Description of the Background Art
Most of the execution time of a program in a computer is spent on very simple instructions such as "Load", "Store", "Branch On Condition", and "Add". Many of the complicated control circuits in a computer exist for processing an instruction requiring a plurality of cycles or an operand in a memory extending over a page boundary. The processing speed can be increased if infrequently-used complicated operations are not treated. Therefore, a reduced instruction set computer (RISC), which processes only frequently-used simple instructions, has been developed.
RISC adopts a load/store architecture in which a memory is accessed only by "Load" instruction and "Store" instruction. All arithmetic operation instructions and logical operation instructions are executed using data stored in internal registers. Therefore, the RISC has a multiplicity of registers and includes a register file as general-purpose registers. The RISC generally has the following characteristics:
(1) Execution of an instruction in 1 machine cycle;
(2) The lengths of all the instructions are the same (typically 32 bits), with a simple fixed format;
(3) The memory is accessed only by "load" and "store" instructions, and the remaining instructions are executed with reference to the registers;
(4) Pipeline processing, i.e., processing of several instructions at the same time; and
(5) Hand-over of a function to software, i.e., the characteristics for enhancing the performance are realized by hardware, and a complicated function is assigned to software.
The following two-instruction sequence, referred to later in this description, serves as an example:
(a) load 1r0, (1r1)
(b) and 1r3, 1r2, 1r0
The most important points for enhancing the performance of an operational processing unit are single cycle execution (execution of an instruction in 1 machine cycle) and reduction of the machine cycle as much as possible. The above-mentioned characteristic that operations are carried out only on data in the registers, and that the memory is accessed only by the "load/store" instructions, is adopted for single cycle execution. An instruction of the simple, fixed format reduces the instruction decoding time and shortens the machine cycle. The hand-over of a function to software means that the compiler is responsible for complicated functions. The optimization function of the compiler can also rearrange the sequence of instructions to suit the pipeline.
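The load/store discipline described above can be illustrated with a short sketch. The sketch below (register names, mnemonics, and the helper function are hypothetical illustrations, not taken from any particular RISC) expands a memory-to-memory addition, which a load/store architecture cannot express as a single instruction, into two loads, a register-only add, and a store.

```python
# Hypothetical sketch of the load/store discipline: mem[dst] = mem[a] + mem[b]
# must be decomposed because arithmetic operates only on registers.
# Register names (r1, r2, r3) and mnemonics are illustrative assumptions.

def expand_mem_add(dst_base, a_base, b_base):
    """Expand a memory-to-memory add into load/store-style instructions."""
    return [
        f"load r1, ({a_base})",      # memory is accessed only by "load"...
        f"load r2, ({b_base})",
        "add r3, r1, r2",            # ...arithmetic uses registers only...
        f"store r3, ({dst_base})",   # ...and by "store"
    ]

for insn in expand_mem_add("rD", "rA", "rB"):
    print(insn)
```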
The RISC employs a pipeline control scheme for executing an instruction at high speed. There are several kinds of structures of the pipeline, and different RISCs include different pipeline structures.
FIG. 1 is a diagram showing an example of a general functional structure of the RISC. In FIG. 1, the RISC includes an instruction memory 1, including a cache memory, for example, for storing instructions, a register file 2 including a plurality of registers for temporarily storing data, a data memory 3 for storing data, and five pipeline stages 4 to 8.
The pipeline stages include an instruction fetch stage 4 for fetching an instruction from instruction memory 1, an instruction decode stage 5 for decoding the instruction fetched by instruction fetch stage 4, an execution stage 6 for executing the instruction decoded by instruction decode stage 5, a memory access stage 7 for accessing data memory 3 when the instruction decoded in instruction decode stage 5 is a memory access instruction, and a write back stage 8 for writing back the execution result of the operation instruction and the load data from data memory 3 into a corresponding register in register file 2.
Instruction memory 1 and data memory 3 each include a cache memory or the like. Instruction fetch stage 4 fetches a corresponding instruction from instruction memory 1 according to the output of a program counter (not shown) and supplies the same to instruction decode stage 5. Instruction decode stage 5 decodes a supplied instruction and reads out the contents of a corresponding register in register file 2. If the fetched instruction can be executed in the next execution stage 6, instruction decode stage 5 dispatches the decoded instruction to execution stage 6.
The instructions are executed in parallel according to the pipeline in the RISC. In some cases, there is data dependency between the instructions; for example, an operation result may be utilized by the next operation instruction. In this case, instruction decode stage 5 generally delays dispatching of the instruction to execution stage 6 until the decoded instruction can be executed.
Execution stage 6 executes the supplied instruction if the decoded instruction is an operation instruction. If the decoded instruction is a branch instruction, execution stage 6 makes a determination on the branch condition. If the decoded instruction is a memory access instruction (load or store instruction), execution stage 6 calculates the effective address of data memory 3 and supplies the address to memory access stage 7.
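The behavior of execution stage 6 for each instruction class can be sketched as follows. The dict-based instruction encoding, the opcode names, and the zero-test branch condition below are simplified illustrative assumptions, not the patent's instruction format.

```python
# Behavioral sketch of execution stage 6: perform the operation for an
# operation instruction, determine the condition for a branch, or compute
# the effective address for a memory access (load/store) instruction.
# Encoding and opcode names are illustrative assumptions.

def execute(insn, regs):
    op = insn["op"]
    if op in ("add", "and", "or"):                  # operation instruction
        a, b = regs[insn["src1"]], regs[insn["src2"]]
        return {"add": a + b, "and": a & b, "or": a | b}[op]
    if op == "branch":                              # branch-on-condition
        return regs[insn["src1"]] == 0              # condition determination
    if op in ("load", "store"):                     # memory access instruction
        return regs[insn["base"]] + insn["offset"]  # effective address
    raise ValueError(f"unknown opcode: {op}")

regs = {"1r1": 100, "1r2": 12}
print(execute({"op": "add", "src1": "1r1", "src2": "1r2"}, regs))  # 112
print(execute({"op": "load", "base": "1r1", "offset": 8}, regs))   # 108
```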
Memory access stage 7 accesses data memory 3 including a cache memory, for example, according to the address from execution stage 6 and executes write/read of data.
The RISC operates according to a two-phase, non-overlapping clock (a T clock and an L clock, as will be described later). The RISC is pipelined and fetches a new instruction in each clock cycle. The RISC shown in FIG. 1 requires five cycles for completing execution of one instruction. However, it is pipelined so as to be able to start a new instruction in each clock cycle; the new instruction is started before the present instruction is completed.
FIG. 2 shows the pipeline operation. In FIG. 2, an instruction #1 to an instruction #3 pass through the instruction fetch stage (IF), the instruction decode stage (ID), the instruction execution stage (EXC), the memory access stage (MEM), and the write back stage (WB). The instruction #2 is fetched in cycle 2 where the instruction #1 is in the stage of instruction decoding. The instruction #3 is fetched in cycle 3 where the instruction #2 is decoded. The instruction #5 is fetched in cycle 5 where the instruction #1 is written back. In this way, the instructions are executed in parallel, so that, as a whole, one instruction can be effectively executed in one machine cycle.
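The timing of FIG. 2 can be captured in a small sketch. Stage names follow FIG. 2 and cycle numbering is 1-indexed as in the description above; the helper functions themselves are illustrative assumptions. Instruction i enters IF in cycle i and reaches WB in cycle i + 4, yet in the steady state one instruction completes per machine cycle.

```python
# Sketch of the pipeline timing of FIG. 2: five stages, one new instruction
# started per machine cycle. Instruction and cycle numbers are 1-indexed.

STAGES = ["IF", "ID", "EXC", "MEM", "WB"]

def stage_of(instr, cycle):
    """Return the stage occupied by instruction `instr` in `cycle`, or None."""
    idx = cycle - instr            # 0 -> IF ... 4 -> WB
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

def cycles_to_complete(n):
    """Total cycles for n instructions in an undisturbed pipeline."""
    return n + len(STAGES) - 1

print(stage_of(2, 2))              # instruction #2 is fetched in cycle 2
print(stage_of(1, 2))              # while instruction #1 is being decoded
print(cycles_to_complete(3))       # 3 instructions finish in 7 cycles
```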
In instruction execution under the pipeline scheme, the next instruction is started before the present instruction is completed. Accordingly, data dependency may arise between instructions. In many cases, a determination is made in instruction decode stage 5 as to whether there is data dependency between the instructions. If there is dependency between the instructions, the next instruction cannot be executed until the dependency is eliminated, causing a disturbance in the pipeline. The dependency between the instructions will now be described.
As shown in FIG. 3, the RISC operates in response to the two-phase clock signal, i.e., T clock and L clock. The T clock phase and the L clock phase make one machine cycle. In execution stage 6, the operation is performed in T clock phase and the operation result is transmitted on a bus in L clock phase. In the case of a branch-on-condition instruction, the branch target address is calculated by an adder built in a program counter contained in instruction fetch stage 4 in T clock phase of execution stage 6.
In the case of a load or store instruction, execution stage 6 calculates an effective address in T clock phase and transmits the effective address to an address pin in L clock phase. In memory access stage 7, this address is transmitted to data memory 3 in T clock phase and data is written in or read out from data memory 3 in L clock phase. The execution result (operation result or load data) of the instruction is written into the register file in T clock phase by write back stage 8.
There are cases where an instruction is a conditional branch instruction, where an instruction utilizes the execution result of the previous instruction, and where an instruction utilizes data read out from the data memory; in these cases one instruction depends on another. Now consider the case of a load instruction.
The data loaded from data memory 3 is not valid until the memory access cycle is finished in memory access stage 7. Accordingly, this data loaded from data memory 3 cannot be utilized in the execution stage of the next instruction. Now specifically consider the instructions (a) and (b) shown above.
The instruction (a) commands that the data stored in data memory 3 at the address held in register 1r1 of register file 2 be read out and loaded into register 1r0 of register file 2.
The instruction (b) commands that the logical product of the data stored in registers 1r2 and 1r0 of register file 2 be taken and the result thereof be stored in register 1r3 of register file 2.
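The read-after-write dependency between the instructions (a) and (b) amounts to a simple register comparison, sketched below. The function name and the tuple-of-register-names encoding are illustrative assumptions.

```python
# Sketch of the dependency between (a) and (b): the AND instruction reads
# register 1r0, which the preceding load has not yet written.

def raw_hazard(prev_dest, cur_sources):
    """True if the current instruction reads the previous destination."""
    return prev_dest in cur_sources

load_dest = "1r0"                   # (a) load 1r0, (1r1)   writes 1r0
and_sources = ("1r2", "1r0")        # (b) and 1r3, 1r2, 1r0 reads 1r2 and 1r0
print(raw_hazard(load_dest, and_sources))   # True: (b) depends on (a)
```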
As shown in FIG. 4, the instruction (a) (load instruction) requires five cycles for completion thereof. The contents of register 1r0 are not determined until the write back stage WB of this instruction (a) (load instruction) is completed. Normally, in register file 2, data is written in T clock phase and data is read out in L clock phase. Instruction decode stage 5 and write back stage 8 can access register file 2. Execution stage 6 cannot access register file 2.
If the instruction (b) simply waits until the result of the instruction (a) (load instruction) is written into register 1r0 of register file 2 in the write back stage (cycle 5), the instruction (b) must read out the contents of register 1r0 of register file 2 in the sixth cycle and dispatch the same to execution stage 6 in FIG. 4. Accordingly, execution of the instruction (b) (AND instruction) is delayed by three machine cycles. As a result, slots are caused in the pipeline, decreasing the speed of instruction processing.
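The three-cycle penalty can be verified by counting cycles, as in the sketch below. Cycle numbering follows FIG. 4, with the load issued in cycle 1; the helper names are hypothetical.

```python
# Cycle-count sketch of the penalty without a bypass: the load writes back
# in cycle 5, so the dependent AND rereads 1r0 in cycle 6 and executes in
# cycle 7, instead of cycle 4 in an undisturbed pipeline.

PIPE_DEPTH = 5                                      # IF, ID, EXC, MEM, WB

def exec_cycle_without_bypass(load_issue_cycle):
    wb_cycle = load_issue_cycle + PIPE_DEPTH - 1    # load's write back stage
    id_reread = wb_cycle + 1                        # dependent ID rereads 1r0
    return id_reread + 1                            # dependent EXC cycle

normal_exec = 4                  # EXC of the instruction issued in cycle 2
delay = exec_cycle_without_bypass(1) - normal_exec
print(delay)                     # three machine cycles of delay
```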
In order to minimize the occurrence of slots in the pipeline due to such dependency of one instruction on another, hardware called a bypass architecture or a forwarding architecture is provided in some cases. In this architecture, when an instruction has reached the execution stage and the processor is ready to execute it but its operand data is not yet available, the necessary operand data is supplied to the arithmetic and logic unit over another path in order to minimize the halt in instruction execution. The bypass architecture will be briefly described in the following.
There are two kinds of bypass architectures (or forwarding architectures), i.e., a load bypass architecture for data loaded from data memory 3 and a result bypass architecture for the result operated in the arithmetic and logic unit.
FIG. 5 is a diagram schematically showing a structure of the bypass architecture. In FIG. 5, a bypass logic 11 is provided in the data bus, for controlling the data transfer between a register file 10 and an arithmetic and logic unit 12. Register file 10 includes not only a general-purpose register (register file 2 in FIG. 1) but also a register for pipeline bypass and an I/O register for temporarily storing input/output data of data memory 3. Data of register file 10 is read out to a first source bus Src1 and a second source bus Src2 according to first and second source operands contained in the instruction. Writing back of data into register file 10 is carried out through bypass logic 11.
Bypass logic 11 includes a latch for latching the load data from data memory 3 and the operation result (data on the result bus "result") from arithmetic and logic unit 12. Bypass logic 11 compares two register sources (source operands) of the current instruction with the destination operand of the preceding instruction and makes a determination as to whether a bypass operation is needed. If it is determined that the bypass operation is needed, bypass logic 11 transmits the latched data onto source bus Src1 or Src2 without reading out the same from register file 10. Bypass logic 11 also carries out the bypass operation if the current instruction needs the operation result of the preceding instruction. Only the bypass operation of the load data will be described in the following. The load data latched by bypass logic 11 is written back in a corresponding register in the write back cycle if it is necessary to write back the same into the corresponding register within register file 10.
FIG. 6 is a diagram showing the structure of the bypass logic more specifically. The structure of the bypass architecture is described, for example, in M. Horowitz et al.: "MIPS-X: A 20-MIPS Peak, 32-bit Microprocessor with On-Chip Cache", IEEE Journal of Solid-State Circuits, vol. SC-22, No. 5, October 1987, pp. 790-797.
In FIG. 6, bypass logic 11 includes a register latch 111 for temporarily storing a first source operand (source 1) of the current instruction, a register latch 112 for temporarily storing a second source operand (source 2) of the current instruction, a register 113 for storing a destination operand (destination) of the preceding instruction, a comparator 110 for comparing the contents of register latches 111 and 112 with the contents of register 113, and a selection circuit 114 responsive to the output of comparator 110 for transmitting data (latch data) of an I/O register 101 contained in register file 10 to source bus Src1 or Src2. In practice, bypass logic 11 generally compares the source operands of the current instruction with the destinations of the two preceding instructions, so that two registers 113 are provided. However, only one destination register is shown in FIG. 6 in order to simplify the description.
I/O register 101 latches the data loaded from data memory 3 and temporarily stores data to be written into data memory 3 at the time of storing. The load data of I/O register 101 is written into a corresponding register of register file 2 by write back stage 8. If the contents of the corresponding register of register file 2 are updated by an operation before the writing of the data, the load data latched by I/O register 101 is discarded and is not written into the corresponding register. A brief description will be made below of the operation.
The source 1 and the source 2 of the source operands contained in the fetched instruction are stored into register latches 111 and 112 by instruction decode stage 5. The destination operand of the preceding instruction is also stored in register 113 by instruction decode stage 5. Comparator 110 compares the source operands stored in register latches 111 and 112 with the operand stored in register 113. If a coincidence therebetween is detected, comparator 110 generates a control signal to selection circuit 114. Selection circuit 114, in response to the control signal from comparator 110, transmits the data being latched in I/O register 101 onto source bus Src1 or Src2 corresponding to the source operand for which the coincidence is detected.
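The comparator and selection circuit just described behave as in the following sketch. Register names and data values are illustrative assumptions, and the real bypass logic is combinational hardware, not software.

```python
# Behavioral sketch of FIG. 6: if a source operand of the current instruction
# matches the preceding destination, the data latched in I/O register 101 is
# driven onto that source bus instead of the register-file contents.

def read_sources(src1, src2, prev_dest, io_latch, register_file):
    """Return the values placed on source buses Src1 and Src2."""
    bus1 = io_latch if src1 == prev_dest else register_file[src1]
    bus2 = io_latch if src2 == prev_dest else register_file[src2]
    return bus1, bus2

# The preceding load targets 1r0; its data (0b1010) is still only in the I/O
# register, while the register file still holds a stale value for 1r0.
regs = {"1r0": 0, "1r2": 0b1100}
src1_val, src2_val = read_sources("1r2", "1r0", "1r0", 0b1010, regs)
print(bin(src1_val & src2_val))    # the AND sees the bypassed value: 0b1000
```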
The source buses are coupled to arithmetic and logic unit 12 as shown in FIG. 5. Accordingly, the data loaded in the memory access cycle is bypassed to arithmetic and logic unit 12 included in execution stage 6 without being stored in register file 2. As a result, in practice, slots in the pipeline can be reduced compared with the case where the load data is written into register file 2 and then the data is read out again.
Even if such a bypass architecture is employed, however, the slots in the pipeline cannot be completely eliminated when an operation instruction executed after a load instruction uses the data loaded by that load instruction. That is, as shown in FIG. 7, the data loaded by the load instruction has been latched into I/O register 101 only when the memory access cycle MEM of cycle 4 is completed. Even if the data latched into I/O register 101 is bypassed to arithmetic and logic unit 12, the operation instruction ((b) in FIG. 7) can use the determined operand data and be carried out only in cycle 5 at the earliest. Accordingly, in cycle 4, a slot (pipeline interlock) is caused in the pipeline of the operation instruction following the load instruction, reducing the processing speed.
The bypass logic is contained in instruction decode stage 5. Therefore, in cycle 5, instruction decode stage 5 does not carry out decoding again; execution of the operation is carried out by execution stage 6.
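The residual one-cycle interlock can be counted as in the sketch below (cycle numbering follows FIG. 7, with the load issued in cycle 1; the helper name is hypothetical).

```python
# Sketch of the residual load-use interlock with the bypass of FIG. 5: the
# loaded data is latched only at the end of the load's MEM cycle, so an
# immediately following dependent operation executes one cycle late.

def interlock_cycles(load_issue_cycle):
    mem_done = load_issue_cycle + 3       # MEM is the 4th stage of the load
    earliest_exec = mem_done + 1          # bypassed data usable next cycle
    normal_exec = load_issue_cycle + 3    # undisturbed EXC of next instruction
    return earliest_exec - normal_exec

print(interlock_cycles(1))                # one slot in the pipeline
```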
As stated above, in order to cope with a case where a slot is caused in the pipeline due to data dependency between instructions, such as an operation instruction following a load instruction on whose data it depends, a determination is made in instruction decode stage 5 as to whether the instruction can be executed, and the pipeline is stalled (supply of the instruction to execution stage 6 is delayed) according to the result of the determination. Alternatively, a "NOP" instruction is interposed in advance, by a compiler or the like, between two instructions one of which depends on the other, so that such a pipeline stall is not caused.
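The software alternative mentioned above can be sketched as a compiler-style pass that fills the load-use slot with a "NOP". The tuple encoding of instructions is an illustrative assumption.

```python
# Sketch of NOP interposition: scan the program and place a NOP between a
# load and an immediately following instruction that reads the loaded
# register, so the hardware never has to stall.

def insert_nops(program):
    """program: list of (opcode, dest, sources) tuples."""
    out = []
    for insn in program:
        if out:
            prev_op, prev_dest, _ = out[-1]
            if prev_op == "load" and prev_dest in insn[2]:
                out.append(("nop", None, ()))   # fill the load-use slot
        out.append(insn)
    return out

prog = [("load", "1r0", ("1r1",)),          # (a) load 1r0, (1r1)
        ("and", "1r3", ("1r2", "1r0"))]     # (b) and 1r3, 1r2, 1r0
print([op for op, _, _ in insert_nops(prog)])   # ['load', 'nop', 'and']
```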
Stalling of the pipeline, however, reduces the processing speed, and interposing the idle instruction "NOP", which commands no operation, also decreases the processing speed.