1. Field of the Invention
This invention pertains to a central processing unit for use in a computer system, and more particularly, to a microprocessor having a pipeline controller.
2. Description of the Related Arts
Recent processors operating on a RISC (Reduced Instruction Set Computer) basis have, without exception, a pipeline architecture. Such a processor is called a RISC pipeline processor.
A data hazard, a structural hazard and a control hazard are the three factors that prevent the performance of a RISC pipeline processor from being improved, and they must be eliminated to expedite processor operation.
It is believed that, of the functions peculiar to a RISC pipeline processor, a function called a delayed branch is effective in eliminating a control hazard.
A pipeline process is such that instructions stored at contiguous addresses in a memory are inputted continuously to a pipeline. Hence, after a branch instruction is inputted to a pipeline, instructions executed when the branch instruction does not generate a branch are inputted continuously. And when the branch instruction generates a branch, after instructions succeeding the branch instruction already inputted in the pipeline are canceled, branch target instructions are newly inputted to the pipeline. Accordingly, when the branch instruction generates a branch, a cancellation of instructions in a pipeline causes a problem of delaying a pipeline process.
A known solution to this problem is a function that generates a branch without canceling instructions already stored in a pipeline. It does so by inputting into the pipeline, after the branch instruction, instructions that were originally supposed to be executed before the branch instruction but that do not depend on the branch instruction.
Because this function causes other instructions to be executed after a branch instruction is executed and then branch target instructions based on a branch generated by the branch instruction to be executed, it appears that a branch instruction is delayed. Thus, this function is called a delayed branch function. Further, a branch instruction related to a delayed branch function is called a delayed branch instruction, and an instruction not dependent on a delayed branch instruction and executed after a delayed branch instruction is called a delay slot instruction.
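The ordering described above can be illustrated with a small sketch. This is not taken from the specification: the program format, the opcode names and the function `run` are hypothetical, and the model assumes a taken delayed branch whose delay slot instructions are not themselves branches.

```python
def run(program, delay_slots=1):
    """Return the addresses executed, in order, under delayed branch semantics.

    `program` is a hypothetical list of (opcode, argument) pairs; only the
    opcodes "branch" and "halt" are interpreted, everything else falls through.
    """
    if delay_slots < 1:
        raise ValueError("this sketch assumes at least one delay slot")
    trace = []
    pc = 0
    pending = None  # (remaining delay slots, branch target) after a branch
    while pc < len(program):
        op, arg = program[pc]
        trace.append(pc)
        if op == "halt":
            break
        next_pc = pc + 1
        if pending is not None:
            # This instruction is a delay slot instruction.
            remaining, target = pending
            remaining -= 1
            if remaining == 0:
                next_pc = target  # the branch takes effect only now
                pending = None
            else:
                pending = (remaining, target)
        elif op == "branch":
            pending = (delay_slots, arg)
        pc = next_pc
    return trace


program = [
    ("branch", 3),   # 0: delayed branch instruction, target address 3
    ("add", None),   # 1: delay slot instruction, executed before the branch takes effect
    ("sub", None),   # 2: fall-through path, never executed here
    ("halt", None),  # 3: branch target instruction
]
print(run(program))  # [0, 1, 3]
```

The trace shows the delay slot instruction at address 1 executing between the delayed branch instruction and the branch target instruction, so nothing already in the pipeline has to be canceled.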
The optimization function of a compiler executed on a RISC processor realizes a delayed branch function, namely an operation of inserting a delay slot instruction after a delayed branch instruction.
Here, when the execution of a delayed branch instruction for generating a branch has been completed, for example, in case of an exception event generated by a fault or an interruption, the execution of a delay slot instruction and a branch target instruction has not been completed yet. These instructions whose execution has not yet been completed do not have mutually contiguous storage addresses on a memory. Therefore, for executing an exception process for dealing with the generation of an exception event, and for properly resuming an original program, it is necessary to save in a stack the program counter values of both the branch target instruction and the delay slot instruction at the instant when the exception event is generated. Furthermore, the existence of a plurality of delay slot instructions requires their respective program counter values be saved in a stack.
Such a circumstance increases the amount of time necessary for a processor operation to shift from an original program to an exception process and from an exception process back to the original program. That is, because a stack is generally formed on a main storage memory, the more often the stack is accessed, each access being accompanied by a memory wait, the more time is spent on memory waits, thereby delaying the process. In addition, when the processor operation shifts from an exception process to an original program, it is more complex to retrieve a plurality of program counter values and to execute the instructions at the addresses they specify than to retrieve a single program counter value and to execute the instruction at the address it specifies, thus rendering hardware complicated.
The above problems are explained more concretely by referring to FIG. 1 through FIG. 4.
FIG. 1 is a first one of explanatory diagrams illustrating in a four-part series a conventional exception process executed for an exception event generated while a delay slot instruction is being executed.
FIG. 1 assumes a case in which a delayed branch instruction for generating a branch is executed, a delay slot instruction is executed next, and then the branch target instruction corresponding to the delayed branch instruction is executed, and in which the delay slot instruction is an ineffective instruction.
FIG. 2 is a second one of explanatory diagrams illustrating in a four-part series a conventional exception process executed for an exception event generated while a delay slot instruction is being executed.
FIG. 2 shows a step 1 of decoding a delay slot instruction in which a delay slot instruction is detected as an ineffective instruction.
FIG. 2 also shows a step 2 of saving in a stack the program counter values of a branch target instruction and a delay slot instruction.
FIG. 3 is a third one of explanatory diagrams illustrating in a four-part series a conventional exception process executed for an exception event generated while a delay slot instruction is being executed.
FIG. 3 shows a step 3 of executing an exception process handler corresponding to the generation of an ineffective instruction exception, which eliminates the cause of the exception event.
FIG. 4 is a fourth one of explanatory diagrams illustrating in a four-part series a conventional exception process executed for an exception event generated while a delay slot instruction is being executed.
FIG. 4 shows a step 4 of executing a delay slot instruction, after retrieving from the stack the program counter values of the branch target instruction and the delay slot instruction. Because the exception process handler has eliminated the cause of the exception event, the ineffective instruction exception does not generate another exception event.
FIG. 4 also shows a step 5 of executing a branch target instruction.
Thus, the execution of an exception process requires two program counter values, i.e. that of the branch target instruction and that of the delay slot instruction, to be saved in and retrieved from a stack before and after executing the exception process.
Although the above description relates to a problem arising in an exception process when an ineffective instruction exception is detected while a delay slot instruction is being executed, a similar problem arises when any cancellation type exception other than an ineffective instruction exception is detected while a delay slot instruction is being executed. A cancellation type exception is defined as an exception event by which: the execution of an instruction (a delay slot instruction in the above example) in which an exception event is detected is once canceled; and the execution of the instruction is resumed after the execution of an exception process handler.
In such a case, the operation of writing data into a register or a memory by a delay slot instruction is canceled, upon detecting an exception event. The execution of an exception process for a cancellation type exception also requires the program counter values of the delay slot instruction and branch target instruction immediately succeeding the delay slot instruction to be saved in a stack before executing the exception process, upon detecting an exception event.
As well, a problem similar to the one described above arises, when a completion type exception is detected while a delayed branch instruction is being executed. A completion type exception is defined as an exception event by which: the execution of an instruction (a delayed branch instruction in the above example) in which an exception event is detected is completed; and the execution of an instruction succeeding the instruction in which an exception event is detected is started after the execution of an exception process handler.
In such a case, the operation of writing data into a register or a memory by a delay slot instruction is completed, upon detecting an exception event. The execution of an exception process for a completion type exception also requires the program counter values of the delay slot instruction succeeding the delayed branch instruction and the branch target instruction to be executed immediately succeeding the delayed branch instruction to be saved in a stack before executing the exception process, upon detecting an exception event.
When a completion type exception is detected upon executing the last delay slot instruction (which is the delay slot instruction itself when there is only one delay slot instruction), such a problem as described above does not arise, because the program counter value of the branch target instruction to be executed immediately succeeding the delay slot instruction need only be saved in a stack.
When a cancellation type exception is detected upon executing a delayed branch instruction, such a problem as described above does not arise, either, because the program counter value of the delayed branch instruction need only be saved in a stack.
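The four cases discussed above can be summarized in a short sketch. It assumes a taken delayed branch with exactly one delay slot instruction; the function name `pcs_to_save` and its string arguments are hypothetical and serve only to tabulate which program counter values the text says must be saved in a stack.

```python
def pcs_to_save(exception_type, faulting, pc_branch, pc_slot, pc_target):
    """Program counter values to save in a stack for a taken delayed branch
    with a single delay slot instruction (hypothetical helper).

    exception_type: "cancellation" or "completion"
    faulting:       "delayed_branch" or "delay_slot"
    """
    if exception_type == "cancellation":
        # The faulting instruction is canceled and re-executed after the handler.
        if faulting == "delayed_branch":
            return [pc_branch]            # re-executing the branch regenerates the rest
        if faulting == "delay_slot":
            return [pc_slot, pc_target]   # two non-contiguous addresses
    elif exception_type == "completion":
        # The faulting instruction is completed; execution resumes at its successor.
        if faulting == "delayed_branch":
            return [pc_slot, pc_target]   # two non-contiguous addresses
        if faulting == "delay_slot":
            return [pc_target]            # only the branch target remains
    raise ValueError("unknown exception type or faulting instruction")


# Example addresses are arbitrary.
print(pcs_to_save("cancellation", "delay_slot", 0x100, 0x102, 0x200))
```

Two of the four combinations require two program counter values, which is the source of the extra stack accesses described above.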
FIG. 5 is a circuit diagram illustrating a partial configuration of a pipeline controller for use in a conventional pipeline processor.
More specifically, FIG. 5 shows the configuration of a program counter pipeline (a PC pipeline) of a pipeline controller.
Generally, a pipeline processor is such that an instruction fetch module, an instruction decode module, an instruction execute module, a memory access module and a write back module jointly form a pipeline. The modules execute their processing in parallel, each for a different instruction, in a single instruction cycle.
A program counter value that an adder 100 (hereafter referred to as a PC_adder 100) sequentially increments by two from a predetermined initial value flows successively through an instruction fetch stage program counter value retaining unit 101 (hereafter referred to as a PC_IF 101), an instruction decode stage program counter value retaining unit 102 (hereafter referred to as a PC_ID 102), an instruction execute stage program counter value retaining unit 103 (hereafter referred to as a PC_EX 103), a memory access stage program counter value retaining unit 104 (hereafter referred to as a PC_MA 104) and a write back stage program counter value retaining unit 105 (hereafter referred to as a PC_WB 105), in synchronization with a clock signal inputted via the four AND gates, the AND_IF 111, the AND_ID 112, the AND_EX 113 and the AND_MA 114, each having an inverter at the one of its two input terminals that does not receive the clock signal. More concretely, the PC_adder 100 sequentially increments a program counter value by two by feeding back to itself a program counter value outputted from the PC_IF 101.
As a result, each of an instruction fetch module, an instruction decode module, an instruction execute module, a memory access module and a write back module (none of which is shown) executes its process related to an instruction specified by the program counter value retained in the corresponding one of the program counter value retaining units, i.e. the PC_IF 101, the PC_ID 102, the PC_EX 103, the PC_MA 104 and the PC_WB 105.
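The flow of program counter values through the retaining units can be sketched as a shift register. This is a simplified model under the assumption that no wait signal is asserted; the function `clock`, the dictionary representation and the initial value 0x100 are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]  # one retaining unit per stage


def clock(pcs, step=2):
    """Advance the PC pipeline by one clock, assuming no wait is asserted."""
    new = dict(pcs)
    # Each retaining unit latches the value held by the unit upstream of it.
    for upstream, downstream in zip(STAGES, STAGES[1:]):
        new[downstream] = pcs[upstream]
    # The adder feeds back the IF-stage value incremented by two.
    new["IF"] = pcs["IF"] + step
    return new


# Arbitrary initial value; units that hold nothing yet are None.
pcs = {"IF": 0x100, "ID": None, "EX": None, "MA": None, "WB": None}
for _ in range(4):
    pcs = clock(pcs)
print({stage: hex(value) for stage, value in pcs.items()})
```

After four clocks each stage holds the program counter value of a different instruction, with the oldest value in the write back stage.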
The instruction fetch module fetches an instruction from a memory. The instruction decode module decodes a fetched instruction. The instruction execute module executes a decoded instruction. The memory access module accesses a memory for reading from or writing into a memory operand data. The write back module writes into a register contained in the inside of a processor chip operand data obtained from the memory by the memory access module or operand data processed by the instruction execute module.
The OR gates, i.e. the OR_IF 121, the OR_ID 122 and the OR_EX 123, as well as wait signals if_wait, id_wait, ex_wait and ma_wait will be described later in detail.
A program counter pipeline as described above is necessary for retaining the program counter values of the respective modules, e.g. when any of the respective modules generates an exception event.
A pipeline controller may also comprise an operand pipeline for transmitting to respective modules instruction operands, in addition to the program counter pipeline.
FIG. 6 is an explanatory diagram illustrating conventional pipeline processes without a memory wait.
More specifically, FIG. 6 shows how an instruction fetch module, an instruction decode module, an instruction execute module, a memory access module and a write back module (none of which are shown here) parallelly execute respective execution stages corresponding to instructions A through D in each of instruction cycles 1 through 8.
In instruction cycle 1, the instruction fetch module (not shown) executes an instruction fetch stage (hereafter referred to as an IF stage) for instruction A.
In instruction cycle 2, the instruction decode module (not shown) executes an instruction decode stage (hereafter referred to as an ID stage) for instruction A, and the instruction fetch module executes an IF stage for instruction B succeeding instruction A.
In instruction cycle 3, the instruction execute module (not shown) executes an instruction execute stage (hereafter referred to as an EX stage) for instruction A, the instruction decode module executes an ID stage for instruction B, and the instruction fetch module executes an IF stage for instruction C succeeding instruction B.
In instruction cycle 4, the memory access module (not shown) executes a memory access stage (hereafter referred to as an MA stage) for instruction A, the instruction execute module executes an EX stage for instruction B, the instruction decode module executes an ID stage for instruction C, and the instruction fetch module executes an IF stage for instruction D succeeding instruction C.
In instruction cycle 5, the write back module (not shown) executes a write back stage (hereafter referred to as a WB stage) for instruction A, the memory access module executes an MA stage for instruction B, the instruction execute module executes an EX stage for instruction C, and the instruction decode module executes an ID stage for instruction D. As well, the instruction fetch module becomes free to start executing an IF stage for instruction E (not shown) succeeding instruction D.
In instruction cycle 6, the write back module executes a WB stage for instruction B, the memory access module executes an MA stage for instruction C, and the instruction execute module executes an EX stage for instruction D. As well, the instruction fetch module becomes free to start executing an IF stage for instruction F (not shown) succeeding instruction E.
In instruction cycle 7, the write back module executes a WB stage for instruction C, and the memory access module executes an MA stage for instruction D. As well, the instruction fetch module becomes free to start executing an IF stage for instruction G (not shown) succeeding instruction F.
In instruction cycle 8, the write back module executes a WB stage for instruction D. As well, the instruction fetch module becomes free to start executing an IF stage for instruction H (not shown) succeeding instruction G.
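The cycle-by-cycle description above can be reproduced with a short sketch. It assumes five stages, one instruction entering per instruction cycle and no wait of any kind; the function `schedule` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def schedule(instructions):
    """Map each instruction cycle (1-based) to {stage: instruction}, no waits."""
    table = {}
    for i, name in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            # Instruction i enters IF in cycle i + 1 and advances one stage per cycle.
            table.setdefault(i + s + 1, {})[stage] = name
    return table


table = schedule("ABCD")
for cycle in sorted(table):
    print(cycle, table[cycle])
```

For instructions A through D the table spans instruction cycles 1 through 8, and cycle 5 holds the WB stage for A, the MA stage for B, the EX stage for C and the ID stage for D.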
The execution of an IF stage and an MA stage causes the above pipeline process to perform a data read/write operation to a memory. Such a memory could be either one contained in the inside of a processor chip or one connected to the outside of a processor chip, which are hereafter referred to as an internal memory and an external memory, respectively.
Because it is generally difficult to access an external memory at a time interval equivalent to a pipeline pitch (=an instruction cycle), a memory wait arises.
FIG. 7 is an explanatory diagram illustrating conventional pipeline processes with a memory wait.
Because the operation in case that an IF stage generates a memory wait is almost identical to the operation in case that an MA stage generates a memory wait, FIG. 7 shows a case in which an MA stage generates a memory wait, for the sake of explanatory simplicity. FIG. 7 further shows that IFw, IDw, EXw and MAw indicate a memory wait, upon executing an IF stage, an ID stage, an EX stage and an MA stage, respectively.
When a memory wait arises in the MA stage for instruction A, the MA stage supposed to end in instruction cycle 4 is repeated in instruction cycle 5 due to the memory wait. Because instruction cycle 4 aborts the execution of the MA stage for instruction A, instruction cycle 5 repeats the execution of respective stages in instruction cycle 4 for the successive instruction stream. That is, instruction cycle 5 executes the MA stage for instruction A, the EX stage for instruction B, the ID stage for instruction C and the IF stage for instruction D, which are begun to be executed but aborted in instruction cycle 4.
The structure shown in FIG. 5 realizes the above operation. A memory access wait signal ma_wait asserted at the timing shown in FIG. 7 turns off the four AND gates, the AND_MA 114, the AND_EX 113, the AND_ID 112 and the AND_IF 111, via the three OR gates, the OR_EX 123, the OR_ID 122 and the OR_IF 121. Since this prevents the clock signal from being received in instruction cycle 5, each of the PC_MA 104, the PC_EX 103, the PC_ID 102 and the PC_IF 101 maintains its status of instruction cycle 4.
Thus, the entire pipeline has a wait time having a duration equivalent to one instruction cycle, thereby delaying its currently executed processes by that amount of time.
This is evident from a comparison between FIG. 6 illustrating pipeline processes without a memory wait and FIG. 7 illustrating pipeline processes with a memory wait. Although instruction cycle 6 completes the execution e.g. of instruction B when a memory wait does not arise as shown in FIG. 6, instruction cycle 7, instead of instruction cycle 6, completes the execution of instruction B when a memory wait does arise as shown in FIG. 7.
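The comparison can be checked with a small model. It assumes that a wait in the MA stage of one instruction holds every succeeding instruction for the same number of instruction cycles; the function `completion_cycles` and the `ma_waits` mapping are hypothetical.

```python
def completion_cycles(instructions, ma_waits=None):
    """Return {instruction: instruction cycle in which its WB stage executes}.

    `ma_waits` maps an instruction name to the number of wait cycles inserted
    in its MA stage (hypothetical representation; default is no wait).
    """
    ma_waits = ma_waits or {}
    done = {}
    ma_start = 4  # the first instruction reaches MA in cycle 4 (IF=1, ID=2, EX=3)
    for name in instructions:
        ma_end = ma_start + ma_waits.get(name, 0)  # MA is repeated during waits
        done[name] = ma_end + 1                    # WB follows the last MA cycle
        ma_start = ma_end + 1                      # the next instruction enters MA
    return done


print(completion_cycles("ABCD"))            # no memory wait
print(completion_cycles("ABCD", {"A": 1}))  # one wait in the MA stage of A
```

Without a wait, instruction B completes in instruction cycle 6; with one wait in the MA stage for instruction A, it completes in instruction cycle 7.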
A memory wait arising in the IF stage asserts an instruction fetch wait signal if_wait, thereby turning off one AND gate, the AND_IF 111, via one OR gate, the OR_IF 121. This sets the PC_IF 101 in a wait status.
As well, a memory wait may arise in the ID stage or the EX stage.
When a wait arises in the ID stage, a multi-cycle wait generated by the execution of a multi-cycle instruction such as a multiplication instruction must be considered.
A memory wait arising in the ID stage asserts an instruction decode wait signal id_wait, thereby turning off two AND gates, the AND_IF 111 and the AND_ID 112, via one OR gate, the OR_ID 122. This sets the PC_IF 101 and the PC_ID 102 in a wait status.
When a wait arises in the EX stage, a load use interlock generated when a next instruction operates on an operand loaded from a memory must be considered.
A memory wait arising in the EX stage asserts an instruction execute wait signal ex_wait, thereby turning off three AND gates, the AND_IF 111, the AND_ID 112 and the AND_EX 113, via one OR gate, the OR_EX 123. This sets the PC_IF 101, the PC_ID 102 and the PC_EX 103 in a wait status.
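The effect of the four wait signals on the clock gating can be sketched as boolean logic. The exact wiring of FIG. 5 is not reproduced here: the chained arrangement of the OR gates below is an assumption chosen so that the result matches the behavior described in the text, and the function `clock_enables` is hypothetical.

```python
def clock_enables(if_wait=False, id_wait=False, ex_wait=False, ma_wait=False):
    """Return, per retaining unit, whether its gated clock is passed (True)
    or blocked (False), given the asserted wait signals."""
    # Assumed OR chain: a wait in a later stage also holds all earlier stages.
    or_ex = ex_wait or ma_wait   # corresponds to the OR gate for the EX stage
    or_id = id_wait or or_ex     # corresponds to the OR gate for the ID stage
    or_if = if_wait or or_id     # corresponds to the OR gate for the IF stage
    # Each AND gate passes the clock only when its inverted input is not asserted.
    return {
        "PC_IF": not or_if,
        "PC_ID": not or_id,
        "PC_EX": not or_ex,
        "PC_MA": not ma_wait,
    }


print(clock_enables(ma_wait=True))  # all four units hold their values
print(clock_enables(id_wait=True))  # only PC_IF and PC_ID hold their values
```

A wait in a given stage therefore freezes that stage and every stage before it, while the stages after it continue to receive the clock.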
FIG. 8 is an explanatory diagram illustrating conventional pipeline processes with a memory wait in the MA stage of every instruction.
More specifically, FIG. 8 shows a case in which the whole pipeline processes experience a conspicuous delay.
Generally, a processor chip internally generates an operand address for accessing an external memory, but a constant time lag exists until the external memory recognizes the operand address. The constant time lag is the sum of a signal transmission delay in a bus and a bus connection circuit within the processor chip, a signal transmission delay in an output buffer driver circuit, a signal processing delay in an output buffer circuit itself for driving an external terminal of the processor chip having a large capacitance, and a signal transmission delay in an address bus outside of the processor chip due to its wiring capacitance and resistance.
As well, another time lag exists due to a signal transmission delay from the instant at which the external memory outputs operand data until the instant at which the processor chip receives the operand data.
FIG. 9 is an explanatory diagram illustrating conventional pipeline processes with one wait, upon accessing an external address bus and an external data bus.
The processor chip must internally determine accessed operand data before the completion of the MA stage. Therefore, assuming the time lag when the processor chip accesses the external memory to be a half of the instruction cycle each for the operand address and the operand data, even if a memory wait (one wait) having a duration of one instruction cycle is inserted in the MA stage of each instruction for accessing the external memory, as shown in (a) and (b) of FIG. 9, the effective access time for the external memory is only one instruction cycle.
That is, as described earlier, because the above time lags make it difficult to access the external memory at a time interval of a pipeline pitch (= one instruction cycle), the effective access time for the external memory ends up being one instruction cycle even though one wait is inserted in the MA stage.
FIG. 10 is an explanatory diagram illustrating conventional pipeline processes with two waits, upon accessing an external address bus and an external data bus.
As explained in the description of FIG. 9, conventional pipeline processes require a further memory wait. For instance, by inserting two waits in the MA stage for each instruction for accessing the external memory, i.e. by extending the duration for the MA stage to three instruction cycles, as shown in (a) and (b) of FIG. 10, the effective access time for the external memory becomes two instruction cycles.
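The arithmetic behind these figures can be written out as a sketch. It assumes, as the text does, a time lag of half an instruction cycle each for the operand address and the operand data; the function `effective_access_time` and its parameter names are hypothetical.

```python
def effective_access_time(waits, address_lag=0.5, data_lag=0.5):
    """Effective external memory access time, in instruction cycles.

    The MA stage lasts one instruction cycle plus the inserted waits; the time
    lags for the operand address and the operand data are subtracted from it.
    """
    ma_duration = 1 + waits
    return ma_duration - address_lag - data_lag


print(effective_access_time(1))  # 1.0 instruction cycle with one wait
print(effective_access_time(2))  # 2.0 instruction cycles with two waits
```

Under these assumptions one full instruction cycle of every MA stage is consumed by the interface delays, so each wait inserted buys only as much effective access time as its own duration.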
Such a problem occurs not only in the MA stage but also in the IF stage.
To summarize the above, the prior art has the problem of increasing the processing delay of all the pipeline processes, because the signal delays in the interface between a processor chip and an external memory reduce the effective access time for accessing the external memory in pipeline processes, which necessitates an extra memory wait to compensate for the reduction.