1. Field of the Invention
The present invention relates to a data processor, specifically to a data processor including a pipeline processing mechanism which processes a jump instruction rapidly, and more particularly to a data processor capable of reducing the overhead of pipeline processing incurred when a jump instruction is executed, by performing jump processing in an initial pipeline stage.
2. Description of the Related Art
In a conventional data processor, processing is divided into a plurality of steps along the flow of data processing, and the steps of different instructions are processed simultaneously in the respective corresponding stages of a pipeline, whereby the mean processing time necessary for one instruction is shortened and the processing performance is improved as a whole.
However, when executing an instruction which disturbs the instruction processing sequence, such as a jump instruction, the instruction processing sequence is switched at the executing stage of the instruction, so the overhead of the pipeline processing increases and the pipeline processing cannot be performed efficiently. Moreover, jump instructions appear very frequently in practical programs, so increasing the processing speed of the jump instruction is one of the most important items for improving the performance of the data processor.
To improve the performance of the data processor, various strategies are taken to reduce the overhead of executing instructions such as the unconditional branch instruction and the conditional branch instruction. For example, a method has been proposed in which branch processing is performed by predicting the instruction flow at the instruction fetch stage by using a branch target buffer, which stores a branch instruction address and a branch target address as a set (J. F. K. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design", IEEE COMPUTER Vol. 17, No. 1, January 1984, pp. 6-22). However, in this method, since the improvement of the processing performance is largely dependent on the size of the branch target buffer, a large amount of hardware must be added to improve the performance drastically.
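As a rough illustration of the branch target buffer approach cited above, the following Python sketch models a buffer that stores branch instruction addresses paired with branch target addresses and is consulted with the fetch address. The class name, the direct-mapped organization and the entry count are assumptions made only for illustration and are not taken from the cited paper.

```python
class BranchTargetBuffer:
    """Hypothetical direct-mapped buffer of (branch address, target address) pairs."""

    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        # Each slot holds a (branch_address, target_address) pair or None.
        self.slots = [None] * num_entries

    def _index(self, address):
        # Assumed indexing scheme: low-order part of the address.
        return address % self.num_entries

    def record(self, branch_address, target_address):
        """Store the pair once the branch has been resolved."""
        self.slots[self._index(branch_address)] = (branch_address, target_address)

    def predict(self, fetch_address):
        """Return the stored target address on a hit, or None on a miss."""
        entry = self.slots[self._index(fetch_address)]
        if entry is not None and entry[0] == fetch_address:
            return entry[1]
        return None


btb = BranchTargetBuffer(num_entries=4)
btb.record(0x1000, 0x2000)
hit = btb.predict(0x1000)    # the same branch is fetched again: target is predicted
btb.record(0x1004, 0x3000)   # maps to the same slot and displaces the first entry
miss = btb.predict(0x1000)   # the earlier branch can no longer be predicted
```

With only four entries the second branch displaces the first, which is one way to see why the benefit of this method depends on the buffer size.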
As a method of increasing the processing speed of the branch instruction by adding a small amount of hardware, the inventors have proposed a method whereby the branch processing is performed by calculating the branch target address at the decoding stage (Yoshida et al., "The Gmicro/100 32-Bit Microprocessor", IEEE MICRO, Vol. 11, No. 4, pp. 20-23, 62-72, August 1991).
FIG. 1 is a block diagram showing a configuration of a jump instruction processing mechanism of a conventional data processor, which calculates a branch target address at the instruction decoding stage so as to perform jump processing as mentioned above.
In FIG. 1, numeral 351 designates an instruction fetch unit which fetches instructions from a memory (not shown), numeral 352 designates an instruction decoding unit which decodes the instructions taken in from the instruction fetch unit 351, and numeral 353 designates a program counter (PC) calculation unit which calculates and holds the instruction head address. Numeral 354 designates a latch (DPC) which holds a PC value of the instruction being decoded in the instruction decoding unit 352, numeral 355 designates a latch (TPC) which holds a head address value of an instruction code of the instruction being decoded in the instruction decoding unit 352, numeral 356 designates a PC adder which performs PC calculation and branch target address calculation, numerals 357, 358 and 359 designate input/output latches (PIA, PIB, PO) of the PC adder 356, and numerals 361, 362, 363 and 364 designate data transfer paths connecting the respective blocks, which are respectively a displacement bus, an instruction length bus, an instruction address bus (IA bus) and a PC adder output bus (PO bus).
The data processor comprising the jump instruction processing mechanism constructed as shown in FIG. 1 performs five-stage pipeline processing: an instruction fetch (IF) stage for fetching an instruction from a memory storing the instruction, an instruction decoding (D) stage for decoding the fetched instruction, an address calculation (A) stage for calculating an operand address according to the result of instruction decoding, an operand fetch (F) stage for pre-fetching the operand and reading and decoding a micro-instruction, and an executing (E) stage for executing the instruction.
Hereupon, to simplify the description, each unit of processing is assumed to be performed in one clock cycle in the respective pipeline stages. The instruction set being processed is a variable-length instruction set, and the instruction decoding unit 352 decodes one instruction by dividing it into one or plural decoding processing units.
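The five pipeline stages and the one-clock-cycle-per-stage simplification described above can be sketched as follows. This is a minimal model of an ideal, stall-free pipeline; the function names and the assumption that instruction i enters the IF stage in cycle i + 1 are introduced only for illustration.

```python
STAGES = ["IF", "D", "A", "F", "E"]  # stage order as listed in the description


def stage_schedule(num_instructions):
    """Map each clock cycle to {stage: instruction index} for a stall-free pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(STAGES):
            cycle = instr + depth + 1  # assumed: instruction i enters IF in cycle i + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(num_instructions):
    """Cycles needed to complete all instructions when no stage ever stalls."""
    if num_instructions <= 0:
        return 0
    return len(STAGES) + num_instructions - 1


schedule = stage_schedule(4)
busy_in_cycle_5 = schedule[5]   # which instruction occupies which stage in cycle 5
cycles_for_four = total_cycles(4)
```

In this model four instructions complete in eight cycles rather than twenty, which is the overlap that a jump instruction disturbs.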
The PC calculation unit 353 operates at the instruction decoding stage. In each decoding cycle, the head address of the decoding processing unit decoded immediately before, which is stored in the TPC 355, is taken into the PIA 357, and the processing code length outputted from the instruction decoding unit 352 is taken into the PIB 358 via the instruction length bus 362. In the PC adder 356, the value of the PIA 357 and the value of the PIB 358 are added, and the addition result is written back to the TPC 355 via the PO 359 and the PO bus 364. In the case where one instruction has been decoded, the addition result is also written back to the DPC 354 via the PO bus 364, so that the DPC 354 holds the PC value of the decoded instruction. In this way, in the conventional data processor including the jump instruction processing mechanism constructed as shown in FIG. 1, the instruction set is a variable-length instruction set, so the processing code length in a decoding cycle becomes clear only after the instruction decoding. From the viewpoint of timing, therefore, the PC calculation unit 353 calculates the head address of the instruction being decoded in the first decoding cycle of the respective instructions.
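The TPC and DPC update described above can be sketched in Python as follows. This is only one reading of the description: the class and attribute names are hypothetical, and the model assumes that the adder in each decoding cycle uses the head address and code length of the decoding processing unit decoded in the previous cycle, and that the DPC is written in the first decoding cycle of each instruction.

```python
class PCCalculationUnit:
    """Hypothetical model of the TPC/DPC update performed with the PC adder."""

    def __init__(self, start_address):
        self.tpc = start_address          # head address of the previously decoded unit
        self.dpc = start_address          # PC value of the instruction being decoded
        self.prev_length = 0              # code length of the previously decoded unit
        self.prev_ended_instruction = True  # the first unit starts a new instruction

    def decode_cycle(self, code_length, ends_instruction):
        """Model one decoding cycle; return the PC adder output (PO)."""
        pia = self.tpc                    # PIA takes the head address held in TPC
        pib = self.prev_length            # PIB takes the code length of the previous unit
        po = pia + pib                    # head address of the unit now being decoded
        self.tpc = po                     # written back to TPC via the PO bus
        if self.prev_ended_instruction:
            self.dpc = po                 # this unit is the first of a new instruction
        # The length of the current unit is known only after it has been decoded.
        self.prev_length = code_length
        self.prev_ended_instruction = ends_instruction
        return po


pc_unit = PCCalculationUnit(start_address=0x100)
pc_unit.decode_cycle(code_length=2, ends_instruction=False)  # instruction A, unit 1
pc_unit.decode_cycle(code_length=4, ends_instruction=True)   # instruction A, unit 2
dpc_during_a = pc_unit.dpc                                   # still the head of A
pc_unit.decode_cycle(code_length=2, ends_instruction=True)   # instruction B, unit 1
dpc_during_b = pc_unit.dpc                                   # head of B = 0x100 + 2 + 4
```

The example addresses and code lengths are arbitrary; they only show that the DPC changes once per instruction while the TPC advances once per decoding processing unit.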
Next, a mechanism for performing branch processing at the instruction decoding stage is described. The following describes the operation when processing an unconditional branch instruction (BRA instruction) which designates the branch target address by a displacement from the instruction head address.
At the time point of finishing the decoding cycle of the unconditional branch instruction, the PC value of the unconditional branch instruction is stored in the DPC 354. At the D stage, the branch target address is calculated in the next cycle. The PC adder 356 calculates the branch target address by adding the branch displacement taken into the PIB 358 from the instruction decoding unit 352 via the displacement bus 361 and the PC value of the branch instruction taken into the PIA 357 from the DPC 354, and transfers the addition result to the instruction fetch unit 351 via the PO 359 and the IA bus 363. The addition result is also written back to the TPC 355 via the PO 359 and the PO bus 364 for initializing the PC calculation unit 353. The instruction fetch unit 351 fetches the branch target instruction on the basis of the branch target address taken in via the IA bus 363.
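The branch target address calculation described above amounts to adding the branch displacement to the PC value of the branch instruction. A minimal sketch follows; the function name, the 32-bit address width and the wrap-around behavior are assumptions made for illustration and are not stated in the description.

```python
def branch_target_address(branch_pc, displacement, address_bits=32):
    """Add a signed displacement to the PC value of the branch instruction.

    branch_pc corresponds to the value taken into PIA from the DPC, and
    displacement to the value taken into PIB via the displacement bus.
    The result is wrapped to address_bits bits (an assumption).
    """
    mask = (1 << address_bits) - 1
    return (branch_pc + displacement) & mask


forward_target = branch_target_address(0x1000, 0x40)    # branch forward by 0x40 bytes
backward_target = branch_target_address(0x1000, -0x20)  # branch backward by 0x20 bytes
```

In the mechanism of FIG. 1 the same result would be sent both to the instruction fetch unit and back to the TPC, so that subsequent PC calculation continues from the branch target.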
FIG. 2 shows a timing chart of processing the unconditional branch instruction (BRA instruction).
In FIG. 2, reference character In-1 designates the instruction immediately before the BRA instruction, and reference character Ibt designates a branch target instruction. As shown in FIG. 2(b), the unconditional branch instruction is decoded in the C1 cycle; as shown in FIG. 2(c), the branch target address is calculated in the C2 cycle; as shown in FIG. 2(a), the branch target instruction is fetched in the C3 cycle; and as shown in FIG. 2(b), the branch target instruction is decoded in the C4 cycle. As such, since the unconditional branch instruction can be processed in 3 clock cycles by performing the branch processing at the instruction decoding stage, the performance is improved as compared with the case wherein the branch processing is performed at the instruction executing stage. However, from the standpoint of the instruction decoding stage, there are still 2 idle clock cycles, namely the C2 and C3 cycles.
As stated above, the conventional data processor attempts to increase the speed of the branch processing by adding a small amount of hardware and performing the branch processing at the instruction decoding stage.
In the above-mentioned conventional example, the instruction which is branch processed at the instruction decoding stage is limited to the branch instruction whose jump target address is designated in the PC relative addressing mode, and the jump instruction whose jump target address is designated by an operand designator is not subjected to jump processing at the instruction decoding stage. In the above-mentioned conventional example, also for the unconditional branch instruction, it is not branched at the instruction decoding stage, but processed at the executing stage.
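The cycle counts in the timing chart can be reproduced with a small calculation. In the following sketch the function and parameter names are hypothetical, and the defaults of one cycle for the address calculation and one cycle for the fetch follow the simplification that each unit of processing takes one clock cycle.

```python
def branch_timeline(address_calc_cycles=1, fetch_cycles=1):
    """Cycle numbers for a branch decoded in cycle C1 (cycle numbering starts at 1)."""
    decode_branch = 1                                     # C1: branch decoded
    address_ready = decode_branch + address_calc_cycles   # C2: target address calculated
    fetch_done = address_ready + fetch_cycles             # C3: target instruction fetched
    decode_target = fetch_done + 1                        # C4: target instruction decoded
    return {
        "decode_branch": decode_branch,
        "address_ready": address_ready,
        "fetch_done": fetch_done,
        "decode_target": decode_target,
        # Cycles from decoding the branch to decoding its target.
        "branch_cycles": decode_target - decode_branch,
        # Decoding-stage cycles in which no instruction is decoded.
        "idle_decode_cycles": decode_target - decode_branch - 1,
    }


timeline = branch_timeline()
```

With the default values the target is decoded in the fourth cycle, giving the 3 clock cycles per branch and the 2 idle decoding cycles stated in the description.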
Furthermore, among conventional data processors, there is one which includes a plurality of instruction buffers so as to process the conditional branch instruction rapidly, performs branch prediction, and takes in the branch target instruction before the branch condition of the conditional branch instruction is fixed, so as to pipeline-process the predicted instruction. One example of such a data processor is the IBM System/370 Model 168-3 (Umino, "Internal Design and Performance of IBM 3033 Processor", "Nikkei Electronics Books" Large general-Purpose Computer "Nikkei McGraw-Hill", pp. 251-263, May 31, 1982).
As one example of the conventional data processor, an internal construction of the above-mentioned IBM System/370 Model 168-3 is shown in a block diagram of FIG. 3.
This conventional data processor comprises a main memory mechanism 371 which stores instructions and operand data, a main memory control mechanism 372 which controls the main memory mechanism 371 and includes a cache, an address converting mechanism and a TLB (Translation Lookaside Buffer), an instruction pre-processing mechanism 373 which performs the pre-processing necessary for executing the instruction, such as decoding the instruction and generating the operand address, and an executing mechanism 374 for executing the instructions. The instruction pre-processing mechanism 373 includes two instruction buffers (IB1, IB2) 375, 376, two instruction address registers (IAR1, IAR2) 377, an instruction register 378, an instruction decoder 379, a decoded instruction register 380 and operand address registers (OAR1, OAR2) 381.
In the IBM System/370 Model 168-3, as a conventional data processor having such a configuration, the branch target instruction is fetched during branch instruction processing by utilizing the two sets of instruction buffers 375, 376 and instruction address registers 377, and decoding of the predicted instruction is continued in accordance with the result of static branch prediction based on the instruction or a mask value (branch condition). There is also a data processor, such as the IBM System/370 Model 3033, which comprises three sets of instruction buffers so as to prefetch the branch target instruction of a second conditional branch instruction and thereby pipeline-process two conditional branch instructions efficiently.
As stated above, in the conventional data processor, the conditional branch instruction is processed efficiently by providing a plural number of instruction queues.
As described above, a conventional data processor which performs branch processing at the instruction decoding stage calculates the branch target address after decoding the branch instruction. Hence, it is problematic in that at least 2 idle clock cycles are produced between decoding the branch instruction and decoding the branch target instruction. It is also problematic in that the jump processing cannot be performed at the instruction decoding stage for a jump instruction whose jump target address is designated by an operand designator.
In the conventional data processor, though a plurality of instruction buffers are provided so as to process the conditional branch instruction efficiently, it is necessary to provide three instruction buffers so as to pipeline-process two conditional branch instructions efficiently, and thus the amount of hardware is increased.
Furthermore, in the conventional data processor which performs branch processing at the instruction decoding stage, the performance is improved when the unconditional branch instruction is not processed at the executing stage. However, where the data processor includes a step execution mode in which instructions are executed one at a time so as to debug a program, or where an exception is to be detected when the jump target address does not satisfy a designated boundary condition, the data processor does not operate properly if it is simply constituted such that the unconditional branch instruction is not processed at the executing stage.