1. Field of the Invention
The present invention relates to a data processing apparatus and data processing method for use in a digital computer or the like, and more particularly to a data processing apparatus having a pipelined architecture for concurrently carrying out instruction fetching and decoding and so on, and a data processing method used in the data processing apparatus having the pipelined architecture.
2. Description of the Related Art
A conventional data processing apparatus having a pipelined architecture is disclosed in Japanese Patent Publication Laying-Open No. 1990-42534, for example. FIG. 1 shows a block diagram of the conventional data processing apparatus. It is to be noted that this block diagram is a modified version of the original diagram included in the above publication, to clarify the differences between the conventional apparatus and an apparatus according to the present invention.
In FIG. 1:
Numeral 51 denotes an instruction fetch unit for fetching instruction codes stored in a memory not shown.
Numeral 52 denotes an instruction decode unit for decoding the instruction codes fetched and outputting control data.
Numeral 53 denotes an operand fetch unit for fetching operands stored in a memory or I/O device not shown (hereinafter referred to as external operands).
Numeral 54 denotes an execution unit including arithmetic units such as a floating-point unit 54a and an integer unit 54b having general registers or floating-point data registers, not shown, for carrying out floating-point operations and integer operations.
Numeral 55 denotes an operand store unit for writing the external operands, that is, for writing results of operations to a memory or I/O device. Details and operations of the operand store unit 55 will be omitted from the following description.
Numeral 56 denotes an I/O bus for connecting the data processing apparatus to the memory, I/O device and the like.
Numeral 57 denotes a bus control unit for arbitrating an instruction code fetch request from the instruction fetch unit 51, an external operand fetch request from the operand fetch unit 53, and an external operand write request from the operand store unit 55 and controlling the I/O bus 56.
Numeral 58 denotes a pipeline control unit for controlling the units 51-55.
Specifically, the instruction decode unit 52 decodes an instruction code fetched by the instruction fetch unit 51, and outputs control data regarding execution of an operation (hereinafter referred to as operation control data) to the operand fetch unit 53. When the decoded data shows a necessity for fetching an external operand, the instruction decode unit 52 computes an operand address and outputs control data regarding external operand fetching (hereinafter referred to as fetch control data), along with the operation control data, to the operand fetch unit 53.
When both fetch control data and operation control data are outputted from the instruction decode unit 52, the operand fetch unit 53 fetches an external operand based on an operand address included in the fetch control data, and transmits the fetched external operand and the operation control data received from the instruction decode unit 52 to the execution unit 54. When only operation control data is outputted from the instruction decode unit 52, the operand fetch unit 53 transmits only this operation control data to the execution unit 54.
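The forwarding behavior of the operand fetch unit 53 described above may be sketched in Python as follows. This is only an illustrative model: the function name, the dictionary used to stand in for the memory, and the "address" key are assumptions introduced here and do not appear in the cited publication.

```python
def operand_fetch(operation_control, fetch_control, memory):
    """Return what is handed to the execution unit for one operand fetch slot.

    fetch_control is None when no external operand is needed; otherwise it is
    a dict holding the operand address computed by the decode unit
    (hypothetical representation).
    """
    if fetch_control is None:
        # Only the operation control data is passed on; no external fetch.
        return operation_control, None
    # Fetch the external operand at the address supplied by the decode unit.
    operand = memory[fetch_control["address"]]
    return operation_control, operand
```

In either case one slot of the operand fetch unit is consumed, which is consistent with the one clock cycle that is spent even when no external operand is fetched.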
The execution unit 54 carries out floating-point operations and integer operations based on the operation control data received from the operand fetch unit 53 and using the external operands fetched by the operand fetch unit 53 and/or operands stored in the general registers or floating-point data registers described later (hereinafter referred to as register operands). Results of the operations are stored in the general registers or floating-point data registers, or outputted to the operand store unit 55 for writing to the memory or the like.
The following numbers of clock cycles are required for the foregoing units 51-54 to carry out a single process (with one machine cycle assumed to be one clock cycle):
(a) one clock cycle per word (32 bits) for the instruction fetch unit 51;
(b) one clock cycle per halfword (16 bits) for the instruction decode unit 52;
(c) one clock cycle per operand fetch for the operand fetch unit 53 (however, one clock cycle is required to transmit the operation control data to the execution unit 54 even when an external operand is not fetched); and
(d) two clock cycles per floating-point operation to be executed, and one clock cycle per integer operation to be executed, for the execution unit 54 (however, the floating-point operation and integer operation may be executed concurrently).
The instructions executed by the data processing apparatus have formats as shown in FIGS. 6(a) and (b), for example.
The instruction in the format shown in FIG. 6(a) has a length of two halfwords. The first halfword is composed of an operation code OP1 and a source operand addressing designation SRC. The second halfword is composed of an operation code OP2 and a destination operand addressing designation DEST.
The instruction in the format shown in FIG. 6(b) has a length of one halfword, which is composed of an operation code OP, and a source operand and destination operand addressing designation SRC-DEST.
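The two formats may be illustrated by the following Python sketch. The widths of the individual fields are not stated in the text; an 8-bit operation code in the high byte and an 8-bit addressing designation in the low byte of each halfword are assumed here purely for illustration, and all function names are hypothetical.

```python
def split_halfword(halfword):
    """Split a 16-bit halfword into (high byte, low byte).

    The 8/8 split is an assumption; the text does not give field widths.
    """
    return (halfword >> 8) & 0xFF, halfword & 0xFF


def decode_format_a(first, second):
    """Two-halfword format: OP1 with SRC, then OP2 with DEST."""
    op1, src = split_halfword(first)
    op2, dest = split_halfword(second)
    return {"OP1": op1, "SRC": src, "OP2": op2, "DEST": dest}


def decode_format_b(halfword):
    """One-halfword format: OP with a combined SRC-DEST designation."""
    op, src_dest = split_halfword(halfword)
    return {"OP": op, "SRC-DEST": src_dest}
```

Whatever the actual field widths, the point relevant to the timing discussion is that the first format occupies the decoder for two halfwords and the second for one.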
Specific examples of operations of the conventional data processing apparatus having the above construction will be described hereinafter. The following operations are based on Instruction 1 and Instruction 2 which are instructions in the format shown in FIG. 6(a) for floating-point operations, in which source operands are external operands, and destination operands are floating-point data register operands, and Instruction 3 which is in the format shown in FIG. 6(b) for an integer operation, in which the source operand and destination operand are both register operands.
FIG. 2 is a timing chart showing operations of the conventional data processing apparatus, which shows the instructions processed by the instruction fetch unit 51, instruction decode unit 52, operand fetch unit 53 and execution unit 54 at each of clock cycles t1, t2 and so on.
Clock Cycle t1:
The instruction fetch unit 51 fetches Instruction 1 (1:IF).
Clock Cycle t2:
The instruction decode unit 52 decodes the first halfword in Instruction 1 (1-1:DEC).
Clock Cycle t3:
The operand fetch unit 53 fetches the source operand for Instruction 1 from the memory or the like in response to the result of decoding of the first halfword in Instruction 1 (1-1:OF). The instruction decode unit 52 decodes the second halfword in Instruction 1 (1-2:DEC). The instruction fetch unit 51 fetches Instruction 2 (2:IF).
Clock Cycle t4:
The operand fetch unit 53 just transfers the operation control data outputted from the instruction decode unit 52 to the execution unit 54 since the destination operand is a register operand. That is, no operation takes place in relation to external operand fetching (1-2:nop). The instruction decode unit 52 decodes the first halfword in Instruction 2 (2-1:DEC).
Clock Cycle t5:
The floating-point unit 54a of the execution unit 54 executes a first step of the floating-point operation based on Instruction 1 (1:FP). The operand fetch unit 53 fetches the source operand for Instruction 2 from the memory or the like in response to the result of decoding of the first halfword in Instruction 2 (2-1:OF). The instruction decode unit 52 decodes the second halfword in Instruction 2 (2-2:DEC). The instruction fetch unit 51 fetches Instruction 3 (3:IF).
Clock Cycle t6:
The floating-point unit 54a of the execution unit 54 executes a second step of the floating-point operation based on Instruction 1 (1:FP). The operand fetch unit 53 just transfers the operation control data outputted from the instruction decode unit 52 since the destination operand is a register operand (2-2:nop). The instruction decode unit 52 decodes Instruction 3 (3:DEC).
Clock Cycle t7:
The floating-point unit 54a of the execution unit 54 executes a first step of the floating-point operation based on Instruction 2 (2:FP). The operand fetch unit 53 just transfers the operation control data outputted from the instruction decode unit 52 since the source operand and destination operand are both register operands (3:nop).
Clock Cycle t8:
The floating-point unit 54a of the execution unit 54 executes a second step of the floating-point operation based on Instruction 2 (2:FP). Concurrently therewith, the integer unit 54b executes the integer operation based on Instruction 3 (3:INT).
As described above, Instruction 1, for example, is fetched at clock cycle t1, and the floating-point operation is completed at clock cycle t6. Where a pipelined architecture is employed to execute subsequent Instructions 2 and 3, the units 51-55 operate concurrently to realize, in effect, a processing speed corresponding to two clock cycles per instruction.
Further, Instructions 2 and 3 may be completed simultaneously at clock cycle t8 by concurrently operating the floating-point unit 54a and integer unit 54b of the execution unit 54.
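The timing described for clock cycles t1 to t8 can be reproduced with a small Python model. The scheduling rules below are inferred from the timing chart as described in the text and are assumptions of this sketch, not statements taken from the cited publication: one halfword is decoded per cycle, each decoded halfword occupies the operand fetch unit in the following cycle, the next instruction is fetched in the cycle in which the last halfword of the current instruction is decoded, and execution starts once the last operand fetch slot has passed and the required arithmetic unit is free.

```python
FP_CYCLES, INT_CYCLES = 2, 1   # execution latencies stated in the text


def schedule(instructions):
    """Compute stage timings for a list of (halfwords, kind) tuples.

    kind is "FP" or "INT". Returns one dict per instruction mapping the
    stage names IF, DEC, OF and EX to lists of clock cycle numbers.
    """
    result = []
    unit_free = {"FP": 1, "INT": 1}   # first cycle at which each unit is idle
    fetch = 1                          # cycle in which the next IF occurs
    for halfwords, kind in instructions:
        # One halfword is decoded per cycle, starting the cycle after IF.
        dec = [fetch + 1 + k for k in range(halfwords)]
        # Each halfword takes an operand fetch slot one cycle after its decode.
        of = [cycle + 1 for cycle in dec]
        # Execution waits for the last OF slot and for the arithmetic unit.
        start = max(of[-1] + 1, unit_free[kind])
        length = FP_CYCLES if kind == "FP" else INT_CYCLES
        ex = list(range(start, start + length))
        unit_free[kind] = ex[-1] + 1
        result.append({"IF": [fetch], "DEC": dec, "OF": of, "EX": ex})
        # Assumed rule: the next IF overlaps the last DEC of this instruction.
        fetch = dec[-1]
    return result
```

Called as schedule([(2, "FP"), (2, "FP"), (1, "INT")]) for Instructions 1 to 3, this model places the execution of Instruction 1 at cycles 5 and 6, that of Instruction 2 at cycles 7 and 8, and that of Instruction 3 at cycle 8, in agreement with the timing described above.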
However, with a data processing apparatus having a pipelined structure, the improved processing speed due to the pipelined structure is not obtained when branching caused by a conditional jump, an unconditional jump, or a subroutine call occurs. That is, execution of a branch instruction instantly effects what is known as a pipeline flush, which invalidates the instructions already fetched and decoded and operands fetched from the memory. Subsequently, branched processing is carried out for fetching and decoding an instruction, fetching an operand, and executing the instruction. Until this instruction is executed, effective concurrent operations of the units 51-55 do not take place.
A penalty accompanying the pipeline flush increases with the number of pipeline stages.
In the conventional data processing apparatus described above, the operand fetch unit 53 fetches an operand in an independent pipeline stage. Thus, the conventional apparatus has a large number of pipeline stages, which results in the disadvantage of little improvement in the processing speed when branching instructions are executed frequently.
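As a rough illustration of why the penalty grows with the number of pipeline stages, the following Python sketch uses a simplified cost model. The model is an assumption made here and is not part of the cited publication: each executed branch is assumed to discard the work in every stage ahead of execution, costing (stages - 1) refill cycles on top of a base number of clock cycles per instruction. The function name and its parameters are hypothetical.

```python
def effective_cycles_per_instruction(base_cycles, branch_fraction, stages):
    """Average clock cycles per instruction under a simple flush model.

    base_cycles:     cycles per instruction when the pipeline stays full
    branch_fraction: fraction of executed instructions that cause a flush
    stages:          number of pipeline stages
    """
    # Assumption: a flush costs one refill cycle per stage ahead of execution.
    flush_penalty = stages - 1
    return base_cycles + branch_fraction * flush_penalty
```

Under this model, with a base of two clock cycles per instruction and one branch in every five instructions, four stages give about 2.6 cycles per instruction while three stages give about 2.4, so that the benefit of the pipeline shrinks as stages are added and as branches become more frequent.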
It would be futile, however, to abandon the pipelined architecture for operand fetching and execution of arithmetic operations merely in order to decrease the number of pipeline stages. This is because an execution could not be carried out until all the operands necessary for the execution have been fetched, and operands for the next instruction could not be fetched until the execution currently in progress is completed. This results in a reduced processing speed regardless of whether a branching instruction is executed or not.
Further, when the floating-point unit 54a and integer unit 54b concurrently execute a preceding instruction and a succeeding instruction, an overflow or a trap due to an execution exception related to execution of the preceding instruction may occur. In such a case, the conventional data processing apparatus has the disadvantage of occasionally failing to effect proper exception processing and to restart the succeeding instruction properly after the exception processing.
That is, when a trap due to an execution exception occurs with the preceding instruction and exception processing is carried out, resources such as an operand and stack pointer necessary for the exception processing may be rewritten by execution or operand fetching for the succeeding instruction proceeding concurrently with the preceding instruction. (The stack pointer may be rewritten only by fetching of an operand necessary for execution of the succeeding instruction.) In such a case, proper exception processing is not carried out.
The succeeding instruction executed concurrently with the preceding instruction must be executed all over again based on results of the exception processing. However, the succeeding instruction cannot be restarted properly if the operand or stack pointer is renewed by the first execution, that is, if the succeeding instruction indicates the same address for reading and writing or calls for a stack operation.
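The two restart hazards mentioned above may be illustrated by the following Python sketch. The instructions, addresses and values are hypothetical examples chosen here; the argument times is 1 for a single correct execution and 2 for a first execution followed by a naive restart after the exception processing.

```python
def run_read_modify_write(times):
    """Succeeding instruction reads and writes the same address: mem[a] += 5."""
    memory = {0x200: 100}
    for _ in range(times):
        memory[0x200] = memory[0x200] + 5
    return memory[0x200]


def run_stack_pop(times):
    """Succeeding instruction pops a value from the stack into register r0."""
    stack = [10, 20, 30]      # top of stack is the last element
    sp = len(stack)           # stack pointer: one past the top element
    r0 = None
    for _ in range(times):
        sp -= 1               # the stack pointer is renewed by each execution
        r0 = stack[sp]
    return sp, r0
```

In both cases the result after a restart differs from that of a single execution: the memory location is incremented twice, and the second pop returns the element below the intended one while leaving the stack pointer one position too low.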
A solution to this problem has been proposed in U.S. Pat. No. 4,879,676, for example. According to this technique, occurrence of an execution exception is predicted at a step of exponential processing carried out in an initial stage of a floating-point operation. Only when non-occurrence of an execution exception is assured are the remaining steps of the floating-point operation executed concurrently with an integer operation on a succeeding instruction.
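The general idea of such a prediction may be sketched in Python as follows. This is not the circuit of the cited patent, whose details are not given here; it is a simplified illustration, under assumptions made here, that considers only overflow in a double-precision multiplication and assumes finite, nonzero operands. The function names are hypothetical.

```python
import math
import sys


def may_overflow_mul(a, b):
    """Conservatively predict overflow of a * b from the exponents alone.

    math.frexp returns (m, e) with 0.5 <= |m| < 1, so the exponent of the
    product is at most ea + eb. Finite doubles have exponents up to
    sys.float_info.max_exp, so a larger sum means overflow is possible.
    """
    _, ea = math.frexp(a)
    _, eb = math.frexp(b)
    return ea + eb > sys.float_info.max_exp


def may_run_concurrently(a, b):
    """Allow the succeeding integer operation to overlap only when safe."""
    return not may_overflow_mul(a, b)
```

The check is conservative: it never misses a real overflow, but it may also flag a product that turns out to be finite, for example 2.0**512 multiplied by 2.0**511, in which case concurrent execution would be withheld unnecessarily.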
However, this technique requires large-scale hardware for predicting occurrence of an execution exception, resulting in a greatly increased hardware cost. In addition, the data processing apparatus must have an increased number of circuits, which makes it difficult to increase clock frequency and to increase processing speed to a large degree.