1. Field of the Invention
This invention relates to a processor, and, more particularly, to a processor which bypasses data upon pipeline processing.
2. Description of the Related Art
FIG. 8 is a diagram showing an internal structure of a conventional processor, and FIG. 9 is a diagram showing pipeline stages in each pipeline of the processor shown in FIG. 8.
As shown in FIG. 9, the processor of FIG. 8 has five pipeline stages, namely, the “I stage” for fetching an arithmetic instruction, the “R stage” for decoding an instruction and reading a register out of a register file, the “A stage” for an arithmetic operation, the “D stage” for accessing a data cache, and the “W stage” for writing back arithmetic results to the register file. In this processor, in addition to performing the arithmetic operations of instructions, the “A stage” is in charge of judging the condition of a conditional branch instruction and determining whether the branch is taken or not.
As shown in FIG. 8, the processor includes, mainly, an instruction fetch unit 110, register file 120, a bypass select logic circuit 130, two pipelines 140, 150, and registers RG101 through RG106.
Those two pipelines 140, 150 form an arithmetic unit and can execute instructions simultaneously. That is, this processor is a 2-way superscalar processor.
In the example of FIG. 8, the pipeline 140 includes an ALU 142, registers RG110 to RG113, and bypass multiplexers 144, 146, and executes an ALU arithmetic instruction. The pipeline 150 includes a branch unit 152, registers RG120 to RG122, and bypass multiplexers 154, 156, and executes a branch instruction. Only the ALU 142 and the branch unit 152 are shown here for simplicity; each of these pipelines 140, 150 has other arithmetic devices as well.
In the “I stage”, the instruction fetch unit 110 reads out an arithmetic instruction from the instruction cache memory (not shown), then discerns the category of this arithmetic instruction, and sends out an executable instruction to the arithmetic unit. That is, the instruction fetch unit 110 fetches the arithmetic instruction and separates it into an instruction part and an operand part. Although not shown in FIG. 8, depending on the category of the instruction part of the arithmetic instruction, it sends an ALU instruction to the pipeline 140 having the ALU 142 and a conditional branch instruction to the pipeline 150 having the branch unit 152.
On the other hand, the instruction fetch unit 110 outputs source operand numbers Rs0R, Rt0R, Rs1R and Rt1R in the operand part of the arithmetic instruction to the register file 120. The source operand numbers Rs0R and Rt0R are the source operand numbers of instructions to be issued to the pipeline 140, whereas the source operand numbers Rs1R and Rt1R are the source operand numbers of instructions to be issued to the pipeline 150.
Additionally, the instruction fetch unit 110 outputs destination operand number Rd0R in the operand part of the arithmetic instruction to a register RG101. This destination operand number Rd0R represents the number of the destination operand of the instruction to be issued to the pipeline 140. These source operand numbers Rs0R, Rt0R, Rs1R, Rt1R and destination operand number Rd0R are 5-bit signals. That is, here it is assumed that the processor has 32 registers.
Therefore, instruction mnemonics can be expressed as:
Add Rd, Rs, Rt
In the codes indicating the various signals used in the present specification, the last letter of each code indicates the stage which the signal has reached. For instance, a destination operand number Rd0R in the “R stage” becomes Rd0A when it reaches the “A stage”.
The instruction fetch unit 110 outputs branch delay slot information BDS0R and instruction valid information Valid1R. The branch delay slot information BDS0R is a signal indicating whether or not an instruction in the pipeline 140 is in the branch delay slot of a branch likely instruction. The branch delay slot is the instruction positioned just after a conditional branch instruction in a row of instructions. In the example of FIG. 8, BDS0R becomes 1 if the instruction is just after a branch likely instruction, and becomes 0 otherwise. In this instruction set architecture (ISA), the one instruction existing in the branch delay slot is executed in principle whether or not the condition of the immediately preceding conditional branch instruction has been established. That is, any instruction just after a normal conditional branch instruction is executed unconditionally. However, in the case of a branch likely instruction in this instruction set architecture, the one instruction existing in the branch delay slot is not executed when the branch likely instruction is not taken. When it is taken, the instruction in the branch delay slot is executed.
One example of normal conditional branch instructions is shown in FIG. 10. In FIG. 10, the Add instruction is an instruction of adding contents of the register r2 and contents of the register r3 and storing the result in the register r1. The BNE instruction is an instruction which establishes a branch when contents of the register Rs and contents of the register Rt are different. That is, when (contents of register r1)≠(contents of register r2), the branch is established by the BNE instruction, and the process returns to the Add instruction labeled Loop. However, the Sub instruction existing in the branch delay slot is executed even when the branch is established. That is, the instruction execution sequence is as follows:
Add → . . . → BNE → Sub → Add.
On the other hand, when the branch instruction is not taken, since the row of instructions is directly executed sequentially, the instruction execution sequence is as follows:
Add → . . . → BNE → Sub → Iw.
Here, the Sub instruction is an instruction of subtracting contents of the register r5 from contents of the register r4 and storing the result in the register r3. The Iw instruction is an instruction of loading data from the memory address 0+(contents of register r7) into the register r6.
The above is the row of execution of the normal conditional branch instruction. Next explained is a row of execution of branch likely instruction. FIG. 11 is a diagram showing a row of execution of branch likely instruction. As mentioned above, the branch likely instruction is an instruction for executing an instruction in the branch delay slot when branch is established, but not executing the instruction in the branch delay slot when branch is not established.
As shown in FIG. 11, the BNEL instruction establishes a branch, and causes the process to return to the Loop label and execute the Add instruction, when Rt≠Rs, namely, (contents of register r1)≠(contents of register r2). Additionally, the Sub instruction in the branch delay slot is executed when the branch is established. Therefore, the instruction execution sequence is as follows:
Add → . . . → BNEL → Sub → Add
On the other hand, in case of the branch likely instruction, the branch delay slot is not executed when branch is not established. Therefore, the instruction execution sequence is as follows:
Add → . . . → BNEL → Iw
In this manner, the branch likely instruction differs from the normal conditional branch instruction in how processing proceeds when the branch is not established: the instruction in the branch delay slot is not executed.
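The four execution sequences described above can be summarized in a short sketch. It is only an illustrative model of the rows of instructions of FIG. 10 and FIG. 11; the function name `execution_sequence` and its parameters are hypothetical, and the instructions elided by ". . ." in the sequences are omitted.

```python
def execution_sequence(branch_likely, taken):
    """Return the instructions executed around the branch, in order.

    branch_likely: True for BNEL (branch likely), False for BNE (normal).
    taken:         True when the branch condition is established.
    Hypothetical helper; instructions elided in the text are left out.
    """
    branch = "BNEL" if branch_likely else "BNE"
    sequence = ["Add", branch]
    # The delay-slot instruction (Sub) is skipped only when a branch likely
    # instruction is not taken; in every other case it is executed.
    if taken or not branch_likely:
        sequence.append("Sub")
    # A taken branch returns to the Add instruction labeled Loop;
    # otherwise execution falls through to the Iw instruction.
    sequence.append("Add" if taken else "Iw")
    return sequence


if __name__ == "__main__":
    for likely in (False, True):
        for taken in (True, False):
            print(likely, taken, " -> ".join(execution_sequence(likely, taken)))
```

Only the combination of a branch likely instruction and a branch that is not established drops the Sub instruction from the sequence.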
The explanation now returns to FIG. 8. The four source operand numbers Rs0R, Rt0R, Rs1R and Rt1R sent from the instruction fetch unit 110 in the “R stage” are input to the register file 120. In the register file 120, contents of the registers corresponding to these source operand numbers Rs0R, Rt0R, Rs1R and Rt1R are read out. That is, the data stored in the individual registers are read out, and these data are taken as the source operands.
In this example, each source operand is 64-bit data, and these data, thus read out, are held in the registers RG110, RG111, RG120 and RG121, and sent to the next “A stage”.
In the pipeline 140 in the “A stage”, unless a data bypass occurs, the source operand read from the register file 120 is transferred to ALU 142 and undergoes an arithmetic operation. A data bypass occurs in the following case.
That is, if instructions in a data dependent relationship are closely positioned, the dependent instruction is executed before the result of the preceding arithmetic operation is written back to the register file 120. Therefore, it is necessary to bypass the result of the preceding operation directly to the bypass multiplexers 144, 146, 154 and 156 without passing through the register file 120. FIG. 12 is a diagram showing a row of instructions including instructions in a data dependent relationship in close locations.
In FIG. 12, the result of the Add instruction is stored in the register r1, and the register r1 is used also as the source operand of the Sub instruction. When two instructions are close to each other in this manner, the result of the Add instruction has to be supplied from the “D stage” to the “A stage” of the Sub instruction by using an internal bypass DA. This is called a data bypass from the “D stage” to the “A stage”. For similar reasons, it may occur that data should be bypassed from the “W stage” to the “A stage” by using the bypass WA.
When the arithmetic operation in ALU 142 is finished, its result is stored in the register RG112 in the “D stage”. In the “D stage”, although not shown here, the data cache memory is accessed. Therefore, the result of the operation of ALU 142 is held in the register RG112 only during the “D stage” to synchronize the timing for writing into the register file 120. Then, in the next cycle, it is stored in the register RG113 in the “W stage”, and written back to the register file 120.
In the pipeline 150 having the branch unit 152, BNE instruction, BEQ instruction, BNEL instruction and BEQL instruction are processed.
BNE instruction is a normal conditional branch instruction, and branch is established when two source operands are not equal. BEQ instruction is also a normal conditional branch instruction, and branch is established when two source operands are equal. BNEL instruction is a branch likely instruction, and branch is established when two source operands are not equal. BEQL instruction is also a branch likely instruction, and branch is established when two source operands are equal.
FIG. 13 is a diagram showing an internal structure of the branch unit 152. As shown in FIG. 13, an operand P1RsA from the bypass multiplexer 154 and an operand P1RtA from the bypass multiplexer 156 are input to the branch unit 152 in the “A stage”. As mentioned before, the operands P1RsA and P1RtA are 64-bit data.
The operands P1RsA, P1RtA input to the branch unit 152 are introduced into a compare logic 160 for all bit comparison. Then, the compare logic 160 outputs 1 when these operands P1RsA and P1RtA are equal, and outputs 0 when they are not equal.
The output of the compare logic 160 is input to an AND circuit 161 in an inverted form and to an AND circuit 162 directly without being inverted. The AND circuit 161 is supplied with a decode BNE signal (DBNE) as well, and the AND circuit 162 is supplied with a decode BEQ signal (DBEQ) as well. The decode BNE signal (DBNE) is a signal which becomes 1 when the BNE instruction or the BNEL instruction reaches the “A stage”, and the decode BEQ signal (DBEQ) is a signal which becomes 1 when the BEQ instruction or the BEQL instruction reaches the “A stage”.
The outputs from these AND circuits 161, 162 are input to an OR circuit 163. The output from the OR circuit 163 is input to a NAND circuit 164 which is also supplied with instruction valid information Valid1A. This instruction valid information Valid1A is a signal indicating that a valid instruction has reached the “A stage” of the pipeline 150. As shown in FIG. 8, the instruction valid information is output in the “R stage”, together with an instruction, by the instruction fetch unit 110 to the register RG103. It is then transferred to the register RG104 in the “A stage” along the pipeline stages.
The output of the NAND circuit 164 is a branch condition not-taken signal NTknA and is the output of the branch unit 152. Since the branch unit 152 is configured as shown in FIG. 13, in the case where the operands P1RsA and P1RtA are equal and the instruction is the BEQ instruction or the BEQL instruction, the branch is taken, and the branch condition not-taken signal NTknA becomes 0. In the case where the operands P1RsA and P1RtA are not equal and the instruction is the BNE instruction or the BNEL instruction, the branch is taken, and the branch condition not-taken signal NTknA becomes 0. In all cases other than these two cases, the branch condition not-taken signal NTknA becomes 1.
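The gate-level behavior of the branch unit 152 described above can be sketched as follows. This is a behavioral model only, not a description of the actual circuit implementation; the function name `branch_not_taken` and the parameter names are chosen for illustration, and every signal is modeled as an integer 0 or 1.

```python
MASK64 = (1 << 64) - 1  # operands P1RsA and P1RtA are 64-bit data


def branch_not_taken(p1_rs_a, p1_rt_a, dbne, dbeq, valid1a):
    """Behavioral sketch of the branch unit 152; returns NTknA (0 or 1)."""
    # Compare logic 160: 1 when all 64 bits of the operands are equal.
    equal = int((p1_rs_a & MASK64) == (p1_rt_a & MASK64))
    # AND circuit 161: inverted comparison result AND decode BNE signal.
    and161 = (1 - equal) & dbne
    # AND circuit 162: comparison result AND decode BEQ signal.
    and162 = equal & dbeq
    # OR circuit 163 combines both "branch taken" conditions.
    or163 = and161 | and162
    # NAND circuit 164 with the instruction valid information Valid1A.
    return 1 - (or163 & valid1a)


if __name__ == "__main__":
    print(branch_not_taken(5, 5, dbne=0, dbeq=1, valid1a=1))  # BEQ/BEQL, equal
    print(branch_not_taken(5, 6, dbne=1, dbeq=0, valid1a=1))  # BNE/BNEL, unequal
    print(branch_not_taken(5, 5, dbne=1, dbeq=0, valid1a=1))  # BNE/BNEL, equal
```

Note that in this model NTknA is also 1 whenever Valid1A is 0, which corresponds to the statement that the signal becomes 1 in all cases other than the two taken cases.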
The bypass select logic circuit 130 shown in FIG. 8 controls the bypass multiplexers 144, 146 in the pipeline 140, and the bypass multiplexers 154, 156 in the pipeline 150. This control is executed for transferring proper operands to ALU 142 and branch unit 152.
More specifically, the bypass select logic circuit 130 generates four select signals SelRs0, SelRt0, SelRs1 and SelRt1, and supplies them to the bypass multiplexers 144, 146, 154 and 156, respectively. The select signals SelRs0, SelRt0, SelRs1 and SelRt1 are one-hot 3-bit signals in which one bit in each 3-bit signal becomes 1.
In this example, when bit 0 is 1, the bypass multiplexers 144, 146, 154 and 156 select and output operands from the bypass DA from the “D stage”. When bit 1 is 1, the bypass multiplexers 144, 146, 154 and 156 select and output operands from the bypass WA from the “W stage”. When bit 2 is 1, the bypass multiplexers 144, 146, 154 and 156 select and output operands from the register file 120.
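The selection performed by each bypass multiplexer under a one-hot 3-bit select signal can be sketched as below. The function name `bypass_mux` and its parameter names are hypothetical; the rejection of a select value that is not one-hot is an assumption of this sketch and is not stated in the description.

```python
def bypass_mux(select, bypass_da, bypass_wa, regfile_operand):
    """Behavioral sketch of one bypass multiplexer (e.g. 144).

    select is a one-hot 3-bit value:
      bit 0 -> operand from the bypass DA (from the D stage)
      bit 1 -> operand from the bypass WA (from the W stage)
      bit 2 -> operand read from the register file 120
    """
    if select == 0b001:
        return bypass_da
    if select == 0b010:
        return bypass_wa
    if select == 0b100:
        return regfile_operand
    # Assumption of this sketch: any other pattern is treated as an error.
    raise ValueError("select signal must be one-hot over 3 bits")


if __name__ == "__main__":
    print(bypass_mux(0b001, bypass_da=11, bypass_wa=22, regfile_operand=33))
```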
Taking the select signal SelRt0 as an example, the operation of the bypass select logic circuit 130 is explained. The select signal SelRt0 is required to be generated in a cycle preceding the cycle where data is expected to be bypassed, then latched by a flip-flop, and thereafter supplied to the bypass multiplexer 144. In general, since the “A stage” in the pipeline is the stage where operators such as ALU 142 operate, it is the stage whose operation takes the longest time in most cases. Therefore, it is necessary to determine the operands to be applied to ALU 142, etc. at the earliest possible timing. For this purpose, operands must also be made to pass through the bypass multiplexer 144 at the earliest possible timing, and the select signal SelRt0 to be applied to the bypass multiplexer 144 must be decided earlier. Usually, therefore, the select signal SelRt0 should be generated in the “R stage” which precedes the “A stage”.
FIG. 14 is a diagram showing an example of the internal structure of the bypass select logic circuit 130. As shown in FIG. 14, the bypass select logic circuit 130 includes four select signal generating circuits 132A through 132D. The select signal generating circuits 132A through 132D generate select signals SelRs0, SelRt0, SelRs1, and SelRt1, respectively.
The select signal generating circuits 132A through 132D are similar in structure. For example, the select signal generating circuit 132A includes compare logic 172, 174, AND circuits 176, 178, 180, 182, and inverter circuits 184, 186.
The operation of the select signal generating circuit 132A is explained, taking a case where a data bypass from the “D stage” to the “A stage” occurs. A row of instructions causing the data bypass from the “D stage” to the “A stage” is the row of instructions shown in FIG. 12 explained above.
In the example shown in FIG. 12, the result of the arithmetic operation of the Add instruction has to be bypassed from the “D stage” to the “A stage” in the fourth cycle. For this purpose, the select signal SelRs0 has to be generated in the preceding third cycle.
In the third cycle, for the purpose of detecting data dependency between the Add instruction and the Sub instruction, the destination operand number Rd0A of the Add instruction having reached the “A stage” is compared with the source operand number Rs0R of the Sub instruction having reached the “R stage” by the compare logic 172. The compare logic 172 outputs 1 when the destination operand number Rd0A and the source operand number Rs0R coincide, and outputs 0 when they do not coincide. In this example, since the destination operand number Rd0A coincides with the source operand number Rs0R, the compare logic 172 outputs 1. Therefore, one of the inputs of the AND circuit 176 becomes 1.
The other input of the AND circuit 176 is supplied with a signal from a NAND circuit 190 which takes the NAND of the branch delay slot information BDS0R and the branch condition not-taken signal NTknA. That is, when the instruction in the “A stage” is the instruction next to a branch likely instruction and this branch likely instruction is not taken, the output of the NAND circuit 190 becomes 0. In this example, since the Add instruction is not in the branch delay slot, the output of the NAND circuit 190 becomes 1. Therefore, the output of the AND circuit 176 is 1, and bit 0 of the select signal SelRs0 becomes 1. On the other hand, since the output of 1 at the AND circuit 176 is input to the AND circuits 180, 182 through an inverter circuit 184, the outputs of the AND circuits 180, 182 become 0, and bit 1 and bit 2 of the select signal SelRs0 become 0. The select signal SelRs0 is latched in a flip-flop 192A.
The select signal SelRs0 latched in the flip-flop 192A is input to the bypass multiplexer 144 in the next cycle. Based on the select signal SelRs0, the bypass multiplexer 144 selects the operand input from the bypass DA from the “D stage”, and outputs it to ALU 142.
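The generation of the select signal SelRs0 can be sketched as follows. Only the path through the compare logic 172, the NAND circuit 190, the AND circuit 176 and the inverter circuit 184 is taken from the description above. The roles given here to the compare logic 174, the AND circuit 178 and the inverter circuit 186 (detecting a W-stage bypass by comparing a D-stage destination number `rd0d` with Rs0R) are assumptions made for illustration, as are the function name and the hypothetical qualifier input `d_stage_ok`.

```python
def select_rs0(rd0a, rs0r, rd0d, bds0r, ntkna, d_stage_ok=1):
    """Behavioral sketch of select signal generating circuit 132A.

    Returns the one-hot 3-bit select signal SelRs0 as an integer.
    rd0d and d_stage_ok are hypothetical inputs (assumptions of this sketch).
    """
    # NAND circuit 190: 0 only when the delay-slot instruction of a
    # branch likely instruction is cancelled (BDS = 1 and NTknA = 1).
    nand190 = 1 - (bds0r & ntkna)
    # Compare logic 172 and AND circuit 176 -> bit 0 (bypass DA).
    and176 = int(rd0a == rs0r) & nand190
    # Assumed: compare logic 174 and AND circuit 178 detect a match
    # against the destination number of the instruction one stage ahead.
    and178 = int(rd0d == rs0r) & d_stage_ok
    # Inverter 184 blocks bits 1 and 2 whenever bit 0 is selected.
    and180 = (1 - and176) & and178          # bit 1 (bypass WA), assumed wiring
    and182 = (1 - and176) & (1 - and178)    # bit 2 (register file), assumed wiring
    return and176 | (and180 << 1) | (and182 << 2)


if __name__ == "__main__":
    # FIG. 12 case: Add (r1 destination) in the A stage, Sub reads r1.
    print(bin(select_rs0(rd0a=1, rs0r=1, rd0d=9, bds0r=0, ntkna=1)))
    # Cancelled delay-slot case: the register file operand is selected.
    print(bin(select_rs0(rd0a=1, rs0r=1, rd0d=9, bds0r=1, ntkna=1)))
```

Under these assumptions exactly one bit of the result is always 1, matching the one-hot property of the select signals.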
Next explained is a case where an instruction in the branch delay slot is cancelled and a data bypass from the “D stage” to the “A stage” does not occur. FIG. 15 is a diagram showing a row of instructions causing the above-explained processing.
In the example shown in FIG. 15, the register r1 of the Add instruction coincides with the register r1 of the AND instruction. Additionally, the Add instruction comes next to the BNEL instruction, which is a branch likely instruction. Further assume here that, for this BNEL instruction, it is predicted that the branch will not be taken. Therefore, the AND instruction coming next to the Add instruction in the row of instructions is executed speculatively. In this case, if the condition of the BNEL instruction is not established, as predicted, the Add instruction positioned in the branch delay slot is cancelled, and therefore the data bypass from the Add instruction to the AND instruction is not effected. That is, in the fourth cycle, the bypass multiplexer 144 outputs to ALU 142 the source operand read from the register file 120 and held in the register RG110.
However, in the above-explained processor, since the bypass select logic circuit 130 has a long processing time, the operation frequency of the processor decreases. The reason is that, to judge whether the branch of the conditional branch instruction is taken or not, it is necessary to compare the operand Rs and the operand Rt over all 64 bits in the branch unit 152, and the result is not decided until near the end of the cycle.
This is explained in greater detail with reference to FIG. 16. FIG. 16 is a diagram explaining the operation timing of the branch condition not-taken signal NTknA and the select signal SelRs0 when the BNEL instruction and the AND instruction are positioned in consecutive cycles as shown in FIG. 15.
As shown in FIG. 16, in the first half of the “A stage” of the BNEL instruction, the bypass multiplexers 154 and 156 operate, and the operands to be input to the branch unit 152 are determined. After that, all bits of the 64-bit operands are compared in the branch unit 152. Therefore, it is nearly at the end of the “A stage” that the branch condition not-taken signal NTknA, which is the result of the comparison, is determined. As explained above, since the bypass select logic circuit 130 needs the branch condition not-taken signal NTknA to generate the select signals SelRs0, SelRt0, SelRs1 and SelRt1, the generation of the select signals SelRs0, SelRt0, SelRs1 and SelRt1 is inevitably further delayed.
In general, it is optimum that the cycle time in pipeline processing be set approximately in accordance with the time when the ALU arithmetic operation in the “A stage” comes to an end. When it is set in this manner, the highest operation frequency and the best hardware efficiency are expected.
However, the time required for processing in ALU 142 and the time required for processing in the branch unit 152 are substantially equal. Additionally, the timing at which an operand is applied to ALU 142 is substantially equal to the timing at which an operand is applied to the branch unit 152. As a result, little time remains in the cycle for the bypass select logic circuit 130, which uses the branch condition not-taken signal NTknA obtained as the comparison result. Therefore, the bypass select logic circuit 130 becomes a bottleneck which decreases the operation frequency of the processor.
It is therefore an object of the invention to ensure that the bypass select logic circuit never becomes a bottleneck in determining the operation frequency of a processor, and thereby to provide a processor with a high operation frequency.
According to the invention, there is provided a processor which executes pipeline processing having a plurality of stages, comprising:
an instruction fetch unit for fetching an arithmetic instruction to output a source operand number and a destination operand number of the arithmetic instruction, and when the arithmetic instruction is a conditional branch instruction, the instruction fetch unit predicting whether the condition of the conditional branch instruction will be established or not, and outputting its result as prediction result information;
a register file introducing the source operand number outputted from the instruction fetch unit to output a source operand corresponding to the source operand number;
a first pipeline having at least a bypass multiplexer and an arithmetic and logic unit, in which the bypass multiplexer is supplied with an arithmetic result operand which is a result of arithmetic operation of the arithmetic and logic unit and the source operand outputted from the register file, and the bypass multiplexer selecting one of the arithmetic result operand and the source operand in response to a select signal and outputting it to the arithmetic and logic unit;
a second pipeline including at least a branch unit for judging whether the condition of the branch instruction has been established or not; and
a bypass select logic circuit for generating the select signal by using at least the prediction result information.
According to the invention, there is further provided a processor for executing pipeline processing including at least five stages from a first stage to a fifth stage, and having as an instruction set architecture at least a normal conditional branch instruction for executing an arithmetic instruction positioned in a row of instructions next to a conditional branch instruction irrespective of whether the condition of the conditional branch instruction is established or not, and a branch likely instruction for executing the arithmetic instruction positioned in the row of instructions next to the conditional branch instruction only when the condition of the conditional branch instruction is established, comprising:
an instruction fetch unit for fetching at least two arithmetic instructions and outputting one or two source operand numbers and one destination operand number for each of the arithmetic instructions, the instruction fetch unit predicting upon one of the arithmetic instructions being a branch likely instruction whether the condition of the branch likely instruction is established or not, and outputting its result as prediction result information in the first stage;
a register file for receiving the source operand numbers outputted from the instruction fetch unit and outputting source operands corresponding to the source operand numbers in the second stage;
a first bypass multiplexer supplied with a first arithmetic result operand which is an arithmetic result operand having reached the fourth stage, a second arithmetic result operand which is an arithmetic result operand having reached the fifth stage and one of the source operands of one of the arithmetic instructions, which is outputted from the register file, to select one of the first arithmetic result operand, the second arithmetic result operand and one of the source operands of one of the arithmetic instructions in response to a first select signal and output it as a first arithmetic operand in the third stage;
a second bypass multiplexer supplied with the first arithmetic result operand having reached the fourth stage, the second arithmetic result operand having reached the fifth stage and the other of the source operands of one of the arithmetic instructions, which is outputted from the register file, to select one of the first arithmetic result operand, the second arithmetic result operand and the other of the source operands of the one of the arithmetic instructions in response to a second select signal and output it as a second arithmetic operand in the third stage;
an arithmetic and logic unit supplied with the first arithmetic operand and the second arithmetic operand to execute an arithmetic operation on the basis of the first arithmetic operand and the second arithmetic operand in the third stage and output the arithmetic result as an arithmetic result operand;
a third bypass multiplexer supplied with the first arithmetic result operand having reached the fourth stage, the second arithmetic result operand having reached the fifth stage and one of the source operands of the other of the arithmetic instructions, which is outputted from the register file, to select one of the first arithmetic result operand, the second arithmetic result operand and the one of the source operands of the other of the arithmetic instructions in response to a third select signal and output it as a first comparison operand in the third stage;
a fourth bypass multiplexer supplied with the first arithmetic result operand having reached the fourth stage, the second arithmetic result operand having reached the fifth stage and the other of the source operands of the other of the arithmetic instructions, which is outputted from the register file, to select one of the first arithmetic result operand, the second arithmetic result operand and the other of the source operands of the other of the arithmetic instructions in response to a fourth select signal and output it as a second comparison operand in the third stage;
a branch unit supplied with the first comparison operand and the second comparison operand in the third stage to compare the first comparison operand and the second comparison operand and judge whether the condition of the branch instruction is established or not; and
a bypass select logic circuit for generating the first to fourth select signals by using at least the prediction result information in the second stage.