1. Field of the Invention
The present invention relates to an information processing apparatus and method for performing a pipeline process using a branch prediction mechanism.
2. Description of the Related Art
A recent computer has a branch prediction mechanism that uses a table to speed up instruction processing, especially branch processing. The branch prediction mechanism stores in a branch target table prediction information indicating whether or not an instruction is a branch instruction and, if it is, whether or not the branch is predicted to be taken, together with the branch target instruction address. The table is searched using the instruction fetch address when an instruction is fetched. If the table indicates a branch instruction whose branch is predicted to be taken, a branch target instruction fetch is invoked using the corresponding branch target instruction address.
Thus, a branch target instruction fetch can be invoked at the earliest possible timing, without first fetching and decoding the instruction. As a result, the wait time for fetching a branch target instruction can be shortened, and the computer can perform its processes quickly.
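The table lookup described above can be sketched in software as follows. This is a simplified illustrative model, not the claimed hardware; the dictionary-based table and its field names are assumptions made for the sketch.

```python
# Minimal software model of a branch target table (BTB) lookup at
# instruction fetch time (illustrative structure, not the hardware).

def next_fetch_address(btb, fetch_addr, fetch_width=8):
    """Return the next instruction fetch address.

    If fetch_addr hits in the BTB and the entry predicts the branch
    taken, redirect the fetch to the stored branch target address;
    otherwise fetch sequentially (fetch_width bytes per fetch).
    """
    entry = btb.get(fetch_addr)
    if entry is not None and entry["taken"]:
        return entry["target"]
    return fetch_addr + fetch_width

# Example table: a branch at 0x1000 predicted taken, target 0x2000.
btb = {0x1000: {"taken": True, "target": 0x2000}}
```

In this model a BTB hit redirects the fetch stream before the branch instruction itself is ever decoded, which is the point of the mechanism.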
The branch prediction using the above described table is described in detail in various documents and patent applications. For example, Japanese Patent Laid-Open No. 6-324865 ('Multiprediction-type Branch Prediction Mechanism', Japanese Patent Application No. 3-252872) discloses a technology that uses both branch prediction in the decoding process and branch prediction in the instruction fetching process. That document also introduces various prior art technologies relating to branch prediction.
A program having a loop structure can usually be executed quickly with branch prediction, because an instruction that branched previously is predicted to branch again; in a loop, this prediction is correct on every iteration except the exit from the loop. Prior art technologies that emphasize quick execution of loops include the following.
Japanese Patent Application Laid-Open No. 7-73104 ('Cache System', Japanese Patent Application No. 6-117056)
Published Japanese Translation of PCT International Publication for Patent Application No. 10-510076 ('Limited Run Branch Prediction', Japanese Patent Application No. 8-518876)
Japanese Patent Application Laid-Open No. 10-333906 ('Branch Prediction Apparatus', Japanese Patent Application No. 9-139736)
Japanese Patent Application Laid-Open No. 4-101219 ('Instruction Branch Prediction System', Japanese Patent Application No. 2-218655)
Japanese Patent Application Laid-Open No. 63-141132 ('Instruction Prefetch Apparatus', Japanese Patent Application No. 61-289363)
Japanese Patent Application Laid-Open No. 1-271842 ('Information Processing Apparatus', Japanese Patent Application No. 63-100818)
The above described technologies are generally classified as follows.
(1) To quickly fetch a branch target instruction
(2) To correctly predict the behavior of a loop
(3) To reduce the penalty when a branch prediction fails.
FIG. 1A shows the general configuration of a common pipeline computer. When an instruction fetch address and a fetch request are transmitted from an instruction fetch control unit 11 to an instruction memory access pipeline (instruction fetch pipeline) 12, an instruction code is fetched based on the address and supplied to an instruction buffer 13. The instructions stored in the instruction buffer 13 are passed to an instruction execution pipeline 14, which starts executing them.
The instruction execution pipeline 14 executes operation instructions, makes a branch decision when a branch instruction is detected, and transmits to the instruction fetch control unit 11 the information about whether or not the branch is taken (branch taken information), the branch target address, etc. Under the control of the instruction fetch control unit 11, instruction fetching continues from the new instruction fetch address.
FIG. 1B shows the configuration of the circuit of the instruction fetch control unit 11. The circuit shown in FIG. 1B includes incrementers 21 and 22, a branch target table (BTB) 23, a comparison circuit (CMP) 24, an AND circuit 25, selection circuits (selector, SEL) 26 and 27, and a prefetch program counter (PFPC) 28.
When branch prediction is not performed in instruction fetch control, the signal D_TA (branch target address of the instruction processed at the decoding stage), the BTB 23, the CMP 24, the AND circuit 25, and the 2-input selector 26 are not required.
In this case, the first fetch address is selected by the selection circuit 27 through a signal PC, and enters the PFPC 28. The address held in the PFPC 28 is the instruction fetch address, and is transmitted to the instruction memory access pipeline 12 shown in FIG. 1A. At the same time, the incrementer 22 receives the output signal of the PFPC 28, adds 8 to it, and outputs the resultant address.
If the signal indicating that a branch is taken (TAKEN, described later) is not asserted (i.e., the signal is not a logic '1'), then the output of the incrementer 22 is selected at the next clock, and that address is fetched. Thus, consecutive instructions are fetched. Here, it is assumed that an instruction occupies 4 bytes and that two instructions are fetched simultaneously.
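The selector and PFPC feedback described above amount to a simple clocked update, sketched below under the stated assumptions (4-byte instructions, two fetched per clock):

```python
def pfpc_next(pfpc, taken, e_ta):
    """One clock of the prefetch program counter (PFPC) update.

    When TAKEN is asserted, the branch target address E_TA is
    selected; otherwise the incrementer output PFPC + 8 (two 4-byte
    instructions) is selected, producing sequential fetching.
    """
    return e_ta if taken else pfpc + 8
```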
FIG. 1C shows the circuit of the instruction execution pipeline 14 shown in FIG. 1A. The circuit shown in FIG. 1C is provided with the BTB 23, registers 31, 33, 34, 35, 36, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 53, 54, 55, 56, 57, 58, and 59, a general purpose register (GR) 32, an arithmetic and logic unit 37, a decoder (DEC) 40, a branch decision circuit (BRC_DECISION) 50, and an adder 52.
In FIG. 1C, the characters D, A, E, and W indicating the process stages of instructions respectively correspond to a decode stage (D stage), an address computation stage (A stage), an execution stage (E stage), and a write stage (W stage).
In this example, two instructions are executed in a pipeline operation. The preceding instruction is a comparison instruction (CMP). This instruction reads two values to be compared with each other from the general purpose register 32, performs a comparing operation using the arithmetic and logic unit 37 at the E stage, and writes the resultant condition code (CC) to the register 38.
The subsequent instruction is a branch instruction (BRC). This instruction computes the branch target address at the D stage according to the signal D_PC (a program counter of the instruction processed at the D stage), the branch target address offset portion (OFFS) in the branch instruction, and the constant of 4.
In a decoding operation, a signal D_BRC indicating whether or not an instruction is a branch instruction, and a signal D_CND indicating the condition of branching are generated, managed as pipeline tags, and transferred to the subsequent stage. Since the CC of the result of the preceding comparison instruction is used in a branch decision, a branch decision for the subsequent branch instruction is made at the E stage. In this example, the branch decision is made by the branch decision circuit 50 using the condition of the CC selected by the E_CND.
When a branch is taken, the signal TAKEN from the branch decision circuit 50 is asserted, the signal E_TA (the branch target address of the instruction processed at the E stage) is selected as shown in FIG. 1B, and the address is input to the PFPC 28.
The address E_TA is the address obtained by adding up the signal D_PC, the offset, and 4 at the D stage of the branch instruction as shown in FIG. 1C, managed as a pipeline tag, and transferred to the E stage. That is, in this example, the branch target address of a branch instruction is defined as the sum of the value of the program counter indicating the subsequent instruction and the offset in the branch instruction.
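The target address computation at the D stage can be written out directly; this sketch simply restates the sum described above (4-byte instructions assumed):

```python
def branch_target(d_pc, offs):
    """Branch target address as defined in this example:
    the PC of the subsequent instruction (D_PC + 4) plus the
    offset field (OFFS) taken from the branch instruction."""
    return d_pc + 4 + offs
```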
Described below is the operation performed when a branch is predicted using a part of the instruction referred to as a HINT bit. Prediction according to the HINT bit speeds up processing by exploiting the fact that the branches of some instructions can be correctly predicted with high probability before the instructions are actually executed.
For example, there is a high possibility that a branch is actually taken in a branching operation used to repeat a loop using a DO statement of Fortran or a for statement of the C language, while there is a small possibility that a branch is actually taken in a branching operation used in a determination for an exit from the loop. This information is embedded as a HINT bit in a part of the instruction code, and a branch target fetch is invoked during decoding, before the condition is determined from the CC, etc., thereby speeding up the process.
Thus, in FIG. 1B, if the signal D_TA is used and the signal D_HINT decoded at the D stage shown in FIG. 1C is asserted, then the signal D_TA is selected, updating the PFPC 28. As a result, the branch target instruction can be fetched two clocks earlier than by switching at the E stage.
However, if it turns out in the branch decision at the E stage that the branch prediction was incorrect, then E_PC+4 output from the incrementer 21 is selected, and fetching resumes from the instruction immediately after the branch instruction. Other methods of handling an incorrect branch prediction include prefetching the unpredicted instructions and executing them, preliminarily executing the subsequent instructions for the case where the branch is not taken, etc.
Whether or not the branch has been correctly predicted is determined by comparing the HINT information (E_HINT), transferred to the E stage by the pipeline tag, with the signal TAKEN generated from the CC at the E stage. For example, it is determined that the branch prediction has failed if E_HINT=1 and TAKEN=0.
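The misprediction check above reduces to comparing the hint with the actual outcome. The sketch below uses the general disagreement test, of which the E_HINT=1, TAKEN=0 case named in the text is one direction; treating the opposite direction as a failure as well is an assumption of the sketch.

```python
def hint_mispredicted(e_hint, taken):
    """A HINT-based prediction fails when the hint and the actual
    branch outcome (TAKEN) disagree. The text names the case
    E_HINT=1, TAKEN=0 (early target fetch invoked, branch not
    taken); the symmetric case is also treated as a failure here.
    """
    return e_hint != taken
```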
Described below is the operation performed when a branch is predicted using the BTB 23. In the prediction using the HINT bit described above, the branch prediction is invoked at an early stage. However, since it is not invoked until the instruction reaches the instruction execution pipeline 14 shown in FIG. 1A, the branch target instruction fetch can be moved up by only two clocks. To invoke the branch target instruction fetch at an even earlier timing, the branch target must be fetched at the stage controlled by the instruction fetch control unit 11 shown in FIG. 1A.
To attain this, a branch prediction mechanism for obtaining a branch target address from the fetch address is added. The BTB 23 shown in FIG. 1C caches the branch target address as if it were an instruction cache for the main storage. When a branch instruction is executed, the address of the branch instruction and the address of the branch target instruction are entered in the BTB 23.
FIG. 1D shows the circuit of the instruction memory access pipeline 12 containing the BTB 23. The instruction memory access pipeline 12 includes a P (priority) stage, a T (table) stage, a C (check) stage, and an R (result) stage. In addition to the BTB 23 shown in FIG. 1C, it contains registers 60 and 61, an instruction cache (I-cache) 62, a register group 63, and a selection circuit (Way sel) 64.
The BTB 23 is a 2-port RAM (random access memory) including a read port which receives a read address stored in the register 60, and a write port which receives a write address stored in a register 58 shown in FIG. 1C. FIG. 1D shows only the circuit of the read system of the BTB 23.
The instruction buffer 13 includes a register group 65 and a selection circuit 66. A register 67 corresponds to the register 31 or 39 shown in FIG. 1C.
When an instruction is fetched, the instruction cache 62 is searched using the address from the PFPC 28, and the BTB 23 is searched in the same way. If the address of the PFPC 28 is entered in the BTB 23, then the address stored in the PFPC 28 is changed to the branch target address held in the BTB 23. Thus, when the instruction is fetched, the instruction fetch address is switched to the branch target address.
When an actual branching operation is performed in the instruction execution pipeline 14, the signal E_BRC is asserted and the BTB 23 is updated. At this time, as shown in FIG. 1C and in FIG. 9 described later, the branch prediction information NW_PRD (Next Write_Prediction) is generated by the branch decision circuit 50 from the signal E_PRD, transferred using a pipeline tag, and the signal TAKEN. Then, the signal NW_PRD is written to the BTB 23 together with the signals E_PC and E_TA. The above is the outline of the operation of typical conventional branch prediction.
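The BTB update step can be sketched as follows. The document does not specify how NW_PRD is encoded, so a 2-bit saturating counter (a common choice) is assumed here purely for illustration.

```python
def update_btb(btb, e_pc, e_ta, e_prd, taken):
    """Update the BTB entry for a branch executed at address E_PC.

    NW_PRD is derived from the old prediction state E_PRD and the
    actual outcome TAKEN. A 2-bit saturating counter (0..3, predict
    taken when >= 2) is assumed; the document leaves the exact
    NW_PRD encoding unspecified.
    """
    nw_prd = min(e_prd + 1, 3) if taken else max(e_prd - 1, 0)
    # E_PC, E_TA, and NW_PRD are written together, as in the text.
    btb[e_pc] = {"target": e_ta, "prd": nw_prd, "taken": nw_prd >= 2}
    return nw_prd
```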
However, the technology of the above described conventional branch prediction has the following problem.
In method (2) above, the loop iteration count is stored in a register, and the exit from the loop is predicted from that count and the pattern of the loop. However, there are loops for which such a prediction cannot be made. For example, a loop process may be performed to await the completion of an operation of an input/output device (IO), the completion of barrier synchronization among a plurality of processor elements (PEs), or the completion of main storage access.
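Such an event-wait loop can be illustrated as a simple polling loop. The polling function below is hypothetical; the point is that the iteration count depends on an external event rather than on any count held in a register, so trip-count based loop predictors cannot anticipate the exit.

```python
def wait_for_event(poll):
    """Spin until an external event is observed.

    `poll` is a hypothetical function sampling external state
    (e.g. an IO completion flag or a barrier-synchronization flag).
    The number of iterations is unknowable in advance.
    """
    iterations = 0
    while not poll():          # backward branch: taken until the event
        iterations += 1
    return iterations
```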
FIG. 1E shows an example of a parallel computer which awaits an event. The parallel computer shown in FIG. 1E includes n+1 processor elements PE 0, PE 1, . . . , PE n. Each PE includes a central processing unit (CPU) 71, memory (main storage) 72, an input/output device (IO) 73, a vector unit (VU) 74, and a synchronization circuit (PE Sync) 75.
The CPU 71, the memory 72, the IO 73, the vector unit 74, and the synchronization circuit 75 are interconnected through an address bus 76 (A) and a data bus 77 (D). The CPU 71 and the vector unit 74 are connected to each other through a vector command bus 78 (VC) and a vector data bus 79 (VD). In addition, the CPU 71 includes a branch unit (BRU) 81, an access control unit 82, and a vector interface control unit 83.
The IO 73 performs an operation relating to the input and output of data. The CPU 71 awaits the completion of the operation as necessary.
The synchronization circuit 75 is provided for synchronizing the PEs and can be accessed by the CPU 71 as a part of the IO (including a memory mapped IO) or using an exclusive instruction. When each PE completes its assigned job, its CPU 71 writes data indicating the completion of the event to the synchronization circuit 75. The written information is propagated to all PEs through a signal line 84 connecting the synchronization circuits 75 of the PEs. When a predetermined condition is satisfied, the synchronization wait among the PEs can be released.
The vector unit (VU) 74 is provided to perform computing processes quickly, and contains a vector operations unit. The vector operations unit can quickly perform consecutive operations, etc., but cannot perform all processes. Therefore, a computer referred to as a vector computer is provided not only with the vector unit 74 but also with the CPU 71 having a common configuration. The CPU 71 is referred to as a scalar processor or a scalar unit (SU) to indicate explicitly that it performs no vector processes.
When the result obtained by the vector unit 74 is further processed by the SU 71, or when data prepared by the SU 71 is received by the vector unit 74 for processing, a mechanism for synchronizing the data between the two units is necessary. If there is a small volume of data, the vector data bus 79 can be used. However, if a large volume of data is transmitted and received, the synchronizing process is performed through the memory 72.
When the computation result of the vector unit 74 is received by the SU 71, the result is written to the memory 72, and the SU 71 must await the completion of the writing before executing a memory read instruction. Likewise, when the computation result of the SU 71 is received by the vector unit 74 through the memory 72, the result is written to the memory 72, and the VU 74 must await the completion of the writing before executing a memory read instruction. In either case, the completion of writing to the memory 72 is awaited.
Thus, awaiting the completion of an event is an important element in improving the performance of a computer. When an event is awaited in a loop process, both the performance of the branch that repeats the loop and the performance of exiting the loop when the awaited event occurs are required.
When there is a branch for awaiting an event, the loop iteration count is not set in a register, and the count is not constant from one execution to the next. Therefore, the conventional technology either cannot make a prediction or makes an incorrect one. In practice, since most predicting methods predict that a branch is taken if it has been taken frequently in recent executions, the branch prediction fails at the exit from the loop, thereby slowing the exit from the loop, which is the essential element in awaiting an event.
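The failure at the loop exit can be demonstrated with a common 2-bit saturating counter. This predictor is an assumption of the sketch, used only to illustrate the general behavior described above: after many taken iterations the counter saturates at "taken" and inevitably mispredicts the fall-through at the exit.

```python
def run_predictor(outcomes, state=0):
    """Run a 2-bit saturating counter (predict taken when state >= 2)
    over a sequence of actual branch outcomes; return the number of
    mispredictions."""
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Saturating update toward the actual outcome.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions
```

Running this on an event-wait pattern (many taken iterations, then one fall-through) always charges a misprediction at the exit, which is exactly the latency-critical point.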
In addition, as regards the branch condition flag, a conventional computer determines the sign of, or the result of comparing, two values based on the result of a preceding operation. Therefore, when awaiting a condition such as the completion of an operation of the IO 73 external to the CPU 71 or the completion of barrier synchronization, the external information is first referred to by a load instruction (in the case of memory mapped IO) or an IO instruction referring to an external signal, the external state is then reflected in the condition flag by a logic operation, etc., and only then is the branching operation performed.
Therefore, although the branching operation itself can be performed quickly, the loop cannot be executed in a time shorter than the sum of the external information reference time and the logic operation time. The wait therefore cannot be released (the event cannot be sampled) in a time shorter than that sum, which prolongs the waiting time for release of the loop.
That is, the conventional method incurs both the cost of the sampling time for release of the wait and the cost of the penalty (the additional time required to recover from a mispredicted branch) at the exit from the loop, making a quick waiting process impossible.
The present invention aims at providing an apparatus and a method of processing information that improve the prediction of the behavior of an event-awaiting branch loop in a pipeline process containing branch prediction, thereby speeding up the exit from the loop.
The information processing apparatus according to the present invention includes a detection circuit and a suppression circuit, and performs a pipeline process containing a branch prediction. The detection circuit detects that an instruction is a branch instruction for awaiting an event. The suppression circuit suppresses the branch prediction for the branch instruction when the detection circuit detects the branch instruction for awaiting the event.
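The detection and suppression described above can be sketched in software as follows. This is an illustrative model only; in particular, how the detection circuit identifies an event-awaiting branch (modeled here as a flag supplied by the caller) is an assumption of the sketch, not a statement of the claimed implementation.

```python
def predict_branch(btb, fetch_addr, is_event_wait_branch):
    """Return the predicted next fetch address.

    When the detection circuit flags the instruction as a branch
    instruction for awaiting an event, the suppression circuit
    forces sequential fetch (no branch prediction), so the loop
    exit incurs no misprediction penalty. Otherwise the BTB
    prediction is used as usual (8-byte sequential fetch assumed).
    """
    if is_event_wait_branch:
        return fetch_addr + 8        # prediction suppressed
    entry = btb.get(fetch_addr)
    if entry is not None and entry["taken"]:
        return entry["target"]
    return fetch_addr + 8
```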