1. Field of the Invention
The present invention relates to a speculative instruction execution control device and a method for the same. In particular, it relates to technology that, in a processor having a prediction unit for speculatively executing computer instructions to increase processing speed, uses a simple hardware configuration to easily implement value prediction verification control for speculative instruction execution and recovery control when a prediction error occurs.
2. Description of Related Art
Speculative instruction execution is used as a method for increasing computer speed, namely processor speed. In speculative execution, an instruction is executed speculatively in accordance with a prediction made by a prediction mechanism in the computer. Various methods have been proposed, and such speculative execution is generally used in pipelined processors.
A first example of speculative execution is branch prediction, that is, prediction of which instruction is the branch destination of a branch instruction. The instruction that the processor will execute (i.e., fetch) next is determined by the result of executing the branch instruction; branch prediction predicts this instruction so that it can be fetched without waiting for the branch instruction to execute.
When branch prediction is not performed, after fetching a branch instruction the processor cannot fetch the subsequent instruction until the execution result of the branch instruction is obtained. Accordingly, in a pipelined processor, a bubble, namely a cycle in which no processing is performed, occurs between the branch instruction and the instruction that follows it. Such bubbles lower the processing speed of the processor.
Meanwhile, when branch prediction is performed, the subsequent instruction is fetched without waiting for the branch instruction to execute, so if the prediction is correct, bubbles are reduced and the processing speed of the processor increases. However, if the branch prediction is incorrect (i.e., a prediction error occurs), the speculatively fetched instructions are ones that should not have been executed, so they must be disabled and the correct instructions re-fetched. The number of cycles required to disable these instructions and re-execute the correct instructions when a prediction error occurs is referred to as the prediction error penalty.
The overhead (the drop in processing speed of the processor) due to this prediction error penalty often exceeds the overhead that arises when branch prediction is not performed, because extra cycles may be needed to disable the speculatively executed instructions. In recent years, processors have aimed to reduce this overhead, and thus increase processing speed, by improving the accuracy of branch prediction.
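The trade-off between the bubble that prediction avoids and the prediction error penalty it risks can be illustrated with a simple expected-cost model. The following Python sketch assumes, for illustration only, that a branch without prediction always costs a fixed bubble of `bubble_cycles`, that a correct prediction costs nothing, and that every misprediction costs a fixed `penalty_cycles`; the function names and numbers are hypothetical and are not taken from this description.

```python
def stall_without_prediction(bubble_cycles: int) -> float:
    """Assumed model: every branch waits for its result, paying the bubble."""
    return float(bubble_cycles)


def stall_with_prediction(accuracy: float, penalty_cycles: int) -> float:
    """Assumed model: a correct prediction costs nothing; each
    misprediction costs the full prediction error penalty."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * penalty_cycles


def break_even_accuracy(bubble_cycles: int, penalty_cycles: int) -> float:
    """Accuracy above which predicting costs fewer cycles than not predicting."""
    if penalty_cycles <= 0:
        return 0.0  # with no penalty, predicting never costs more
    # (1 - a) * penalty < bubble  <=>  a > 1 - bubble / penalty
    return max(0.0, 1.0 - bubble_cycles / penalty_cycles)
```

Under these assumptions, with a hypothetical 2-cycle bubble and 5-cycle penalty, prediction pays off only above 60% accuracy; a 50%-accurate predictor averages 2.5 stall cycles per branch, worse than the 2 cycles of not predicting. This is why a large penalty must be offset by high prediction accuracy.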
A second example of speculative execution is load value prediction, that is, prediction of the load value, which is the data value to be loaded into the processor by a load instruction. One of the obstacles to increasing processor speed is that instructions that access data, such as load and store instructions, have longer delay times than instructions that perform arithmetic or logical operations. In recent years, progress has been made in shortening the processing time of arithmetic and logical operations, while reductions in the time required to access data in memory have lagged behind; as a result, the gap between the two has widened. It should be noted that memory herein may be a main memory located external to the processor or, in a processor that includes cache memory (also simply referred to as a cache), the cache memory.
For example, suppose a program contains an arithmetic operation instruction that uses data loaded by a load instruction. Because this arithmetic operation instruction depends on the load instruction, it cannot be executed until the load instruction has obtained the value to be loaded. Accordingly, the longer a load instruction takes to execute, the longer the instructions that depend on it are delayed, and as a result, the processing speed of the processor drops.
In load value prediction, the data value to be loaded by a load instruction is predicted, and this predicted value is used by instructions subsequent to the load instruction (a. load value prediction). This makes it possible to speculatively execute subsequent instructions that depend on the load instruction, using the predicted value, while the load instruction itself is still executing (b. speculative execution of subsequent instructions).
In the case where the load value prediction is correct, that is, where the predicted load value equals the actual execution result of the load instruction, the speculatively executed subsequent instructions can be completed as soon as the load instruction is completed. Accordingly, compared with the case where load value prediction is not performed, the processor may save up to the number of cycles required to execute the speculatively executed subsequent instructions. On the other hand, as with other prediction mechanisms such as branch prediction, in the case where a load value prediction is incorrect, the speculatively executed subsequent instructions must be disabled and re-executed, which incurs a prediction error penalty (c. prediction verification, d. disabling and re-execution of instructions).
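The cycle saving and the penalty described above can be made concrete with a minimal timing model. This sketch assumes a load with a fixed latency of `load_latency` cycles, followed by dependent instructions that take `dependent_cycles` cycles in total, and a fixed `flush_penalty` for disabling and re-fetching; all of these parameters are illustrative assumptions rather than figures from this description.

```python
from typing import Optional


def completion_cycle(load_latency: int, dependent_cycles: int,
                     prediction_correct: Optional[bool],
                     flush_penalty: int = 0) -> int:
    """Cycle by which a load and its dependent instructions can all complete.

    prediction_correct=None  -> no load value prediction
    prediction_correct=True  -> dependents ran speculatively alongside the load
    prediction_correct=False -> dependents are flushed and re-executed
    """
    if prediction_correct is None:
        # Dependents must wait for the loaded value, then execute.
        return load_latency + dependent_cycles
    if prediction_correct:
        # Dependents overlap with the load; all complete once both are done.
        return max(load_latency, dependent_cycles)
    # Misprediction: wait for the load, pay the flush, then re-execute.
    return load_latency + flush_penalty + dependent_cycles
```

With a hypothetical 4-cycle load and 2 cycles of dependent work, the model gives cycle 6 without prediction, cycle 4 with a correct prediction, and cycle 9 with a misprediction and a 3-cycle flush.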
Various algorithms have been proposed for performing load value prediction. Because the value that a load instruction reads from memory may be known only when the program is executed, predicting it when the program is compiled (static prediction) is difficult. Accordingly, load values are predicted using information obtained during program execution (dynamic prediction). Load value prediction is based on the observation that the same load instruction, in other words a load instruction at the same address in the program, has a statistically high probability of loading the same value. Since the same load instruction is likely to reload a value that it loaded in the past, a correct predicted value may be obtained with high probability.
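The statistical property this relies on can be measured on an execution trace. The sketch below assumes a hypothetical trace of `(pc, value)` pairs, one per executed load, and counts how often a load at a given program counter loads the same value as on its previous execution.

```python
from typing import Dict, Iterable, Tuple


def last_value_hit_rate(trace: Iterable[Tuple[int, int]]) -> float:
    """Fraction of repeated loads (same PC) that reload their previous value."""
    last: Dict[int, int] = {}  # load PC -> value it loaded last time
    hits = 0
    repeats = 0
    for pc, value in trace:
        if pc in last:
            repeats += 1
            if last[pc] == value:
                hits += 1
        last[pc] = value
    return hits / repeats if repeats else 0.0
```

For the hypothetical trace `[(0x100, 7), (0x104, 1), (0x100, 7), (0x104, 2), (0x100, 7)]`, three loads are repeats and two of them reload their previous value, so the function returns 2/3.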
Next, an example of a conventional processor configuration for executing load value prediction is described.
FIG. 1 shows an example of a processor configuration having a load value prediction unit.
Main memory 121 is a memory located external to the processor 110 and stores instructions (programs) and data.
An instruction fetch unit (IFU) 111 includes a program counter (PC) 111b and fetches an instruction (D11) into the processor 110 from the address in main memory 121 specified by the value of the program counter 111b (instruction fetch). The fetched instruction is temporarily held in an instruction window buffer (IWB) 112.
The instruction window buffer 112 can hold a plurality of instructions, and issues each instruction to the appropriate unit: execution unit 114a or 114b, or a load/store unit 113.
The load/store unit (LSU) 113 and execution units (EX) 114a and 114b each execute the operations assigned to them.
The load/store unit 113 executes load instructions and store instructions. The execution units 114a and 114b perform operations other than load and store instructions, such as, but not limited to, integer operations, logical operations, and floating-point operations.
A register file (RF) 117 is the group of registers included in the processor. The register file 117 supplies the necessary operands (operation values) to the load/store unit 113 and execution units 114a and 114b, and writes the operation results from those units into its registers.
A commit unit (CMT) 118 controls the timing at which an operation result is written into the register file 117 (or into memory, in the case of a store instruction), based on a comparison result (D14) input from a comparator 120. Program execution results must be written into the register file 117 or memory in program order. One reason is that when values are written to the same memory address, the final contents depend on the order of the writes. Another is that, even when the final result is the same, intermediate results that conform to program order are needed if an exception occurs.
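The in-order commit requirement can be sketched in software as follows. This is a hypothetical model, not the circuit of the commit unit 118: each instruction receives a sequence number in program order when dispatched, results may arrive in any order, and registers are written only from the oldest instruction forward.

```python
from collections import OrderedDict
from typing import List

_PENDING = object()  # marks an instruction whose result has not arrived yet


class InOrderCommit:
    """Hypothetical commit model: results may complete out of order, but are
    written to the architectural registers strictly in program order."""

    def __init__(self) -> None:
        self.window = OrderedDict()  # seq -> [dest register, result]
        self.registers = {}

    def dispatch(self, seq: int, dest: str) -> None:
        # Instructions enter in program order.
        self.window[seq] = [dest, _PENDING]

    def complete(self, seq: int, value: int) -> None:
        # An execution unit reports a result, in any order.
        self.window[seq][1] = value

    def commit(self) -> List[int]:
        """Commit from the oldest instruction until one is still pending."""
        committed = []
        while self.window:
            seq, (dest, value) = next(iter(self.window.items()))
            if value is _PENDING:
                break  # the oldest result is missing, so nothing younger commits
            self.registers[dest] = value
            del self.window[seq]
            committed.append(seq)
        return committed
```

If instructions 0 and 2 both write S1 and instruction 2 finishes first, its result is held until instructions 0 and 1 have committed, so S1 ends with the value of the later write in program order, as the program requires.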
A load value prediction unit (LPU) 116 predicts the value to be loaded by a load instruction and transmits the predicted value (D12) to a value prediction buffer 119 and the register file 117. Various algorithms have been proposed for load value prediction; however, since the present invention does not presuppose a specific load value prediction algorithm and may be applied regardless of which one is used, no specific algorithm is described herein. As one example, the value loaded by a load instruction may be stored in association with the address (program counter value) of that load instruction, and when the same load instruction is fetched again, the previously loaded and stored value is used as the predicted value.
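The example algorithm described above (often called last-value prediction) could be modeled as a table indexed by the program counter of the load. The following sketch assumes a direct-mapped table of `entries` slots, 4-byte instructions, and the full PC as a tag; the class name, table size, and indexing scheme are illustrative assumptions, since this description does not fix a prediction algorithm.

```python
from typing import List, Optional, Tuple


class LastValuePredictor:
    """Hypothetical direct-mapped last-value predictor indexed by load PC."""

    def __init__(self, entries: int = 256) -> None:
        self.entries = entries
        # Each slot holds (pc tag, last loaded value), or None if empty.
        self.table: List[Optional[Tuple[int, int]]] = [None] * entries

    def _index(self, pc: int) -> int:
        # Assumes 4-byte instructions, so the low two PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> Optional[int]:
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None  # no history: the load is not a prediction target

    def update(self, pc: int, loaded_value: int) -> None:
        # Remember the value actually loaded, for the next execution.
        self.table[self._index(pc)] = (pc, loaded_value)
```

With only four entries, the loads at 0x100 and 0x110 map to the same slot, so recording 0x110 evicts the history for 0x100 and its next prediction is `None`. A larger table reduces such conflicts at the cost of more storage.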
A value prediction buffer (VPB) 119 is a buffer that stores the predicted load values input from the load value prediction unit 116. A predicted value stored in the value prediction buffer 119 is later compared by the comparator 120 with the value (D13) actually loaded through execution of the load instruction.
The comparator 120 compares a load prediction value predicted by the load value prediction unit 116 with a value actually loaded by the load instruction.
Next, operations of instruction execution and load value prediction in the processor in FIG. 1 are described.
(1) Operation When Load Value Prediction is not Performed
1. The instruction fetch unit 111 fetches an instruction from the main memory 121 in conformity with the program counter 111b. 
2. The fetched instruction is temporarily held in the instruction window buffer 112.
3. The instruction window buffer 112 issues an issuable instruction to an execution unit 114a, 114b, or load/store unit 113 appropriate for executing the instruction.
4. The register file 117 supplies the operands necessary for instruction execution to the load/store unit 113 or execution unit 114a or 114b.
5. The load/store unit 113 or execution unit 114a or 114b executes the instruction. In other words, when the instruction is a load instruction, the load/store unit 113 loads data from the main memory 121. When the instruction is a store instruction, the load/store unit 113 prepares to write data to the main memory 121. For other instructions, the execution unit 114a or 114b executes the operation.
6. The commit unit 118 controls the timing of writes to the register file 117 or main memory 121. In other words, when the instruction is a load instruction, the loaded data is written to a register of the register file 117. When the instruction is a store instruction, data is written to the main memory 121. When the instruction is a branch instruction, the program counter 111b in the instruction fetch unit 111 is updated in accordance with the result of the branch instruction. For other instructions, the operation result is written to a register of the register file 117.
It is assumed herein that each of processing steps 1 through 6 above is performed every processor cycle. Generally, many processors have a pipelined structure in which each such step is performed every cycle.
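Assuming that one instruction enters per cycle, that each of steps 1 through 6 takes exactly one cycle, and that no stalls occur (all simplifying assumptions), the overlap of the steps can be sketched as follows. The stage labels are hypothetical names for steps 1 through 6.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical labels for processing steps 1-6 of the sequence above.
STAGES = ("fetch", "window", "issue", "operands", "execute", "commit")


def pipeline_schedule(n_instructions: int,
                      stages: Sequence[str] = STAGES
                      ) -> Dict[int, List[Tuple[int, str]]]:
    """Ideal pipeline with no stalls: instruction i is in stage s at cycle i + s."""
    schedule: Dict[int, List[Tuple[int, str]]] = {}
    for i in range(n_instructions):
        for s, stage in enumerate(stages):
            schedule.setdefault(i + s, []).append((i, stage))
    return schedule
```

For three instructions the schedule spans 3 + 6 - 1 = 8 cycles; in cycle 2, instruction 0 is being issued while instruction 1 is held in the window and instruction 2 is being fetched.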
(2) Operation When Load Value Prediction is Performed
1. The instruction fetch unit 111 fetches an instruction from the main memory 121 in accordance with the program counter 111b.
2. The load value prediction unit 116 determines whether or not the fetched instruction is a load instruction subject to prediction. If it is, the predicted load value is written to the register into which the result of the load instruction is to be written.
3. The fetched instruction is temporarily held in the instruction window buffer 112.
4. The instruction window buffer 112 issues a load instruction to the load/store unit 113.
5. The register file 117 supplies operands required for load instruction execution to the load/store unit 113.
6. The instruction window buffer 112 issues the instructions subsequent to the load instruction, which can use the predicted value.
7. After the result of the load instruction is obtained, the comparator 120 compares the value actually loaded with the predicted value.
8. When the comparator 120 produces the prediction verification result: In the case where the prediction is correct, the load instruction and the subsequent instructions that were speculatively executed become committable, and the commit unit 118 writes their execution results to the registers or main memory 121 and commits (completes) the instructions at the appropriate timing. In the case where the prediction is incorrect, the actually loaded value is written to the register into which the execution result of the load instruction is to be written, and the load instruction is committed (completed). However, all subsequent instructions are disabled, and these instructions are fetched and executed again.
It should be noted that the above-mentioned processing, which disables instructions and starts over from instruction fetch, is referred to as a “flush”. In addition, the processing that becomes necessary when a prediction is incorrect is referred to as “recovery”.
FIG. 2 is a flowchart illustrating recovery processing that becomes necessary when this load value prediction is incorrect.
Here, it is assumed that the group of instructions in FIG. 3 is given as a specific instruction sequence. It should be noted that S1, S2, S3, and S10 in FIG. 3 indicate register names (register numbers).
To begin with, the instruction fetch unit 111 fetches the lw (load word), addi, and add instructions in that order (step S21), and the instruction window buffer 112 holds them in the same order. Here, when the fetched lw instruction is a prediction target (step S22Y), the load value prediction unit 116 writes the predicted load value into register S1 (the register in the register file 117 corresponding to S1 in the instruction) (step S23).
The instruction window buffer 112 determines whether an instruction is issuable. Here, issuable means that the necessary operands are available and the execution unit to which the instruction is to be issued is available, that is, no other instruction is using it. It should be noted that the instruction window buffer 112 herein is assumed to use in-order issue, meaning that instructions are issued in the order in which they were fetched. Accordingly, the instruction window buffer 112 issues an instruction when the earliest-fetched instruction among those it holds becomes issuable, and the issued instruction is removed from the buffer. The instruction window buffer 112 is controlled so that this earliest-fetched instruction is always located at its head.
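The in-order issue rule can be sketched as follows. The `Instr` record, the unit names `"LSU"` and `"EX"`, and the sets of ready registers and free units are hypothetical simplifications (for example, execution units 114a and 114b are collapsed into a single `"EX"` class); only the head of the window is ever considered for issue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Set, Tuple


@dataclass
class Instr:
    name: str
    unit: str                 # hypothetical unit class: "LSU" or "EX"
    sources: Tuple[str, ...]  # source register names


def issue_in_order(window: Deque[Instr], ready_regs: Set[str],
                   free_units: Set[str]) -> Optional[Instr]:
    """Issue at most one instruction; only the head of the window may issue."""
    if not window:
        return None
    head = window[0]
    if all(r in ready_regs for r in head.sources) and head.unit in free_units:
        free_units.discard(head.unit)  # the unit is now busy
        return window.popleft()        # issued instructions leave the window
    return None  # the head stalls, so younger instructions must also wait
```

If S10 is not yet available, nothing issues even when S1 is ready, because the lw instruction at the head blocks the addi behind it. Treating S1 as ready before the load completes corresponds to the predicted value having been written into S1 in step S23.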
In the example of FIG. 3, if the lw instruction is located at the head of the instruction window buffer 112, the value in register S10 is available, and the load/store unit 113 is available, the lw instruction is issued to the load/store unit 113 (step S24). The load/store unit 113 then loads the value from the main memory 121.
While the load/store unit 113 is executing the load instruction, the addi instruction that follows it is issued to the execution unit 114a or 114b. This is possible because the addi instruction can use the predicted load value as the value of register S1 and therefore need not wait for the execution result of the load instruction. Similarly, the following add instruction is issued to the execution unit 114a or 114b. It should be noted that if load value prediction were not performed, the addi and add instructions would not be issued from the instruction window buffer 112 to the execution unit 114a or 114b until the execution result of the lw instruction was obtained.
When the load/store unit 113 obtains the loaded value as the execution result of the load instruction (step S25Y), the comparator 120 compares the predicted load value held in the value prediction buffer 119 with this actually loaded value and notifies the commit unit 118 of the comparison result (step S27).
When the two match, that is, when the load value prediction is correct (step S27Y), the load instruction is committed (completed) (step S28), the subsequent addi and add instructions are committed (completed) as their operation results are obtained (step S29), and those results are written into registers S2 and S3.
Meanwhile, when the two do not match, that is, when the load value prediction is incorrect (step S27N), the value actually loaded by the load instruction overwrites the register corresponding to S1 in the register file 117, the load instruction is committed (step S30), and the addi and add instructions following the load instruction are disabled (step S31). The instruction fetch unit 111 then restarts fetching from the addi instruction (step S32).
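The flow of FIG. 2 for this sequence can be sketched as follows. The operands of FIG. 3 are not reproduced in this text, so the sketch assumes a hypothetical sequence `lw S1, 0(S10)`, `addi S2, S1, 4`, `add S3, S2, S1`, models memory as a dictionary, and holds speculative results aside until commit; the step labels in the comments refer to FIG. 2.

```python
from typing import Dict, Tuple


def execute_dependents(s1: int) -> Tuple[int, int]:
    """Hypothetical addi S2, S1, 4 followed by add S3, S2, S1."""
    s2 = s1 + 4
    s3 = s2 + s1
    return s2, s3


def run_sequence(memory: Dict[int, int], regs: Dict[str, int],
                 predicted: int) -> Tuple[Dict[str, int], bool]:
    """Return the committed registers and whether recovery (a flush) occurred."""
    regs = dict(regs)
    regs["S1"] = predicted                    # S23: prediction written to S1
    spec_s2, spec_s3 = execute_dependents(regs["S1"])  # speculative execution
    actual = memory[regs["S10"]]              # S24, S25: the load completes
    if actual == predicted:                   # S27Y: prediction correct
        regs["S2"], regs["S3"] = spec_s2, spec_s3  # S28, S29: commit all
        return regs, False
    regs["S1"] = actual                       # S30: overwrite S1, commit lw
    # S31, S32: speculative results are discarded; addi and add are
    # fetched again and re-executed with the correct value of S1.
    regs["S2"], regs["S3"] = execute_dependents(regs["S1"])
    return regs, True
```

A correct and an incorrect prediction end in the same register state; only the incorrect one pays for disabling and re-executing the addi and add instructions, which is the prediction error penalty.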
Mikko H. Lipasti, et al., “Exceeding the Dataflow Limit via Value Prediction” (IEEE, 1072-4451/96, 1996, p. 226-237) is incorporated herein by reference.
However, conventional value prediction, such as load value prediction, has the following problems that need to be solved.
First, hardware costs increase because of the control required for prediction verification and for recovery from prediction errors. In load value prediction, whether the predicted value is correct is verified only after the value actually loaded through load instruction execution is obtained. If the value prediction is incorrect, the subsequent instructions that were speculatively executed must be re-executed using the correct values. Specialized hardware, such as the value prediction buffer 119 and comparator 120 shown in FIG. 1, becomes necessary for this verification and recovery.
If an in-order issuing processor such as that described with reference to FIG. 1 and FIG. 2 is used, the order in which the load value prediction unit 116 predicts load values matches the order in which the load/store unit 113 executes the load instructions, so a first-in first-out (FIFO) buffer suffices as the value prediction buffer 119. With an out-of-order issuing processor, however, these two orders differ, so the value prediction buffer must use hardware with a more complicated configuration, such as a content-addressable memory (CAM).
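The difference between the two buffer organizations can be sketched as follows. Each prediction is recorded under a hypothetical `tag` identifying its load (for example, a sequence number), and a Python `dict` stands in for the associative search of a CAM, so this models behavior only, not hardware cost.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class FifoVPB:
    """FIFO value prediction buffer: predictions must be verified in the
    same order in which they were made (sufficient for in-order issue)."""

    def __init__(self) -> None:
        self.queue: Deque[Tuple[int, int]] = deque()

    def record(self, tag: int, predicted: int) -> None:
        self.queue.append((tag, predicted))

    def verify(self, tag: int, loaded: int) -> bool:
        oldest_tag, predicted = self.queue[0]  # only the oldest entry is visible
        if oldest_tag != tag:
            raise RuntimeError(f"load {tag} completed before load {oldest_tag}")
        self.queue.popleft()
        return predicted == loaded


class AssociativeVPB:
    """CAM-like buffer: any entry can be located by its tag, so loads may
    complete in any order (as with out-of-order issue)."""

    def __init__(self) -> None:
        self.entries: Dict[int, int] = {}

    def record(self, tag: int, predicted: int) -> None:
        self.entries[tag] = predicted

    def verify(self, tag: int, loaded: int) -> bool:
        # Search by tag, compare, then free the entry.
        return self.entries.pop(tag) == loaded
```

When a younger load completes first, the FIFO buffer cannot find its prediction without first consuming the older entry, whereas the associative buffer locates it directly; in hardware, that associative search is what makes the CAM more expensive.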
Second, implementing prediction verification and recovery from prediction errors requires functions that need specialized control, for example, determining whether a prediction error has occurred and which instruction must be re-executed first when one has occurred. As such functions increase, the control unit becomes more complicated, and implementing and verifying it becomes more difficult. In particular, since recovery processing affects many function blocks, it tends strongly to become the critical path in design, and if the control unit that performs recovery processing is complicated, that alone can prevent the processor from meeting its timing constraints.