1. Field of the Invention
This invention generally relates to a computer-instruction pipeline and, more particularly, to a system and method for improving speculative global history predictions associated with the use of a pipeline.
2. Description of the Related Art
FIG. 1 is a schematic block diagram depicting a processor pipeline (prior art). In central processing units of general purpose computers, micro-architectures have evolved that permit the simultaneous processing of multiple instructions for the purpose of increasing computational performance. One technique is to pipeline the various steps needed to execute an instruction, allowing many instructions to be in various stages of execution simultaneously. As noted in U.S. Pat. No. 6,938,151, the basic dataflow for an instruction is: instruction fetch (IF), decode (ID), cache access (IC), execute (EX), and result write back (WB). Each stage within the pipeline must occur in order. In order to achieve high performance, one new instruction enters the pipeline every cycle, and each instruction in the pipeline moves to a new stage. Each stage takes inputs and produces outputs, which are stored in an output buffer associated with the stage. One stage's output buffer is typically the next stage's input buffer. Such an arrangement permits all the stages to work in parallel and therefore yields a greater throughput than if each instruction had to pass through the entire pipeline before the next instruction could enter the pipeline. When the pipeline is delayed or has to be cleared, latency is created in the processing of each instruction in the pipeline.
There are many dependencies between instructions that prevent the optimal case of a new instruction entering the pipeline every cycle. These dependencies add latency to the pipeline. One category of latency contribution deals with branches. A conditional branch is an instruction which can either fall through to the next sequential instruction (branch not taken), or branch off to another instruction address (branch taken), and carry out the execution of a different series of code. Conditional branches take the branch (that is, start executing at the target address instead of the next sequential address) if the condition evaluates to true. For example, in the Power architecture, the compare instruction compares two operands and sets the condition register. Then a following branch-if-equal instruction would branch if the two operands are equal. The compare tests the condition and the “conditional” branch resolves to “taken” if the condition is true. The result of the taken/not taken issue is called a “resolution”.
At decode time, the branch is detected. Latency in the pipeline is created while waiting for the resolution of the branch. This latency is compounded as each instruction in the pipeline is delayed waiting for the resolution of the branch.
To overcome these latencies, the direction of the branch can be predicted such that the pipeline begins decoding, based upon the assumption that the branch is either taken or not taken. At branch resolution time, the prediction is compared to the actual results (the resolution). If the prediction is correct, latency has been minimized. If the prediction is incorrect, then decoding has proceeded down the improper path and all instructions in this path behind that of the incorrect prediction must be flushed out of the pipeline. The pipeline is restarted at the correct instruction address to begin decoding the resolved branch direction. Correct predictions minimize latency, while incorrect predictions, because of the flushing and restart operations, add greater latency than simply waiting for the branch to resolve before decoding. Thus, latency is significantly improved by making correct branch predictions.
In order to improve the accuracy of the predictions, a Branch History Table (BHT) can be implemented, which permits future predictions to be based upon the resolution of previous branch instructions. There are many algorithms for dynamically predicting a branch. One approach is to maintain a 2-bit saturating counter. Each counter has 4 states encoded as 00: strongly not taken, 01: weakly not taken, 10: weakly taken, 11: strongly taken, see Table 1. If the BHT tracks a branch either strongly not taken or weakly not taken, and the branch is resolved not taken, then the state becomes strongly not taken.
TABLE 1
Predictor value meaning

Counter value    Prediction
00               Strongly NOT Taken
01               Weakly NOT Taken
10               Weakly Taken
11               Strongly Taken
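For illustration only, the counter behavior summarized in Table 1 can be sketched in Python as follows. The names predict_taken and update_counter are hypothetical, and the transitions toward the taken states are assumed to mirror the not-taken case described above (the conventional saturating scheme). The sketch is not drawn from any particular implementation.

```python
# Counter encodings follow Table 1.
STATE_NAMES = {
    0b00: "strongly not taken",
    0b01: "weakly not taken",
    0b10: "weakly taken",
    0b11: "strongly taken",
}


def predict_taken(counter: int) -> bool:
    """Predict 'taken' when the counter is in either taken state (10 or 11)."""
    return counter >= 0b10


def update_counter(counter: int, taken: bool) -> int:
    """Step one state toward the resolved direction, saturating at 00 and 11.

    Assumption: taken resolutions move the counter up symmetrically to the
    way not-taken resolutions move it down.
    """
    if taken:
        return min(counter + 1, 0b11)
    return max(counter - 1, 0b00)
```

Under these assumptions, a counter in state 01 that sees a not-taken resolution moves to 00, and a counter already in 00 stays there, which matches the case described in the text.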
Further, there are different algorithms associated with the way that each table is indexed, which have profound differences on prediction accuracy. For branches which close off loops, the prediction will be correct (X−1)/X of the time, where X is the number of times the loop is processed. An indexing scheme that uses the branch instruction address works very well in such a situation. In the cases of IF/THEN/ELSE branch structures, where the direction has a higher level of conditional-based information, XOR'ing the address at which the branch occurs with the pattern of the last N predictions provides a higher level of accuracy. This indexing scheme may be referred to as a Global History Prediction with Index Sharing, or Gshare (Scott McFarling, “Combining Branch Predictors”, Western Research Laboratory Technical Note TN-36, 1993). Gshare predicts whether a branch is taken according to a historic index based on the instruction address, and is useful because branch instructions sometimes have the tendency to correlate to other nearby instructions.
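The (X−1)/X figure for loop-closing branches can be checked with a small simulation. The following Python sketch is a hedged illustration: the function name loop_branch_accuracy and the loop_runs parameter are hypothetical, a single 2-bit saturating counter is assumed to be dedicated to the branch, and the counter is assumed to start in the strongly taken state (that is, already warmed up).

```python
def loop_branch_accuracy(iterations: int, loop_runs: int = 100) -> float:
    """Return the fraction of correct predictions for a loop-closing branch.

    The branch is taken on every pass except the last one (the loop exit).
    Assumption: one 2-bit saturating counter, starting at 0b11.
    """
    counter = 0b11
    correct = 0
    total = 0
    for _ in range(loop_runs):
        for i in range(iterations):
            taken = i < iterations - 1      # not taken only at loop exit
            predicted = counter >= 0b10     # high bit set means predict taken
            correct += predicted == taken
            total += 1
            # Saturating update toward the resolved direction.
            counter = min(counter + 1, 0b11) if taken else max(counter - 1, 0b00)
    return correct / total
```

With iterations set to 10, only the exit branch of each run is mispredicted, so the sketch yields 9 correct predictions out of every 10.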
FIG. 2 is a schematic block diagram depicting a Gshare scheme for indexing a BHT (prior art). The BHT is indexed by a number which is formed from hashing the address of the branch instruction with the value in a Global History Shift Register (GHSR). If the BHT contains 2^N predictors, N bits are needed to select a predictor. The GHSR is typically a shift register of M bits, where M is usually less than N. When the branch is resolved, the value in the GHSR is updated by shifting in a “1” if the branch is taken and a “0” if the branch is not taken. The effect is to form a pattern of 1's and 0's which reflect the directions taken by the M most recent branches. This number (GHSR) is exclusive-OR'ed with the branch address, typically the lower order address bits, to form the N-bit index. This index is then used to select the predictor in the BHT.
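The index formation and GHSR update described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names gshare_index and shift_ghsr are hypothetical, the lowest N address bits are used directly (a real design may first drop alignment bits), and the new outcome is assumed to shift into the least significant bit of the GHSR.

```python
def gshare_index(branch_address: int, ghsr: int, n_bits: int) -> int:
    """XOR the GHSR with the low-order address bits to form an N-bit index."""
    mask = (1 << n_bits) - 1
    return (branch_address ^ ghsr) & mask


def shift_ghsr(ghsr: int, taken: bool, m_bits: int) -> int:
    """Shift in 1 for a taken branch or 0 for a not-taken branch.

    Only the M most recent outcomes are kept. Assumption: the newest
    outcome occupies the least significant bit.
    """
    mask = (1 << m_bits) - 1
    return ((ghsr << 1) | int(taken)) & mask


# Example with N = 10 and M = 8 (illustrative values only).
index = gshare_index(0x1234, 0b10110011, n_bits=10)   # 0x287
history = shift_ghsr(0b10110011, True, m_bits=8)      # 0b01100111
```

Because M is less than N in this example, the GHSR affects only the low 8 bits of the index, and the upper 2 index bits come from the address alone.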
A problem arises when this scheme is used in a pipelined micro-architecture. As shown in FIG. 1, instruction execution in a pipelined processor is divided into a number of stages, where a specific operation occurs in each stage. In the example, instruction fetch occurs in stage 0. To fetch an instruction, a fetch address must be calculated in the Instruction Fetch (IF) stage. This address is used to access the instruction from the instruction cache or memory in the IC stage. The instruction is decoded in the ID stage. The register values are read in the RS stage and the instruction is scheduled for execution in the Sch stage. The instruction is actually executed in the EX stage and the result value is written back to the register file in the WB stage.
In this type of pipeline, the instruction type is not known until it reaches the decode stage (ID). Further, the branch is not resolved until it reaches the EX stage, when it is executed. This leads to two problems which arise when implementing the Gshare branch prediction scheme: 1) the latency between the branch prediction and the instruction fetch stage and 2) the necessity to wait until the EX stage, when the branch is resolved, to update the GHSR and the prediction counters.
If the BHT is accessed in stage 1 (IC), at the same time as the instruction cache access, hardware can be added to calculate the branch target address in the next stage (RS). If the branch prediction indicates that the branch should be taken, the instructions which are in earlier stages of the pipeline are discarded, and the branch target address is used in the IF stage to start a new prefetch stream. The discarded instructions represent work that is lost because of the latency in the pipeline between the IF stage and the branch prediction stage. The number of stages between these operations should be minimized for efficiency. Since the GHSR value is needed to form the index into the BHT, and the value is formed using the outcome of the branch instruction, BHT access, and thus the branch prediction, must conventionally wait until the branch is resolved and the GHSR is updated. In the example pipeline of FIG. 1, the branch resolution is determined in stage 5 (EX). Thus, the redirection of the fetch stream has 5 stages of latency (from stage 0 to stage 5) if the branch prediction is incorrect, causing a further loss of efficiency.
One solution to this problem is to use a GHSR value that is formed later in the pipeline to predict the branch instructions in the RS stage. This approach would mean that there might be one or more branches in the pipeline which have not yet contributed to updating the GHSR. In other words, any given prediction would be based on “old” information. This approach leads to a lower prediction accuracy.
Another solution to this problem is to “speculatively” update the GHSR, and, possibly, the BHT. In this case, speculative means that the branch prediction formed by accessing the BHT is used to speculatively update the GHSR with a taken/not taken value. Of course, the branch may ultimately be resolved to disagree with the prediction. This result is called a branch mispredict. If the branch is mispredicted, then the GHSR has been updated with erroneous information. Similarly, if the BHT entries are updated with predicted branch information that is later found to be erroneous, it too is corrupted.
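The corruption described above can be illustrated with a short Python sketch that compares a speculatively updated GHSR against one updated only with resolved outcomes. The names shift_in and speculative_vs_resolved are hypothetical, an 8-bit register is assumed, and the sketch only demonstrates the problem. It does not represent any correction mechanism.

```python
def shift_in(ghsr: int, taken: bool, m_bits: int = 8) -> int:
    """Shift one taken (1) or not-taken (0) bit into an M-bit history."""
    return ((ghsr << 1) | int(taken)) & ((1 << m_bits) - 1)


def speculative_vs_resolved(predictions, outcomes, m_bits: int = 8):
    """Return (speculative GHSR, resolved GHSR) after a branch sequence.

    The speculative register shifts in each prediction, while the resolved
    register shifts in each actual outcome. Both start at zero (assumption).
    """
    speculative = 0
    resolved = 0
    for predicted, actual in zip(predictions, outcomes):
        speculative = shift_in(speculative, predicted, m_bits)
        resolved = shift_in(resolved, actual, m_bits)
    return speculative, resolved


# The second branch is mispredicted (predicted taken, resolved not taken).
spec, real = speculative_vs_resolved(
    predictions=[True, True, False, True],
    outcomes=[True, False, False, True],
)
# spec == 0b1101 and real == 0b1001: one history bit is now erroneous.
```

In this example the single mispredict leaves a wrong bit in the speculative history, and that bit remains in the register until M further branches have shifted it out, so every index formed in the meantime is affected.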
It would be advantageous if the information in a GHSR could be purged and corrected when branch predictions are found to be false at branch resolution time.