1. Field of the Invention
This invention generally relates to a computer-instruction pipeline and, more particularly, to a system and method for updating global history predictions associated with a pipeline using only “taken” branch predictions.
2. Description of the Related Art
In the central processing units of general purpose computers, micro-architectures have evolved that permit the simultaneous processing of multiple instructions, for the purpose of increasing computational performance. Modern high-performance computers often employ one or more of three techniques to increase performance. Pipelined execution permits instruction execution to be partitioned into a series of sub-operations, each of which requires less time to execute than the entire instruction. Further, multiple instructions can be in different stages of execution, to allow parallel execution of instructions.
Superscalar execution is the technique of providing extra hardware to allow the execution of multiple instructions in parallel in any given stage of the pipeline. Out-of-order execution is a technique that permits many instructions to be in execution, and allows instructions to be executed in an order determined by their data dependencies, rather than the order in which they occur in the program. In general, this process permits instructions that would normally be stalled waiting for earlier instructions to complete, to bypass the “waiting” instructions.
FIG. 1 is a schematic block diagram depicting an exemplary superscalar, out-of-order pipeline (prior art). Instruction execution is divided into a number of stages, and a specific operation occurs in each stage. In the example, instruction fetch occurs in stage 0. To fetch an instruction, a fetch address must be calculated in the Instruction Fetch (IF) stage. This address is used to access the instruction (or multiple instructions, in the case of superscalar machines) from the instruction cache or memory in the IC stage. The instructions are decoded in the ID stage. The register values are read in the RS stage and possibly placed into a structure such as a reservation station to await the availability of other operands. The instructions which are ready for execution are scheduled for execution in the Sch stage. The instruction is actually executed in the EX stage. Finally, the instruction is reordered into program order and committed to the architectural resources in the WB stage. In addition, there may be queues between the stages such as an instruction buffer between the IC and ID stages.
Each stage within the pipeline must occur in order. In order to achieve high performance, one new instruction enters the pipeline every cycle, and each instruction in the pipeline moves to a new stage. Each stage takes inputs and produces outputs, which are stored in an output buffer associated with the stage. One stage's output buffer is typically the next stage's input buffer. Such an arrangement permits all the stages to work in parallel and therefore yields a greater throughput than if each instruction had to pass through the entire pipeline before the next instruction can enter the pipeline. When the pipeline is delayed or has to be cleared, latency is created in the processing of each instruction in the pipeline.
In this type of architecture, many instructions can be “in flight”; that is, in various stages of execution, simultaneously. If a stream of instructions has no branches, the pipeline can be filled and never needs to be flushed. The time a program takes to execute is then a function of the number of instructions, and the number of clock cycles each instruction takes to execute. Branch instructions disturb the flow of instructions through the pipeline. A conditional branch instruction includes some condition which, if satisfied, causes the program flow to change. Branch instructions typically have dependencies on other instructions which determine the condition. Branch instructions occur roughly every 5-8 instructions in typical programs.
Once a branch instruction is fetched, other instructions are typically fetched from sequential addresses following the branch. If the branch is “taken”, the fetch path must be “re-directed” to fetch instructions from the branch target address. However, instructions that sequentially follow the branch instruction and have already been fetched must now be flushed from the pipeline, and new instructions must be fetched. This flushing and filling operation results in “bubbles” in the pipeline in which no work is done, lowering performance.
In order to reduce the number of pipeline bubbles, some architectures use a technique called “branch prediction” in which a conditional branch is predicted to be “taken” or “not taken” before the branch is actually executed. This reduces the delay between the fetching of the branch and the re-direction of the fetch stream. To the extent that the branch prediction is accurate, the pipeline stalls are reduced or eliminated.
One technique for accomplishing branch prediction is called Global History Prediction with Index Sharing or Gshare, for short. The Gshare technique uses a structure called a Branch History Table (BHT) to store predictors. Each predictor is a counter that is updated when a branch instruction is executed (resolved). If the branch is “taken”, the counter is incremented and if it is “not taken”, it is decremented. The counter is saturating, such that when it reaches its maximum value, incrementing it again will not change the value and if it reaches its minimum value, decrementing it will not change the value. Typically, the counters are 2 bits in width, though this width is a design choice. The value of the predictor is used to decide if the branch prediction is “taken” or “not taken”. The intuition is that if the branch has been “taken” frequently in the recent past, then it will be “taken” in the future and vice versa. Given this, if 2-bit counters are used, Table 1 shows the prediction for each value. Each counter has 4 states encoded as 00: strongly “not taken”, 01: weakly “not taken”, 10: weakly “taken”, and 11: strongly “taken”. If the BHT tracks a branch as either strongly “not taken” or weakly “not taken”, and the branch is resolved “not taken”, then the state becomes strongly “not taken”.
TABLE 1
Predictor value meaning

Counter value    Prediction
00               Strongly NOT Taken
01               Weakly NOT Taken
10               Weakly Taken
11               Strongly Taken
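The saturating counter behavior described above and summarized in Table 1 can be sketched as follows. This is a minimal illustration only, assuming 2-bit counters; the names COUNTER_BITS, update_counter, and predict_taken are hypothetical and do not appear in the conventional art described here.

```python
# Illustrative sketch of a 2-bit saturating predictor (names are hypothetical).
COUNTER_BITS = 2                        # assumed width; the text notes this is a design choice
COUNTER_MAX = (1 << COUNTER_BITS) - 1   # 3, i.e. binary 11 (strongly taken)


def update_counter(counter: int, taken: bool) -> int:
    """Increment on a taken branch, decrement on a not-taken branch, saturating at both ends."""
    if taken:
        return min(counter + 1, COUNTER_MAX)
    return max(counter - 1, 0)


def predict_taken(counter: int) -> bool:
    """Predict taken when the counter is in the upper half of its range (10 or 11)."""
    return counter >= (1 << (COUNTER_BITS - 1))
```

For example, a counter at 01 (weakly not taken) that sees a not-taken resolution moves to 00, and a counter already at 11 stays at 11 on a further taken resolution.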
The BHT is indexed by a number which is formed by hashing the address of the branch instruction with the value in the Global History Shift Register (GHSR) or Global History Register (GHR). If the BHT contains 2^N predictors, N bits are needed to select a predictor.
FIG. 2 is a schematic block diagram depicting a Gshare scheme for indexing a BHT (prior art). The GHSR is a shift register of M bits, where M is usually less than N. When a branch instruction is executed, the decision to take the branch or not is made and the branch is said to be “resolved”. When the branch is resolved, the value in the GHSR is updated by shifting in a 1 if the branch is taken and a 0 if the branch is not taken. The effect is to form a pattern of 1's and 0's that reflects the directions taken by the M most recent branches. This number is combined with the branch address to form the N-bit index; typically, the lower-order address bits are exclusive-ORed (XOR'd) with the GHSR. This index is then used to select the predictor in the BHT.
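The index formation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the widths N = 10 and M = 8 are chosen arbitrarily, the instructions are assumed to be 4-byte aligned (so the two lowest address bits are dropped before hashing), and the names shift_ghsr and bht_index are hypothetical.

```python
# Illustrative sketch of Gshare indexing (widths, alignment, and names are assumptions).
N = 10  # index width: the BHT is assumed to hold 2**N predictors
M = 8   # GHSR width, with M less than N


def shift_ghsr(ghsr: int, taken: bool) -> int:
    """Shift in a 1 for a taken branch or a 0 for a not-taken branch, keeping M bits."""
    return ((ghsr << 1) | int(taken)) & ((1 << M) - 1)


def bht_index(branch_addr: int, ghsr: int) -> int:
    """XOR the lower-order address bits with the GHSR to form the N-bit BHT index."""
    low_addr_bits = (branch_addr >> 2) & ((1 << N) - 1)  # assumes 4-byte instructions
    return low_addr_bits ^ ghsr


# Usage: record the outcomes taken, not taken, taken, taken, then form an index.
ghsr = 0
for outcome in (True, False, True, True):
    ghsr = shift_ghsr(ghsr, outcome)
index = bht_index(0x40001234, ghsr)  # ghsr is 0b1011 here; index is 134
```

Because the history is XOR'd with the address rather than concatenated, the same branch can select different predictors depending on the outcomes of the branches that preceded it.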
Processor micro-architectures with extremely long pipelines, which allow many instructions in flight, have a problem caused by the fact that there may be many branch instructions fetched and placed in the instruction queue, which have not yet been identified as branches and predicted. This is the fetch-to-re-direct delay, and it arises as follows. The type of the instruction is not known until it reaches the decode stage (ID). Further, the branch is not resolved until it reaches the EX stage, when it is executed. This leads to two problems which arise when implementing the Gshare branch prediction scheme:
1. the latency between the instruction fetch stage and the branch prediction stage causes a delay in the time between when the branch prediction is made and the time when a new address is available for the fetch stage; and,
2. the latency between the branch prediction stage and the branch resolution stage causes a delay in the update of the GHSR.
The first problem is solved by ensuring that branches are always fetched using the correct GHSR value, which is accomplished by updating the GHSR on “taken” branches and flushing pre-fetched instructions in the event of redirection. The second problem is solved by speculatively updating the GHSR. The newly-created problem of having a “bad” GHSR responsive to the mis-prediction is solved by “repairing” the GHSR with a non-speculative value being tracked during branch resolution.
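The speculative update and repair described above can be sketched as follows. This is a simplified illustration, not a definitive implementation: it assumes that branches resolve in program order, that every branch shifts a bit into the history, and that all younger in-flight branches are flushed on a mis-prediction. The class GlobalHistory and its method names are hypothetical.

```python
# Illustrative sketch of speculative GHSR update with repair (all names are hypothetical).
class GlobalHistory:
    def __init__(self, bits: int = 8):
        self.mask = (1 << bits) - 1
        self.speculative = 0  # updated at prediction time, used to index the BHT
        self.committed = 0    # non-speculative value tracked during branch resolution

    def on_predict(self, predicted_taken: bool) -> None:
        """Shift the predicted direction into the speculative history."""
        self.speculative = ((self.speculative << 1) | int(predicted_taken)) & self.mask

    def on_resolve(self, predicted_taken: bool, actual_taken: bool) -> None:
        """Shift the actual direction into the committed history; repair on a mis-prediction."""
        self.committed = ((self.committed << 1) | int(actual_taken)) & self.mask
        if predicted_taken != actual_taken:
            # Younger branches are assumed flushed, so the speculative history
            # is overwritten with the non-speculative value.
            self.speculative = self.committed
```

In this sketch, two branches predicted “taken” leave the speculative history ending in binary 11; if the first one resolves as “not taken”, the speculative history is replaced by the committed history, which ends in 0.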
In the conventional Gshare scheme, the GHSR is used to form an index into the BHT by shifting in a value of one if the branch is “taken” and a value of zero if the branch is “not taken”. Thus, the GHSR is updated on every branch, whether “taken” or “not taken”. However, there is a problem that arises if there are multiple instructions in the pipeline between the fetch stage and the branch prediction stage, and the GHSR is updated on every branch. For every branch that is in the pipeline between these stages, the GHSR value is the same. That is, the value is updated in the prediction stage, and if the branch is predicted to be “taken”, the branches that follow it will use the same GHSR value in their predictions. This means that if a branch is “not taken”, each fetch from the time the branch is fetched until the update, uses incorrect data to look up entries in the BHT.
It would be advantageous if the amount of incorrect prediction information loaded into a GHR could be minimized.