State-of-the-art microprocessors achieve high performance by processing multiple instructions per cycle and by implementing deep pipelines. To reduce latency while executing instructions, processors implement branch predictors that predict whether a branch instruction will be taken while the condition of the branch still depends on an instruction that has not finished executing. A misprediction occurs when the branch prediction is incorrect. When a misprediction is detected, the pipeline is flushed so that execution can resume on the correct path. Pipeline flushes are a major limitation on processor performance, and the limitation is especially harsh for the deep and wide machines found in most modern processors. The time between a branch misprediction and the resumption of execution on the correct path is wasted processing instructions along the wrongly predicted path. Thus, processors that improve their branch prediction accuracy reduce mispredictions and increase overall performance by performing more work in less time.
FIG. 1 depicts a block diagram of an instruction pipeline that is known in the art. Instruction 1 is processed by pipeline 10. Instruction 2 and subsequent instructions also are processed by pipeline 10, so all instructions share the same pipeline. Pipeline 10 follows a repeated order of stages for executing the instructions, described below. Fetch 11 fetches instruction 1 from memory. Decode 12 decodes instruction 1. For example, decode 12 may determine whether instruction 1 is an add, load or branch instruction. Read 13 reads the source operand values of instruction 1, after which instruction 1 is ready to be executed. Execute 14 executes instruction 1. Write 15 writes the result of execute 14 to the memory location or register specified by instruction 1. Retire 16 retires instruction 1 and frees its resources.
Instruction 2 follows the same stages as instruction 1. Pipeline 10 uses fetch 11, decode 12, read 13, execute 14, write 15 and retire 16 to process instruction 2. Instruction 2 is one stage behind instruction 1 in pipeline 10. While instruction 1 is in the decode stage, instruction 2 is in the fetch stage. When instruction 3 is fetched, instruction 2 is in the decode stage and instruction 1 is in the read stage. Every stage is working on a different instruction at a given time. For example, instruction 1 may be ADD EAX, EBX. This instruction adds the contents of register EBX to the contents of register EAX, and stores the result in register EAX. Instruction 2 may be ADD ECX, EAX. This instruction adds the contents of register EAX to the contents of register ECX, and stores the result in register ECX. Because instruction 2 reads EAX, read 13 cannot proceed for instruction 2 until write 15 has written the result of instruction 1 to EAX, so instruction pipeline 10 waits for that value.
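The staggering of instructions across stages described above can be illustrated with a short sketch. It assumes an idealized pipeline in which each stage takes one cycle, one instruction enters per cycle, and nothing stalls; the names `STAGES`, `stage_of` and `occupancy` are hypothetical and are not taken from FIG. 1.

```python
# Stage names follow the order of pipeline 10: fetch 11 through retire 16.
STAGES = ["fetch", "decode", "read", "execute", "write", "retire"]


def stage_of(instr_number, cycle):
    """Return the stage holding instruction `instr_number` (1-based) at `cycle` (0-based).

    Assumption: instruction 1 is fetched in cycle 0, instruction 2 in cycle 1,
    and so on, with no stalls. Returns None if the instruction is not in the
    pipeline during that cycle.
    """
    position = cycle - (instr_number - 1)
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def occupancy(num_instrs, cycle):
    """Map each in-flight instruction number to the stage it occupies at `cycle`."""
    result = {}
    for n in range(1, num_instrs + 1):
        stage = stage_of(n, cycle)
        if stage is not None:
            result[n] = stage
    return result
```

Under these assumptions, `occupancy(3, 2)` places instruction 1 in read, instruction 2 in decode and instruction 3 in fetch, matching the example in the text. The sketch deliberately ignores the wait caused by the EAX dependency.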
Additional concerns arise when instruction 2 is a branch. Fetch 11 fetches instruction 2, but does not know which instruction is to be fetched next. Until the condition of branch instruction 2 is resolved, fetch 11 is stalled. Thus, if instruction 2 is BRANCH (EAX=0), GO 200, fetch 11 will not fetch any more instructions until instruction 2 is processed by execute 14. Once the condition is evaluated by the execution stage, the target of the branch is known and fetch 11 resumes. Cycles are wasted while fetch 11 waits for instruction 2 to reach execute 14 before fetching the next instruction. Modern processors seek to reduce this latency period by predicting the direction that instruction 2 will take. As discussed above, branch predictors may be used to predict whether a branch is taken.
Mispredictions occur when the wrong direction is predicted by the branch predictor. In the example above, the branch predictor may predict that instruction 2 is taken, with 200 as the probable branch target. Instruction 1, however, yields a result in which EAX does not equal 0, so the prediction for instruction 2 is a misprediction. Instructions processed after the bad fetch of the misprediction are flushed. As a result, all the work performed processing the instructions starting at address 200 is discarded, and execution resumes with the instruction sequentially following instruction 2.
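The cost of stalling on every branch versus paying only for mispredictions can be compared with simple arithmetic. The sketch below assumes one cycle per stage, that a branch is resolved in the execute stage, and that a flush costs the same number of cycles as a stall; the function names are hypothetical and the model is illustrative only.

```python
STAGES = ["fetch", "decode", "read", "execute", "write", "retire"]


def resolve_delay(stages=STAGES, resolve_stage="execute"):
    """Cycles between fetching a branch and knowing its outcome.

    Assumption: one cycle per stage, so the delay is the distance from the
    fetch stage to the stage that resolves the branch.
    """
    return stages.index(resolve_stage) - stages.index("fetch")


def wasted_cycles_stalling(num_branches):
    """Fetch stalls on every branch until the branch is resolved."""
    return num_branches * resolve_delay()


def wasted_cycles_predicting(mispredictions):
    """Fetch continues on the predicted path; only mispredictions cost cycles.

    Assumption: each flush discards exactly resolve_delay() cycles of work.
    """
    return mispredictions * resolve_delay()
```

With these assumptions, 100 branches cost 300 cycles when fetch always stalls, but only 30 cycles when 10 of the 100 predictions are wrong.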
Prediction schemes exist for implementing branch predictors to reduce the penalty associated with branch mispredictions. A branch predictor speculates on whether the branch is taken or not taken. Branch predictors generally include a target address buffer to record branch target addresses and a prediction table to deliver predicted directions. The target address buffer indicates whether the instruction at a given address is a branch and, if so, the target of the branch. The prediction table may implement a prediction scheme that facilitates an accurate prediction for the branch instruction. A taken result may be indicated by a 1, and a not taken result may be indicated by a 0.
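The role of the target address buffer can be sketched as a simple lookup structure. This is a minimal illustration, not a description of any particular hardware: the class `TargetAddressBuffer`, the function `next_fetch_address` and the fixed instruction size of 4 bytes are all assumptions made for the example.

```python
class TargetAddressBuffer:
    """Hypothetical target address buffer mapping a branch address to its target."""

    def __init__(self):
        self._targets = {}

    def record(self, branch_addr, target_addr):
        """Remember that the instruction at branch_addr is a branch to target_addr."""
        self._targets[branch_addr] = target_addr

    def lookup(self, addr):
        """Return (is_branch, target); target is None for an unknown address."""
        if addr in self._targets:
            return True, self._targets[addr]
        return False, None


def next_fetch_address(buffer, addr, predicted_taken, instr_size=4):
    """Choose the next address to fetch.

    Assumption: instructions have a fixed size, so the sequential successor
    of addr is addr + instr_size.
    """
    is_branch, target = buffer.lookup(addr)
    if is_branch and predicted_taken:
        return target
    return addr + instr_size
```

For a branch recorded at address 100 with target 200, a taken prediction (1) steers fetch to 200, while a not-taken prediction (0) continues at the sequential address.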
One scheme is the “last time” method, which simply stores one bit in the branch predictor for every branch instruction to indicate whether the branch was taken or not taken the last time it was executed. If the branch was taken last time, then the prediction is to take the branch. Another scheme is the “bimodal” method, which stores two bits for every branch (modulo the size of the predictor tables) in the branch predictor. Like the last time method, the bimodal method updates the bits depending upon the final direction of the branch instruction. A taken branch results in an increment of the related two-bit counter, while a not-taken branch results in a decrement. Counters saturate on both ends. The upper two states lead to a taken prediction, and the lower two states to a not-taken prediction.
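The two schemes above can be sketched as follows. The table size of 1024 entries, the initial table contents and the class names are assumptions chosen for illustration; the update rules follow the description in the text.

```python
class LastTimePredictor:
    """One bit per table entry: predict whatever the branch did last time."""

    def __init__(self, table_size=1024):
        self.table_size = table_size
        # Assumption: entries start as 0 (not taken).
        self.bits = [0] * table_size

    def predict(self, addr):
        return self.bits[addr % self.table_size] == 1

    def update(self, addr, taken):
        self.bits[addr % self.table_size] = 1 if taken else 0


class BimodalPredictor:
    """Two-bit saturating counter per table entry (states 0 to 3)."""

    def __init__(self, table_size=1024, initial=1):
        self.table_size = table_size
        # Assumption: entries start in state 1, the upper not-taken state.
        self.counters = [initial] * table_size

    def predict(self, addr):
        # The upper two states (2 and 3) predict taken.
        return self.counters[addr % self.table_size] >= 2

    def update(self, addr, taken):
        i = addr % self.table_size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
```

After a branch is taken three times and then not taken once, the last time sketch predicts not taken, whereas the bimodal counter has only dropped from 3 to 2 and still predicts taken. A single atypical outcome therefore does not flip the bimodal prediction.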
Another scheme is the local prediction method. The local prediction method looks at the outcomes of previous instances of the current branch. The local prediction method uses a field in the target address buffer to store bits for the last N instances of that branch. For each new prediction, the bits indicating taken/not-taken results are shifted and the new outcome is inserted. Thus, older results are moved out of the prediction field, while more recent results are stored. This method still uses a prediction table with a 1 or 2 bit scheme, as discussed above. While the bimodal scheme uses only the address of the branch instruction to index the prediction table, the local scheme uses the outcomes of past instances in addition to the address.
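A minimal sketch of the local prediction method follows. It assumes a per-branch history held in a dictionary (standing in for the field in the target address buffer), two-bit counters in the prediction table, and an index formed by concatenating address bits with the history; the text does not specify how the two are combined, so that choice and all names are hypothetical.

```python
class LocalPredictor:
    """Per-branch history of the last N outcomes selects a two-bit counter."""

    def __init__(self, history_bits=2, table_size=1024):
        self.history_bits = history_bits
        self.table_size = table_size
        self.histories = {}                 # branch address -> last N outcomes
        self.counters = [1] * table_size    # assumption: start weakly not taken

    def _index(self, addr):
        history = self.histories.get(addr, 0)
        # Assumption: address bits and history bits are concatenated.
        return ((addr << self.history_bits) | history) % self.table_size

    def predict(self, addr):
        return self.counters[self._index(addr)] >= 2

    def update(self, addr, taken):
        i = self._index(addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the history: the oldest outcome falls out, the newest enters.
        mask = (1 << self.history_bits) - 1
        history = self.histories.get(addr, 0)
        self.histories[addr] = ((history << 1) | int(taken)) & mask
```

With this sketch, a branch that strictly alternates between taken and not taken becomes fully predictable after a short warm-up, because each history pattern maps to its own counter. A bimodal counter indexed by address alone cannot track the alternation, since one counter must serve both outcomes.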
Another scheme is the global prediction method. The global prediction method looks at the outcomes of the N preceding branches. A field or register builds a history, similar to the local prediction method, but the history is of the last N branches in program order. As a branch is taken or not taken, the field or register shifts to update the history. The prediction table is indexed by both the address of the branch instruction and the content of this history register. A hybrid scheme also exists that combines the local and global prediction methods and selects which method to use. Both methods are executed, with their results input to a multiplexer. A further predictor predicts which method is likely to give the better prediction and selects the multiplexer output accordingly.
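The global and hybrid schemes can be sketched in the same style. Two choices here are assumptions rather than anything stated in the text: the address and the history register are combined with XOR (the approach commonly called gshare), and the hybrid selector is a table of two-bit counters trained only when its two component predictors disagree. All class names are hypothetical.

```python
class GlobalPredictor:
    """One shared history register of the last N branch outcomes."""

    def __init__(self, history_bits=4, table_size=1024):
        self.history_bits = history_bits
        self.table_size = table_size
        self.history = 0
        self.counters = [1] * table_size    # assumption: start weakly not taken

    def _index(self, addr):
        # Assumption: XOR of address and global history (gshare-style).
        return (addr ^ self.history) % self.table_size

    def predict(self, addr):
        return self.counters[self._index(addr)] >= 2

    def update(self, addr, taken):
        i = self._index(addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


class HybridPredictor:
    """Runs two predictors and lets a selector choose between their outputs."""

    def __init__(self, first, second, table_size=1024):
        self.first = first
        self.second = second
        self.table_size = table_size
        # Selector states 2 and 3 choose `second`; 0 and 1 choose `first`.
        self.selector = [1] * table_size

    def predict(self, addr):
        p1 = self.first.predict(addr)
        p2 = self.second.predict(addr)
        use_second = self.selector[addr % self.table_size] >= 2
        return p2 if use_second else p1     # the multiplexer

    def update(self, addr, taken):
        p1 = self.first.predict(addr)
        p2 = self.second.predict(addr)
        i = addr % self.table_size
        if p1 != p2:                        # train only on disagreement
            if p2 == taken:
                self.selector[i] = min(3, self.selector[i] + 1)
            else:
                self.selector[i] = max(0, self.selector[i] - 1)
        self.first.update(addr, taken)
        self.second.update(addr, taken)
```

Any two objects that provide `predict(addr)` and `update(addr, taken)` can be passed to `HybridPredictor`, for example a local predictor and a `GlobalPredictor`. The selector drifts toward whichever component has been right more often for a given branch.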
The methods discussed above are all based on previous branch outcomes. None of them uses misprediction data to improve prediction efficiency.