The art of pipeline processor performance improvements is vast; however, performance improvements usually come at the cost of chip area and power. One methodology for improving processor performance is the modification of a branch target buffer (BTB) used in branch target prediction where performance improvements can be made while reducing area.
A basic pipeline microarchitecture of a microprocessor processes one instruction at a time. The basic dataflow for an instruction follows the steps of: instruction fetch, decode, register read, execute, and write back. Each stage within a pipeline occurs in order, and hence a given stage cannot progress unless the stage in front of it is progressing. To achieve the highest performance for this base design, one instruction must enter the pipeline every cycle. Whenever the pipeline has to be delayed or cleared, latency is added, which in turn is reflected in the performance with which the microprocessor carries out a task. While there are many complexities that can be added to a pipeline design, this sets the groundwork for branch prediction theory.
There are many dependencies between instructions which prevent the optimal case of a new instruction entering the pipeline every cycle. These dependencies add latency to the pipeline. One category of latency contribution deals with branches. In a pipeline processor, an instruction must be decoded at every clock cycle in order to sustain the pipeline, yet the branch decision does not occur until the Execution stage (i.e., after the Instruction Fetch and Instruction Decode/Register Fetch stages). This delay in determining the proper branch to take is called a “Branch Hazard.” One solution to the Branch Hazard is to make assumptions or predictions about the branch being taken or not taken. In this context a branch is an instruction which can either fall through to the next sequential instruction, i.e., “not taken,” or branch off to another instruction address, i.e., “taken,” and carry out execution of a different series of code. At decode time, the branch is detected, and it must wait to be resolved in order to know the proper direction in which the instruction stream is to proceed. By waiting potentially multiple pipeline stages for the branch to resolve the direction to proceed, latency is added into the pipeline; this is the “Branch Hazard.”
To overcome the latency of waiting for the branch to resolve, the direction of the branch can be predicted such that the pipeline begins decoding down either the taken or the not taken path. At branch resolution time, the guessed direction is compared to the actual direction the branch is to take. If the actual branch direction and the guessed branch direction are the same, then the latency of waiting for the branch to resolve has been removed from the pipeline in this scenario. If, however, the actual branch direction and the predicted branch direction miscompare, then decoding proceeded down the improper path, all instructions in this path behind the improperly guessed branch must be flushed out of the pipeline, and the pipeline must be restarted at the correct instruction address to begin decoding the actual path of the given branch. Because of the controls involved with flushing the pipeline and beginning over, there is a penalty associated with the improper guess, and latency is added into the pipeline over simply waiting for the branch to resolve before decoding further. With a sufficiently high rate of correctly guessed paths, the latency removed from the pipeline by guessing the correct direction outweighs the latency added to the pipeline by guessing the direction incorrectly.
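The trade-off described above can be illustrated with a small worked calculation. The following is only a sketch under a simplified cost model that is an assumption of this example: a correct guess costs no cycles, a wrong guess costs the resolve latency plus a separate flush/restart penalty, and waiting always costs the resolve latency. The function names and the cycle counts used with them are hypothetical.

```python
def stall_cost(resolve_latency: int) -> float:
    """Cycles lost per branch when the pipeline simply waits for resolution."""
    return float(resolve_latency)


def predict_cost(accuracy: float, resolve_latency: int, flush_penalty: int) -> float:
    """Expected cycles lost per branch when the direction is guessed.

    Assumed model: a correct guess loses nothing; a wrong guess loses the
    resolve latency (spent decoding the wrong path) plus the flush penalty.
    """
    return (1.0 - accuracy) * (resolve_latency + flush_penalty)


def break_even_accuracy(resolve_latency: int, flush_penalty: int) -> float:
    """Guess accuracy above which predicting beats waiting under this model."""
    return flush_penalty / (resolve_latency + flush_penalty)
```

Under this model, with a hypothetical 3-cycle resolve latency and a 2-cycle flush penalty, guessing pays off once more than 40% of guesses are correct, and the expected loss per branch shrinks linearly as accuracy rises.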
One method of reducing Branch Hazard latency is the implementation of a Branch History Table (BHT) to improve the accuracy of the branch guesses. The Branch History Table facilitates direction guessing of a branch based on the past behavior of the branch. If the branch is always taken, as is the case of a subroutine return, then the branch will always be guessed as taken. Branches in IF/THEN/ELSE structures are more complex in their behavior: such a branch may be always taken, sometimes taken and sometimes not taken, or always not taken. How well the BHT predicts the direction of such a branch depends on the implementation of the dynamic branch predictor.
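As an illustration of history-based direction guessing, the sketch below models a BHT as a table of 2-bit saturating counters indexed by the low bits of the branch address. The 2-bit counter scheme, the table size, the initial counter value, and all names are assumptions of this example; the text above does not specify a particular predictor implementation.

```python
class BranchHistoryTable:
    """Toy BHT: one 2-bit saturating counter per entry (assumed scheme)."""

    def __init__(self, entries: int = 1024) -> None:
        self.entries = entries
        # Counter values 0..3: 0-1 guess not taken, 2-3 guess taken.
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * entries

    def _index(self, branch_addr: int) -> int:
        return branch_addr % self.entries

    def predict(self, branch_addr: int) -> bool:
        """Return True if the branch is guessed taken."""
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        """Move the counter toward the resolved direction, saturating at 0 and 3."""
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(bht: BranchHistoryTable, branch_addr: int, outcomes: list) -> float:
    """Fraction of outcomes guessed correctly, updating the table after each one."""
    correct = 0
    for taken in outcomes:
        if bht.predict(branch_addr) == taken:
            correct += 1
        bht.update(branch_addr, taken)
    return correct / len(outcomes)
```

With this particular scheme, an always-taken branch is mispredicted only while the counter warms up, whereas a branch that strictly alternates between taken and not taken can be mispredicted every time, which shows how strongly the result depends on the predictor chosen.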
When a branch is guessed as taken, the target of the branch is decoded. The target of the branch is acquired by making a fetch request to the instruction cache for the address which is the target of the given branch. Making the fetch request out to the cache involves minimal latency if the target address is found in the first level of cache. If there is not a hit in the first level of cache, then the fetch continues through the memory and storage hierarchy of the machine until the instruction text for the target of the branch is acquired. Therefore, any given taken branch detected at decode has at least a minimum latency associated with it that is added to the amount of time it takes the pipeline to process the given instruction. Upon missing a fetch request in the first level of the memory hierarchy, the latency penalty the pipeline pays grows the further up the hierarchy the fetch request must progress until a hit occurs. In order to hide part or all of the latency associated with the fetching of a branch target, a branch target buffer (BTB) works in parallel with a BHT.
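The growth of the fetch penalty through the hierarchy can be sketched as follows. The level names and per-level latencies are hypothetical placeholders, and the additive model (each level that is missed contributes its own access time) is an assumption of this example; real values are machine-specific.

```python
# Hypothetical hierarchy, nearest level first: (name, access latency in cycles).
HIERARCHY = [("L1", 1), ("L2", 10), ("memory", 100)]


def fetch_latency(hit_level: str) -> int:
    """Cycles to obtain the target instruction text when the first hit is at hit_level.

    Assumed model: every level probed on the way, including the one that
    hits, adds its own access latency.
    """
    total = 0
    for name, latency in HIERARCHY:
        total += latency
        if name == hit_level:
            return total
    raise ValueError("unknown hierarchy level: " + hit_level)
```

Even a first-level hit carries a nonzero cost in this model, which corresponds to the minimum latency that every taken branch detected at decode must pay.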
Given the address currently being decoded, the BTB can search from this point forward for the next instruction address which contains a branch. Along with the instruction address of each branch, the BTB stores the target of the branch in the same entry. With the target being stored, the address of the target can be fetched before the branch is ever decoded. By fetching the target address ahead of decode, the latency associated with cache misses can be hidden to the extent of the time between the fetch request and the decode of the branch.
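The search-ahead behavior can be sketched in software as a lookup over stored branch addresses. This is only an illustrative model: a hardware BTB is typically a fixed-size table indexed by address bits, whereas the sorted list used here, along with the names `BranchTargetBuffer`, `install`, `next_branch`, and `exposed_latency`, are assumptions of this example.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class BranchTargetBuffer:
    """Toy BTB: maps branch instruction addresses to their target addresses."""

    def __init__(self) -> None:
        self._branch_addrs: List[int] = []  # kept sorted for forward search
        self._targets: Dict[int, int] = {}

    def install(self, branch_addr: int, target_addr: int) -> None:
        """Record (or overwrite) the target for a branch at branch_addr."""
        if branch_addr not in self._targets:
            bisect.insort(self._branch_addrs, branch_addr)
        self._targets[branch_addr] = target_addr

    def next_branch(self, current_addr: int) -> Optional[Tuple[int, int]]:
        """Return (branch_addr, target_addr) for the first stored branch at or
        after current_addr, or None if no such entry exists."""
        i = bisect.bisect_left(self._branch_addrs, current_addr)
        if i == len(self._branch_addrs):
            return None
        addr = self._branch_addrs[i]
        return addr, self._targets[addr]


def exposed_latency(fetch_cycles: int, lead_cycles: int) -> int:
    """Fetch latency still visible to the pipeline when the target fetch is
    issued lead_cycles before the branch reaches decode."""
    return max(0, fetch_cycles - lead_cycles)
```

For example, if the target fetch is issued four cycles before the branch decodes, a hypothetical 11-cycle fetch leaves seven cycles exposed, while a one-cycle first-level hit is hidden completely.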
However, even with the Branch History Table and Static Branch Target Buffer, there is substantial latency, and, thus, a clear need exists for an apparatus and method to improve the fraction of correctly guessed paths and to thereby reduce the latency of the pipeline.