The present disclosure relates generally to branch prediction in a processor pipeline and, in particular, to a method and system for power conservation in a hierarchical branch predictor.
A basic pipeline microarchitecture of a microprocessor processes one instruction at a time. The typical dataflow for an instruction follows the steps of: instruction fetch, decode, execute, and result write back. Each stage within a pipeline or pipe must occur in order and hence a given stage cannot progress unless the stage in front of it is progressing. In order to achieve the highest performance, one instruction should enter the pipeline every cycle. Whenever the pipeline has to be delayed or cleared, latency is added, which in turn can be observed in the performance of the microprocessor as it carries out a task.
There are many dependencies between instructions that prevent the optimal case of a new instruction entering the pipe every cycle and which add latency to the pipe. One category of latency contribution deals with branches. When a branch is decoded, it can either be taken or not taken. A branch is an instruction which can either fall through to the next sequential instruction (not taken) or branch off to another instruction address (taken), and carry out execution of a different series of code. At decode time, the branch that is detected must wait to be resolved in order to know the proper direction in which the instruction stream is to proceed. Waiting for potentially multiple pipeline stages for the branch to resolve the direction in which to proceed adds latency into the pipeline. To overcome the latency of waiting for the branch to resolve, the direction of the branch can be predicted such that the pipe begins decoding either down the ‘taken’ or ‘not taken’ path. At branch resolution time, the guessed direction is compared to the actual direction the branch was to take. If the actual direction and the guessed direction are the same, then the latency of waiting for the branch to resolve has been removed from the pipeline. If, however, the actual and predicted directions miscompare, then the decoding performed down the improper path (and all instructions in this path behind that of the improperly guessed direction of the branch) must be flushed out of the pipe, and the pipe must be restarted at the correct instruction address to begin decoding the actual path of the given branch. Because of the controls involved with flushing the pipe and beginning over, there is a penalty associated with the improper guess and latency is added into the pipe, as compared to simply waiting for the branch to resolve before decoding further. 
By having a high rate of correctly guessed paths, the ability to remove latency from the pipe outweighs the latency added to the pipe for guessing the direction incorrectly.
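The trade-off described above can be illustrated with a minimal sketch. The sketch assumes a simple cost model that is not part of the disclosure: a correctly guessed branch adds no latency, an incorrectly guessed branch costs the wait for resolution plus a flush penalty, and not predicting at all always costs the wait for resolution. The function names and the cycle counts are hypothetical.

```python
def expected_latency_with_prediction(accuracy, resolve_cycles, flush_penalty):
    """Average cycles added per branch when the direction is guessed.

    Assumed model: a correct guess adds 0 cycles; a wrong guess adds the
    resolution wait plus the flush/restart penalty.
    """
    return (1.0 - accuracy) * (resolve_cycles + flush_penalty)


def break_even_accuracy(resolve_cycles, flush_penalty):
    """Accuracy at which guessing costs the same as always waiting."""
    return flush_penalty / (resolve_cycles + flush_penalty)


# Hypothetical numbers: 4 cycles to resolve, 2 extra cycles to flush.
waiting_cost = 4
guessing_cost = expected_latency_with_prediction(0.9, 4, 2)
threshold = break_even_accuracy(4, 2)
```

Under these assumed numbers, guessing is preferable to waiting whenever the accuracy exceeds the value returned by `break_even_accuracy`.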
In order to improve the accuracy of the guess associated with a branch, a branch history table (BHT) can be implemented, which allows the direction of a branch to be guessed based on the direction the branch took in the past. If the branch is always taken, as is the case of a subroutine return, then the branch will always be guessed as taken. However, IF/THEN/ELSE structures are more complex in their behavior. For example, such a branch may be always taken, sometimes taken and sometimes not taken, or always not taken.
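One common way to record past branch behavior is a table of 2-bit saturating counters. The following is a minimal sketch under that assumption; the text above does not specify the BHT organization, and the class name, table size, indexing scheme, and initial counter value are all hypothetical.

```python
class BranchHistoryTable:
    """Direction predictor using 2-bit saturating counters (assumed scheme).

    Counter values 0 and 1 guess 'not taken'; values 2 and 3 guess 'taken'.
    """

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries  # start at 'weakly not taken'

    def _index(self, address):
        # Hypothetical indexing: low-order bits of the branch address.
        return address % self.entries

    def predict(self, address):
        """Return True if the branch is guessed taken."""
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        """Move the counter toward the resolved direction, saturating at 0 and 3."""
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable(entries=16)
for _ in range(3):
    bht.update(0x1000, taken=True)   # branch resolves taken three times
guess = bht.predict(0x1000)          # guessed direction for the next encounter
```

With this scheme, a branch that is almost always taken keeps being guessed taken even after a single not-taken resolution, since one update moves the counter by only one step.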
When a branch is guessed taken, the target of the branch is to be decoded. The target of the branch is acquired by making a fetch request to the instruction cache for the address which is the target of the given branch. Making the fetch request out to the cache involves minimal latency if the target address is found in the first level of cache. However, if there is not a hit in the first level of cache, then the fetch continues through the memory and storage hierarchy of the machine until the instructions at the target of the branch are acquired. Therefore, any given taken branch detected at decode has a minimum latency associated with it that is added to the amount of time it takes the pipeline to process the given instruction. Upon missing a fetch request in the first level of the memory hierarchy, the latency penalty the pipeline pays grows the further up the hierarchy the fetch request must progress until a hit occurs. In order to hide part or all of the latency associated with the fetching of a branch target, a branch target buffer (BTB) can work in parallel with a BHT.
Given the address currently being decoded, the BTB can search for the next instruction address from this point forward which contains a branch. Along with storing the instruction addresses of branches in the BTB, the target of the branch is also stored with each entry. With the target being stored, the address of the target can be fetched before the branch is ever decoded. By fetching the target address ahead of decode, the latency associated with a cache miss can be reduced by as much as the time that elapses between the fetch request and the decode of the branch.
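The search behavior described above can be sketched in software as follows. This is only a behavioral model under stated assumptions: a hardware BTB would typically be an indexed, set-associative array, whereas this sketch uses a sorted list for clarity. The class and method names are hypothetical.

```python
import bisect


class BranchTargetBuffer:
    """Behavioral model: maps branch instruction addresses to target addresses."""

    def __init__(self):
        self._branch_addrs = []  # branch instruction addresses, kept sorted
        self._targets = {}       # branch address -> target address

    def install(self, branch_addr, target_addr):
        """Record (or overwrite) the target of a branch."""
        if branch_addr not in self._targets:
            bisect.insort(self._branch_addrs, branch_addr)
        self._targets[branch_addr] = target_addr

    def next_branch(self, current_addr):
        """Find the first known branch at or after current_addr.

        Returns (branch_addr, target_addr), or None if no branch is found.
        """
        i = bisect.bisect_left(self._branch_addrs, current_addr)
        if i == len(self._branch_addrs):
            return None
        addr = self._branch_addrs[i]
        return addr, self._targets[addr]


btb = BranchTargetBuffer()
btb.install(0x100, 0x400)
btb.install(0x140, 0x080)
hit = btb.next_branch(0x104)  # search forward from the address being decoded
```

When `next_branch` returns an entry, the stored target address is available for an early fetch request before the branch itself reaches decode.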
A hierarchical branch predictor can improve performance by balancing latency and capacity effects. Such a configuration consists of multiple predictors of varying sizes and latencies since larger structures have longer access times than smaller ones. In a hierarchical predictor, power is consumed in order to access all of the structures in parallel. However, in cases where only some levels of the hierarchy are sufficient to provide accurate branch predictions, power is wasted by accessing the unused levels of the hierarchy.
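The power concern can be made concrete with a small accounting sketch. It assumes a hypothetical fixed energy cost per access for each level and assumes that every level is accessed on every search; the function name, the two-level configuration, and the energy figures are illustrative only and are not taken from any particular design.

```python
def parallel_access_energy(hit_levels, energy_per_level):
    """Energy spent when all levels are accessed in parallel on every search.

    hit_levels: for each search, the index of the first (smallest, fastest)
        level that supplied the prediction.
    energy_per_level: assumed energy cost of one access to each level.

    Returns (total_energy, wasted_energy), where wasted energy is the cost
    of accessing levels beyond the one that supplied the prediction.
    """
    total = 0.0
    wasted = 0.0
    for hit in hit_levels:
        for level, energy in enumerate(energy_per_level):
            total += energy
            if level > hit:
                wasted += energy
    return total, wasted


# Hypothetical: three searches satisfied by level 0, one by level 1;
# the larger level 1 structure costs four times as much per access.
total, wasted = parallel_access_energy([0, 0, 0, 1], (1.0, 4.0))
```

In this illustrative case, most of the energy goes to accesses of the larger level whose results are not used.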
It would be beneficial to conserve power in a hierarchical predictor. Accordingly, there is a need in the art for an approach to perform hierarchical branch prediction processes that conserve power.