A basic pipeline microarchitecture of a microprocessor processes one instruction at a time. The basic dataflow for an instruction follows the steps of instruction fetch, decode, execute, and result write back. Each stage within a pipeline, or pipe, must occur in order, and hence a given stage cannot progress unless the stage in front of it is progressing. To achieve the highest performance, one instruction enters the pipeline every cycle. Whenever the pipeline has to be delayed or cleared, latency is added, which in turn can be observed in the performance with which the microprocessor carries out a task.
There are many dependencies between instructions which prevent the optimal case of a new instruction entering the pipe every cycle. These dependencies add latency to the pipe. One category of latency contribution deals with branches. A branch is an instruction which can either fall through to the next sequential instruction (not taken) or branch off to another instruction address (taken) and carry out execution of a different series of code. At decode time, the branch is detected, and it must wait to be resolved in order to know the proper direction in which the instruction stream is to proceed. Waiting for potentially multiple pipeline stages for the branch to resolve adds latency to the pipeline. To overcome the latency of waiting for the branch to resolve, the direction of the branch can be predicted such that the pipe begins decoding down either the taken or the not taken path. At branch resolution time, the guessed direction is compared to the actual direction the branch was to take. If the actual direction and the guessed direction are the same, then the latency of waiting for the branch to resolve has been removed from the pipeline in this scenario. If the actual and predicted directions miscompare, then decoding proceeded down the improper path, all instructions in this path behind the improperly guessed branch must be flushed out of the pipe, and the pipe must be restarted at the correct instruction address to begin decoding the actual path of the given branch. Because of the controls involved with flushing the pipe and beginning over, there is a penalty associated with the improper guess, and more latency is added to the pipe than would be added by simply waiting for the branch to resolve before decoding further.
With a high rate of correctly guessed paths, the latency removed from the pipe by guessing the correct direction outweighs the latency added to the pipe by guessing the direction incorrectly.
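The trade-off described above can be illustrated with a small arithmetic sketch. The cost model and the parameter names (`resolve_wait`, `flush_penalty`, `accuracy`) are assumptions introduced here for illustration only: a correct guess is taken to add no latency, an incorrect guess is taken to cost the resolve wait plus an additional flush penalty, and not predicting at all is taken to cost the resolve wait on every branch.

```python
def expected_predicted_latency(accuracy, resolve_wait, flush_penalty):
    """Average cycles added per branch when the direction is guessed.

    Assumed model: a correct guess adds 0 cycles; an incorrect guess adds
    the resolve wait plus an extra flush/restart penalty.
    """
    return (1.0 - accuracy) * (resolve_wait + flush_penalty)


def break_even_accuracy(resolve_wait, flush_penalty):
    """Accuracy above which guessing beats always waiting for resolution.

    Obtained by solving
    (1 - accuracy) * (resolve_wait + flush_penalty) = resolve_wait.
    """
    return flush_penalty / (resolve_wait + flush_penalty)
```

Under this assumed model, a longer resolve wait lowers the accuracy needed for prediction to pay off, while a larger flush penalty raises it.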
To improve the accuracy of the direction guess for a branch, a branch history table (BHT) can be implemented, which allows the direction of a branch to be guessed based on the direction the branch previously took. If the branch is always taken, as is the case for a subroutine return, then the branch will always be guessed as taken. IF/THEN/ELSE structures are more complex in their behavior. A branch may be always taken, sometimes taken and sometimes not taken, or always not taken. The implementation of dynamic branch prediction determines how the BHT predicts the direction of the branch.
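As one illustration of history-based direction guessing, the following is a minimal sketch of a BHT built from two-bit saturating counters. The text does not specify a counter scheme; the two-bit counter, the table size, and the modulo indexing are assumptions chosen because they are a common textbook arrangement.

```python
class BranchHistoryTable:
    """Direction predictor sketch using 2-bit saturating counters (assumed)."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        # Counter values 0-1 guess not taken, 2-3 guess taken.
        # Every entry starts at 1 ("weakly not taken").
        self.counters = [1] * num_entries

    def _index(self, branch_address):
        # Hypothetical indexing: low-order bits of the branch address.
        return branch_address % self.num_entries

    def predict(self, branch_address):
        """Return True if the branch is guessed taken."""
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        """Move the counter toward the resolved direction, saturating at 0 and 3."""
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

With this arrangement a branch that is almost always taken keeps its taken guess through a single not-taken outcome, since the counter must cross the midpoint before the guess changes.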
When a branch is guessed taken, the target of the branch is to be decoded. The target of the branch is acquired by making a fetch request to the instruction cache for the address which is the target of the given branch. The fetch request to the cache involves minimal latency if the target address is found in the first level of cache. If there is not a hit in the first level of cache, then the fetch continues through the memory and storage hierarchy of the machine until the instruction at the target address of the branch is acquired. Therefore, any given taken branch detected at decode has a minimal latency associated with it that is added to the time it takes the pipeline to process the given instruction. When a fetch request misses in the first level of the memory hierarchy, the latency penalty the pipeline pays grows with each further level of the hierarchy the fetch request must traverse until a hit occurs. To hide part or all of the latency associated with fetching a branch target, a branch target buffer (BTB) can work in parallel with a BHT.
Given the address currently being decoded, the BTB can search for the next instruction address from this point forward which contains a branch. Along with the instruction addresses of branches, the target of each branch is also stored in its BTB entry. With the target being stored, the address of the target can be fetched before the branch is ever decoded. By fetching the target address ahead of decode, the latency associated with cache misses can be reduced by as much as the time between the fetch request and the decode of the branch.
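The forward search described above can be sketched in software as follows. This is a simplified model, not a hardware description: it assumes an unbounded table kept in sorted order and ignores capacity, associativity, and replacement. The class and method names are hypothetical.

```python
import bisect


class BranchTargetBuffer:
    """Software model of a BTB searched forward from the current address."""

    def __init__(self):
        self._addresses = []  # sorted instruction addresses of known branches
        self._targets = {}    # branch address -> stored target address

    def install(self, branch_address, target_address):
        """Record a branch and its target, replacing any earlier target."""
        if branch_address not in self._targets:
            bisect.insort(self._addresses, branch_address)
        self._targets[branch_address] = target_address

    def next_branch(self, current_address):
        """Return (branch_address, target) for the first known branch at or
        after current_address, or None if no such branch is stored."""
        i = bisect.bisect_left(self._addresses, current_address)
        if i == len(self._addresses):
            return None
        branch_address = self._addresses[i]
        return branch_address, self._targets[branch_address]
```

A fetch unit modeled on this sketch could issue the request for the returned target as soon as `next_branch` finds an entry, before decode reaches the branch itself.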
With respect to branch targets, a branch can have either a single constant target or multiple changing targets. The branch that creates a for loop, for example, has a single target. A subroutine call, likewise, has one target, the address of the subroutine. The RETURN of the subroutine can have multiple targets over its usage, because the RETURN branches back to the next sequential instruction in whichever code stream called the subroutine. Placing multiple targets in a BTB proves ineffective for two reasons. First, the BTB is searched by branch address, so there is no way to determine which target should be selected: a branch with multiple targets is still one branch and hence is always located in the BTB with the same information, its instruction address. Second, the BTB is uniform and hence holds the same amount of information for each entry. To create multiple target fields in the BTB would be to create them for every single entry, which would be a massive overuse of silicon area and power. Given a single target per BTB entry, most likely the last known target for the given branch is stored in the BTB. While other schemes could be used for the target address within the BTB, a single-target entry can only predict the target correctly for one target of the given branch.
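The limitation of storing only the last known target can be seen in a short sketch. The function below is a hypothetical helper that counts how many target predictions would be correct for one branch, given the sequence of targets it actually resolved to; the addresses in the usage note are made up for illustration.

```python
def last_target_hits(actual_targets):
    """Count correct predictions for one branch when its BTB entry
    always holds the most recently resolved target."""
    stored = None  # no entry until the branch has resolved once
    hits = 0
    for target in actual_targets:
        if stored == target:
            hits += 1
        stored = target  # overwrite with the last known target
    return hits
```

A loop branch that always resolves to the same address is predicted correctly on every encounter after the first, whereas a RETURN reached alternately from two call sites, for example `[0x124, 0x254, 0x124, 0x254]`, is never predicted correctly by this scheme.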
By creating a side Multiple Target Table (MTT) which contains only target branches and is addressed based on code path, the BTB can have a single additional bit per entry which states whether to predict the target via the BTB or to override the BTB with a guess via the MTT; thereby allowing multiple predictable targets for a branch to exist. Correctly identifying which target is the correct prediction based on the path that was taken to the given branch will allow for higher accuracy of predicted targets on multi-target branches and, hence, remove latency from the pipe and increase system performance.
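The arrangement described above can be sketched as follows. This is an illustrative model under stated assumptions, not the claimed implementation: the path is assumed to be the addresses of the last few branches encountered, the MTT index is an assumed simple hash of that path, and the rule for setting the override bit (set once a branch is seen with a second target) is likewise an assumption. All names are hypothetical.

```python
class TargetPredictor:
    """BTB with a per-entry override bit plus a path-indexed MTT (sketch)."""

    def __init__(self, mtt_size=64, path_depth=2):
        self.btb = {}                 # branch address -> [last target, use_mtt bit]
        self.mtt = [None] * mtt_size  # targets only, indexed by code path
        self.path = []                # addresses of the most recent branches
        self.path_depth = path_depth

    def _mtt_index(self):
        # Assumed hash of the recent path; a real design would choose its own.
        h = 0
        for address in self.path:
            h = (h * 31 + address) % len(self.mtt)
        return h

    def predict(self, branch_address):
        """Return the predicted target, or None if the branch is unknown."""
        entry = self.btb.get(branch_address)
        if entry is None:
            return None
        target, use_mtt = entry
        if use_mtt:
            mtt_target = self.mtt[self._mtt_index()]
            if mtt_target is not None:
                return mtt_target     # the MTT overrides the BTB target
        return target

    def update(self, branch_address, actual_target):
        """Record the resolved target, then extend the path history."""
        entry = self.btb.get(branch_address)
        if entry is None:
            self.btb[branch_address] = [actual_target, False]
        elif entry[1] or entry[0] != actual_target:
            entry[1] = True           # branch has shown more than one target
            self.mtt[self._mtt_index()] = actual_target
            entry[0] = actual_target  # BTB still keeps the last known target
        self.path = (self.path + [branch_address])[-self.path_depth:]
```

In this sketch only branches that have shown more than one target consume MTT entries, so single-target branches continue to be served by the BTB alone, and the number of targets a branch can hold is bounded by the MTT size rather than by the width of a BTB entry.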
There have been many methods to improve branch prediction, including those in the patents discussed below; however, they either focus on the prediction accuracy of direction or address branch targets through the use of a BTB or a call-return stack, in which a return address (the next sequential address with respect to the branch address) is placed on the return stack whenever a call branch is encountered. U.S. Pat. No. 6,289,444, “Method and Apparatus for Subroutine Call-Return Prediction,” targets prediction by creating a table to determine which branches ‘call’ a routine and where the routine will have its ‘return’. U.S. Pat. No. 5,935,241, “Multiple Global Pattern History Tables for Branch Prediction in a Microprocessor,” deals with increasing the accuracy of guessing the direction (taken or not taken) of a given branch. U.S. Pat. No. 5,903,750, “Dynamic Branch Prediction for Branch Instructions with Multiple Targets,” deals with creating a BTB that has multiple entries for each branch that is placed into the BTB. These multiple entries in U.S. Pat. No. 5,903,750 are narrowed down to one entry through the use of the past history leading to the given branch. Such an implementation greatly increases the BTB size for all branches written into it, whether or not they have multiple targets; furthermore, it limits the number of targets a branch can have to the number of entries allotted to each individual BTB entry.
The implementation presented in this invention allows for a total number of dynamic targets based on the size of a second lookup table, the MTT. The assignment of targets to a branch is dynamic: each branch can be allocated multiple targets, ranging in number from 0 to the size of the table.