A basic pipeline microarchitecture of a microprocessor processes one instruction at a time. The basic dataflow for an instruction follows the steps of: instruction fetch, decode, address computation, data read, execute, and write back. Each stage within a pipeline, or pipe, occurs in order, and hence a given stage cannot progress unless the stage in front of it is progressing. To achieve the highest performance for such a design, one instruction enters the pipeline every cycle. Whenever the pipeline must be delayed or cleared, latency is added, and this latency can be observed by evaluating the performance of a microprocessor carrying out a task. While many complexities can be added to such a pipe design, this sets the groundwork for the branch prediction theory related to the present invention.
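By way of illustration, the in-order flow described above can be sketched as follows. This is a minimal model, not part of the described design: the stage names follow the steps listed above, and the one-cycle-per-stage assumption is illustrative.

```python
# Minimal in-order pipeline sketch: one instruction may enter each cycle,
# and each instruction advances exactly one stage per cycle.
STAGES = ["fetch", "decode", "addr", "read", "execute", "writeback"]

def run_pipeline(instructions):
    """Return a cycle-by-cycle trace: {cycle: {stage_name: instruction}}."""
    trace = {}
    total_cycles = len(instructions) + len(STAGES) - 1
    for cycle in range(total_cycles):
        occupancy = {}
        for i, insn in enumerate(instructions):
            stage_idx = cycle - i          # instruction i enters at cycle i
            if 0 <= stage_idx < len(STAGES):
                occupancy[STAGES[stage_idx]] = insn
        trace[cycle] = occupancy
    return trace

trace = run_pipeline(["i0", "i1", "i2"])
```

The trace shows that with no stalls the pipe reaches steady state, with a new instruction in fetch and an older one in each later stage on every cycle.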
There are many dependencies between instructions which prevent the optimal case of a new instruction entering the pipe every cycle. These dependencies add latency to the pipe. One category of latency contribution deals with branches. A branch is an instruction which can either fall through to the next sequential instruction, that is, be “not taken,” or branch off to another instruction address, that is, be “taken,” and carry out execution on a different sequence of code. When a branch is decoded, it can therefore go either of two ways.
At decode time, the branch is detected and must wait to be resolved in order to know the proper direction in which the instruction stream is to proceed. Waiting, potentially for multiple pipeline stages, for the branch to resolve adds latency to the pipeline. To overcome this latency, the direction of the branch can be predicted so that the pipe begins decoding down either the “taken” or the “not taken” path. At branch resolution time, the guessed direction is compared to the actual direction the branch takes. If the actual direction and the guessed direction are the same, then the latency of waiting for the branch to resolve has been removed from the pipeline. If the actual and predicted directions miscompare, then decoding has proceeded down the improper path, and all instructions on this improper path behind the mispredicted branch must be flushed out of the pipe. The pipe must then be restarted at the correct instruction address to begin decoding the actual path the given branch is supposed to take.
Because of the controls involved with flushing the pipe and starting over, an improper guess carries a penalty, and more latency is added to the pipe than if decoding had simply waited for the branch to resolve the correct path. With a proportionally higher rate of correctly guessed paths, the latency removed from the pipe by guessing the correct direction outweighs the latency added to the pipe by guessing the direction incorrectly.
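The tradeoff above can be expressed as simple expected-cost arithmetic. The sketch below is illustrative only; the penalty and resolve-latency cycle counts are assumed example values, not figures from the described machine.

```python
def expected_branch_cost(p_correct, mispredict_penalty):
    """Expected cycles lost per predicted branch: a correct guess costs
    nothing, an incorrect guess pays the full flush-and-restart penalty."""
    return (1.0 - p_correct) * mispredict_penalty

def prediction_pays_off(p_correct, mispredict_penalty, resolve_latency):
    """Prediction wins when its expected cost is below the fixed cost of
    always stalling until the branch resolves."""
    return expected_branch_cost(p_correct, mispredict_penalty) < resolve_latency
```

For example, with an assumed 10-cycle misprediction penalty and a 5-cycle resolve stall, prediction pays off once more than half of the guesses are correct.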
In order to improve the accuracy of a guess for a branch, a Branch History Table (BHT) can be implemented which allows the direction of a branch to be guessed based on the direction the branch previously went. If the branch is always taken, as is the case for a subroutine return, then the branch will always be guessed as taken. IF/THEN/ELSE structures are more complex in their behavior: a branch may be always taken, sometimes taken and sometimes not taken, or always not taken. The implementation of the dynamic branch predictor determines how well the BHT table predicts the direction of such branches.
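One common dynamic-predictor implementation of the kind described is a table of two-bit saturating counters. The sketch below is one minimal, assumed realization: the table size, the address-modulo indexing, and the weakly-taken initial state are all illustrative choices, not details of the present invention.

```python
class BHT:
    """Branch History Table sketch using 2-bit saturating counters.
    Counter values 0-1 predict not taken; values 2-3 predict taken."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries      # initialize weakly taken

    def _index(self, address):
        # Simplified indexing: low-order bits of the branch address.
        return address % self.entries

    def predict(self, address):
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        """On resolution, nudge the counter toward the actual direction."""
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because the counters saturate, a branch that is almost always taken tolerates an occasional not-taken outcome without flipping its prediction, which suits the IF/THEN/ELSE behavior described above.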
When a branch is guessed taken, the target of the branch is to be decoded. The target of the branch is acquired by making a fetch request to the instruction cache for the address which is the target of the given branch. Making the fetch request out to the cache involves minimal latency if the target address is found in the first level of cache. If there is not a hit in the first level of cache, then the fetch continues through the memory and storage hierarchy of the machine until the instruction text for the target of the branch is acquired. Therefore, any taken branch detected at decoding has at least a minimal latency associated with it that is added to the time it takes the pipeline to process the given instruction. Upon missing a fetch request in the first level of the memory hierarchy, the latency penalty the pipeline pays grows the farther up the hierarchy the fetch request must progress until a hit occurs. In order to hide part or all of the latency associated with fetching a branch target, a Branch Target Buffer (BTB) table can work in parallel with a BHT table.
Given the address which is currently being decoded, the BTB table can search forward from this point for the next instruction address which contains a branch. Along with the instruction address of each branch, the target of the branch is also stored with each entry in the BTB table. With the target being stored, the target address can be fetched before the branch is ever decoded. By fetching the target address ahead of decoding, latencies associated with cache misses can be hidden by the time that elapses between the fetch request and the decoding of the target of the branch.
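The forward search described above can be sketched as follows. This is a deliberately simplified model: the BTB is represented as a flat address-to-target mapping with no index/tag split, and the 64-byte search window is an assumed illustrative parameter.

```python
def next_predicted_branch(btb, current_address, window=64):
    """Scan forward from the address now being decoded for the nearest
    BTB hit. The stored target can then be fetched from the instruction
    cache before the branch itself reaches decode.

    `btb` is a dict mapping branch address -> target address (simplified)."""
    for addr in range(current_address, current_address + window):
        if addr in btb:
            return addr, btb[addr]     # (branch address, its target)
    return None                        # no predicted branch in the window
```

A usage sketch: if the BTB holds an entry for a branch at `0x40` with target `0x100`, a search starting at `0x3C` returns that pair, and the fetch for `0x100` can be issued immediately.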
In designing a BTB table, the number of branches that can be stored therein is part of the equation that determines how beneficial the structure is. In general, a BTB table is indexed by part of an instruction address within the processor, and tag bits are stored in the BTB table such that the tag bits must match the remaining address bits of concern that were not used for the indexing. In order to improve the efficiency of the BTB table, it can be created with an associativity greater than one, so that multiple branch/target pairs can be stored for a given index into the array. To determine which entry, if any, is the correct one, the tag bits are used to select at most one entry from the multiple entries stored for a given index.
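The index/tag split and associativity greater than one can be sketched as below. The set count, way count, and modulo/division address split are assumed example parameters; a hardware table would slice specific bit fields of the instruction address.

```python
class SetAssociativeBTB:
    """BTB sketch: low address bits index a set, the remaining bits form
    the tag that must match; each set holds up to `ways` entries."""

    def __init__(self, sets=256, ways=2):
        self.sets = sets
        self.ways = ways
        self.table = [[] for _ in range(sets)]   # each entry: (tag, target)

    def _split(self, address):
        return address % self.sets, address // self.sets   # (index, tag)

    def lookup(self, address):
        """Return the stored target on a tag match, else None."""
        index, tag = self._split(address)
        for entry_tag, target in self.table[index]:
            if entry_tag == tag:
                return target
        return None

    def install(self, address, target):
        """Install at the front of the set; evict the oldest entry."""
        index, tag = self._split(address)
        row = self.table[index]
        row.insert(0, (tag, target))
        del row[self.ways:]                      # keep at most `ways` entries
```

With two ways, two branches whose addresses share an index can coexist; a third install into the same set evicts the oldest of the pair.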
When a branch is detected at decode time that was not found ahead of time by the asynchronous BTB/BHT table function, the branch is termed a surprise branch. A surprise branch is any branch which was not found by the dynamic branch prediction logic ahead of decode time. A branch is not predicted by the branch prediction logic because it was not found in the BTB/BHT tables, and there are two reasons this can occur. First, if a branch was never installed in the BTB/BHT tables, then it cannot be found. Second, the branch may reside in the BTB/BHT tables, but not enough processing time was available for the search to find the branch prior to its decode. In general, branch prediction search algorithms can have high throughput; however, the latency required to start a search can be considerably longer than that of starting instructions down the pipeline, relative to the time frame in which an instruction decodes.
Whenever a surprise branch is detected at decode time, an entry can be written into the BTB and BHT tables once the target and direction of the branch become known. Upon writing the entry into the tables, the entry can ideally be found the next time a search passes through the region of code containing the stated branch.
In the case in which a branch resides in the BTB/BHT tables but latency effects prevent the branch from being found in time, the branch is treated like a surprise branch, as it is no different from a branch which is not in the tables. Upon determining the target and direction of the branch, it will be written into the BTB/BHT tables. A standard method of entering a branch into the tables is to place it in the column (associativity way) that was least recently used, thereby keeping the branches which were most recently accessed in the tables. A read of the columns prior to the write is not performed to check for duplicates, because the additional reads beyond normal operation would cause latency delays which would further hinder branches from being found in time to be predicted. Hence, this would increase the quantity of surprise branches in a series of code, and increasing the number of surprise branches causes performance to decrease. To work around these latency issues, a recent entry queue has been designed to keep track of the recent entries into the BTB table. By means of this queue, additional reads from the BTB table are not required. Furthermore, such a queue is orders of magnitude smaller than a duplicate array or an additional read port on the given array. The area of a second full-size array or an additional read port can be deemed so great that the area of the machine spent for such a structure can be better spent elsewhere for higher performance gains. By adding a small recent entry queue, the area of the machine is kept modest while the performance delta between a queue and an additional read port is minimal, if not for the better.
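The role of the recent entry queue can be sketched as follows. This is a minimal, assumed model: the queue depth, the function names, and the representation of BTB writes as a list are illustrative, not details of the described hardware.

```python
from collections import deque

class RecentEntryQueue:
    """Small FIFO of recently installed branch addresses. Consulting it
    before a BTB write replaces the extra BTB reads that a duplicate
    check would otherwise require."""

    def __init__(self, depth=8):
        self.queue = deque(maxlen=depth)   # oldest entries age out

    def seen(self, address):
        return address in self.queue

    def record(self, address):
        self.queue.append(address)

def install_if_new(btb_writes, queue, address, target):
    """Skip the BTB write when this branch was installed recently;
    otherwise record it and perform the write."""
    if queue.seen(address):
        return False
    queue.record(address)
    btb_writes.append((address, target))
    return True
```

Because the queue is shallow, membership checks are cheap, and an address that ages out of the queue is simply written again, which is safe, if occasionally redundant.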
One problem encountered with the BTB table is that multiple instantiations of a branch entry can be written into the BTB table at a high frequency based on code looping patterns. This hinders BTB table performance by replacing valid entries of other branches with duplicate entries of the same branch. Thus, a clear need exists for a way to prevent multiple instantiations of a branch entry within the BTB table.