The present invention relates generally to the field of computing systems and to methods for improving instruction execution, for example, in updating a branch history table (“BHT”) used for predictive branching and improving throughput in pipelined processors.
Computer processors often use fetch engine architectures to speed up the execution of programs. The fetch engine architectures utilize fetch engines, instruction buffers and instruction caches to queue several instructions in a pipeline for future execution while the processor is simultaneously executing another instruction. Thus, when the processor finishes executing an instruction, the next instruction is available and ready for execution. Many modern computing systems utilize a processor having a pipelined architecture to increase instruction throughput.
Pipelining of instructions in an instruction cache may not be effective, however, when it comes to conditional jumps or branches. When a conditional jump is encountered, the next set of instructions to be executed will typically be either the instructions immediately following the conditional jump instruction in sequence, which are currently stored in the instruction cache, or a set of instructions at a different address, which may not be stored in the cache. If the next instruction to be executed is not located at an address within the instruction cache, the processor will be effectively paused (e.g., by executing no operations, commonly referred to as “NOP” instructions) for a number of clock cycles while the necessary instructions are loaded into the instruction cache.
Accordingly, when a conditional branch or jump is made, the processor is likely to have to wait a number of clock cycles while a new set of instructions is retrieved. This branch instruction delay is also known as a “branch penalty.” A branch penalty will typically be shorter when branching to an instruction already contained within the cache, and longer when the instruction must be loaded into the cache.
Several methods have been developed in an attempt to minimize the branch penalty. These methods include both hardware and software approaches. Hardware methods have included the development of processor instruction pipeline architectures that attempt to predict whether an upcoming branch in an instruction set will be taken, and pre-fetch or pre-load the necessary instructions into the processor's instruction buffer.
In one pipeline architecture approach, a branch history table (“BHT”) is used to predict when a branch may be taken. A BHT may be in the form of a table of bits, wherein each entry corresponds to a branch instruction for the executing program, and each bit represents a single branch or no-branch decision. The contents of the BHT could indicate what happened on the last branch decision, and function to predict what will happen on the next branch. Some BHTs provide only a single bit for each branch instruction; thus, the prediction for each occurrence of the branch instruction corresponds to whatever happened last time. This is also known as 1-bit dynamic prediction. Using 1-bit prediction, if a conditional branch is taken, it is predicted to be taken the next time. Otherwise, if the conditional branch is not taken, it is predicted to not be taken the next time.
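By way of illustration only, the 1-bit scheme described above may be sketched in software as follows. The class name `OneBitBHT`, the table size, the modulo indexing by branch address, the initial "not taken" state, and the helper `count_mispredictions` are all assumptions made for this sketch and are not taken from any particular hardware implementation.

```python
class OneBitBHT:
    """Toy 1-bit branch history table: each entry remembers the last outcome."""

    def __init__(self, num_entries=16):
        # Assumption: every entry starts out as "not taken" (False).
        self.bits = [False] * num_entries

    def _index(self, branch_address):
        # Assumption: entries are selected by simple modulo indexing.
        return branch_address % len(self.bits)

    def predict(self, branch_address):
        # The prediction is simply whatever happened last time.
        return self.bits[self._index(branch_address)]

    def update(self, branch_address, taken):
        # Record the actual outcome for use as the next prediction.
        self.bits[self._index(branch_address)] = taken


def count_mispredictions(bht, branch_address, outcomes):
    """Replay a sequence of actual outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if bht.predict(branch_address) != taken:
            misses += 1
        bht.update(branch_address, taken)
    return misses


# A loop-closing branch: taken three times, then not taken, run twice.
loop_outcomes = [True, True, True, False] * 2
one_bit_misses = count_mispredictions(OneBitBHT(), 0x40, loop_outcomes)
```

In this sketch, each loop exit causes one misprediction, and the first iteration of the next pass causes a second one, because the single bit was flipped by the exit.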
A BHT can also be used to perform 2-bit dynamic prediction. In 2-bit dynamic prediction, if a given conditional branch is taken twice in succession, it is predicted to be taken next time. Likewise, if the branch is not taken twice in succession, it is predicted to not be taken the next time. If the branch is taken once and not taken once in the prior two instances, then the prediction for the next instance is the same as the previous prediction. Generally, if the branch is used to close a loop, 2-bit dynamic prediction using a BHT is better than 1-bit because the branch is not taken only once per pass through the loop, and that single not-taken outcome does not change the prediction. A BHT uses a significant amount of processor hardware resources, and may still result in significant branch penalties.
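By way of illustration only, the 2-bit scheme described above may be sketched as follows. The sketch assumes one possible encoding of the two bits, namely a current-prediction bit and a last-outcome bit, and reads "the same as the last time" as "the same as the previous prediction". Other encodings, such as a 2-bit saturating counter, are also in common use and behave slightly differently. The names `TwoBitBHT` and `count_mispredictions`, the table size, and the initial state are hypothetical.

```python
class TwoBitBHT:
    """Toy 2-bit BHT: each entry holds [prediction, last_outcome]."""

    def __init__(self, num_entries=16):
        # Assumption: entries start as "predict not taken, last not taken".
        self.entries = [[False, False] for _ in range(num_entries)]

    def _entry(self, branch_address):
        # Assumption: entries are selected by simple modulo indexing.
        return self.entries[branch_address % len(self.entries)]

    def predict(self, branch_address):
        return self._entry(branch_address)[0]

    def update(self, branch_address, taken):
        entry = self._entry(branch_address)
        # The prediction changes only when the same outcome has now
        # occurred twice in succession; otherwise it is left as it was.
        if taken == entry[1]:
            entry[0] = taken
        entry[1] = taken


def count_mispredictions(bht, branch_address, outcomes):
    """Replay a sequence of actual outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if bht.predict(branch_address) != taken:
            misses += 1
        bht.update(branch_address, taken)
    return misses


# A loop-closing branch: taken three times, then not taken, run three times.
loop_outcomes = [True, True, True, False] * 3
two_bit_misses = count_mispredictions(TwoBitBHT(), 0x40, loop_outcomes)
```

Under these assumptions, once the entry has warmed up, only the loop exit itself is mispredicted on each pass; the first iteration of the following pass is still predicted as taken.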
When a BHT predicts branching incorrectly, also known as “branch misdirection,” the BHT should be updated. This involves rewriting the bitmap for the particular branch instruction that is being executed in accordance with the particular prediction scheme being used.
Instruction pipelines or instruction caches often use 2-port random access memory (“RAM”), which allows for simultaneous “fetches” (reads) and “updates” (writes), thereby improving processor throughput in general. Processor architectures using 2-port RAM can be expensive, however, both in terms of actual cost and in design time. Using a 2-port RAM simplifies the situation that occurs when a BHT is used and a conditional jump instruction causes a jump to an instruction that is in the instruction cache. In this case, the 2-port RAM permits a new instruction to be fetched at the same time the BHT is updated.
Use of 1-port RAM instead of a 2-port RAM can be preferred because of lower cost and design time. A 1-port RAM, however, does not allow simultaneous fetches (reads) and updates (writes). Use of 1-port RAM has several potential drawbacks, such as reducing processor pipeline throughput as well as the BHT “hit ratio,” i.e., the proportion of “correct” branching predictions made due to the BHT. As an example, in the previously mentioned condition, when a BHT is used and a conditional jump instruction causes a jump to an instruction that is within the instruction cache, a problem arises. In this case, the fetching of the next instruction and the updating of the BHT cannot occur at the same time. This can adversely affect processor performance.
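By way of illustration only, the contention described above may be approximated with a simple cycle-counting model. The model assumes that each RAM port services one access per clock cycle, that a 2-port RAM can service one fetch and one update in the same cycle, and that a 1-port RAM must give a pending BHT update a cycle of its own. The function name `cycles_needed` and its parameters are hypothetical, and the model ignores all other pipeline effects.

```python
def cycles_needed(num_fetches, update_at, ports):
    """Count RAM cycles needed to service fetches plus pending BHT updates.

    num_fetches: number of instruction fetches (reads) to perform.
    update_at:   set of fetch indices at which a BHT update (write)
                 is also pending.
    ports:       1 for single-port RAM, 2 for dual-port RAM.
    """
    if ports not in (1, 2):
        raise ValueError("ports must be 1 or 2")

    cycles = 0
    for i in range(num_fetches):
        cycles += 1  # the fetch itself always occupies one cycle
        if i in update_at and ports == 1:
            # With a single port, the write cannot overlap the read,
            # so the update consumes an additional cycle.
            cycles += 1
    return cycles


# Ten fetches, with BHT updates pending alongside fetches 2 and 5.
dual_port_cycles = cycles_needed(10, {2, 5}, ports=2)
single_port_cycles = cycles_needed(10, {2, 5}, ports=1)
```

In this model, every BHT update that coincides with a fetch costs the 1-port configuration one extra cycle, which is the throughput loss the preceding paragraph refers to.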
Since BHT updating and instruction fetching both require RAM access, it is possible to significantly slow system performance by selection of an incorrect mode of updating the BHT. Such a system slowdown can be particularly severe in the case of 1-port RAM, since updates and fetches cannot be performed simultaneously.