The present invention relates generally to the field of computing systems, and more particularly to methods for improving instruction execution, for example by reducing branch instruction delays in pipelined processors.
Computer processors often use instruction buffers to speed up the execution of programs. Such a buffer, also referred to as an “instruction cache,” allows the processor to queue the next several instructions in a pipeline while the processor is simultaneously executing another instruction. Thus, when the processor finishes executing an instruction the next instruction in the cache is available and ready for execution.
Many modern computing systems utilize a processor having a pipelined architecture to increase instruction throughput, and such processors typically employ one or more instruction caches implemented in hardware or firmware.
Pipelining of instructions in an instruction cache may not be effective, however, when it comes to conditional jumps. When a conditional jump is encountered, the next set of instructions to be executed will typically be either the instructions immediately following the conditional jump instruction in sequence, which are currently stored in the instruction cache, or a set of instructions at a different address, which may not be stored in the cache. If the next instruction to be executed is not located at an address within the instruction cache, the processor will be effectively paused (e.g., by executing “NOP” instructions) for a number of clock cycles while the necessary instructions are loaded into the instruction cache.
Accordingly, when a conditional branch or jump is made, the processor is likely to have to wait a number of clock cycles while a new set of instructions is retrieved. This branch instruction delay is also known as a “branch penalty.” A branch penalty may be shorter when branching to an instruction already within the cache and longer when the instruction must be loaded into the cache.
As an illustrative example of a branch penalty, consider the following set of processor instructions:

Code:
        Inst A0
        Branch COND, L1
        Inst B0
L1:     Inst C0
        Inst C1
In this sample set of processor instructions, program execution will jump from the “Branch COND, L1” instruction to the “Inst C0” instruction at label L1 if COND is TRUE, or non-zero. Otherwise, program execution will proceed to the “Inst B0” instruction first. If the instructions at the L1 label are not in the processor's instruction buffer at the time the “Branch COND, L1” instruction is executed, the processor will have to read the instructions into the buffer, during which time no program instructions are executed. The processor clock cycles wasted by a processor awaiting instructions to be read into its instruction buffer are a measure of the branch penalty. To further illustrate a branch penalty using this scenario, consider the above set of processor instructions at execution time when the branch is taken and the instructions at L1 are not present in the processor's instruction buffer:
Execution:
        Instruction        Cycle
        Inst A0              1
        Branch to L1         2
        NOP: penalty         3
        NOP: penalty         4
        . . .
        NOP: penalty        16
        NOP: penalty        17
        Inst C0             18
        Inst C1             19
For simplicity, this example assumes a single clock cycle per instruction, branch or NOP. Also, the number of clock cycles needed to provide the instruction “Inst C0” is only for exemplary purposes, i.e., the “Inst C0” instruction may be available to the processor before or after clock cycle 18. In the above example, although the “Branch to L1” instruction was executed at clock cycle 2, the instruction “Inst C0” at L1 was not available to the processor in its instruction buffer until clock cycle 18. The branch penalty in this example is thus 16 clock cycles.
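The penalty arithmetic above can be checked directly. A minimal sketch, using only the exemplary cycle numbers from this scenario (they are illustrative figures, not measurements of any particular processor):

```python
# Exemplary cycle numbers from the scenario above (illustrative only).
branch_cycle = 2             # cycle at which "Branch to L1" executes
target_available_cycle = 18  # cycle at which "Inst C0" is finally available

# The branch penalty is the number of wasted (NOP) clock cycles between
# executing the branch and having the target instruction available.
branch_penalty = target_available_cycle - branch_cycle
print(branch_penalty)  # 16
```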
Several methods have been developed to minimize or eliminate the branch penalty. These methods include both hardware and software approaches. Hardware methods have included the development of processor instruction pipeline architectures that attempt to predict whether an upcoming branch in an instruction set will be taken, and pre-fetch or pre-load the necessary instructions into the processor's instruction buffer. In theory, pipelined processors can execute one instruction per machine clock cycle when a well-ordered, sequential instruction stream is executed. This may be accomplished even though each instruction itself may implicate or require a number of separate micro-instructions to be effectuated.
In one pipeline architecture approach, a fixed algorithm is employed to predict if an instruction branch will be taken. This approach has a drawback in that the fixed algorithm is not changeable, and thus cannot be optimized for each program executed on the processor.
Another hardware approach uses a branch history table (“BHT”) to predict when a branch may be taken. A BHT may be in the form of a table of bitmaps wherein each entry corresponds to a branch instruction of the executing program, and each bit represents a single branch or no-branch decision. Some BHTs provide only a single bit for each branch instruction, so the prediction for each occurrence of the branch instruction is simply whatever happened the last time. This is also known as 1-bit dynamic prediction. Using 1-bit prediction, if a conditional branch is taken, it is predicted to be taken the next time. Otherwise, if the conditional branch is not taken, it is predicted to not be taken the next time.
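The 1-bit scheme described above can be sketched in software as follows. This is an illustrative model only; a hardware BHT is a fixed-size structure indexed by bits of the branch address, which is simplified here to a dictionary:

```python
# Illustrative model of 1-bit dynamic branch prediction (not a hardware
# implementation; BHT indexing and sizing are simplified to a dict).
class OneBitPredictor:
    def __init__(self):
        self.table = {}  # branch address -> last outcome (True = taken)

    def predict(self, addr):
        # Predict whatever happened the last time; default to not-taken.
        return self.table.get(addr, False)

    def update(self, addr, taken):
        self.table[addr] = taken
```

For a loop-closing branch that is taken nine times and then not taken on exit, this scheme mispredicts twice per pass through the loop: once on re-entry (the recorded exit outcome is stale) and once on exit.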
A BHT is also used to perform 2-bit dynamic prediction. In 2-bit dynamic prediction, if a given conditional branch is taken twice in succession, it is predicted to be taken next time. Likewise, if the branch is not taken twice in succession, it is predicted to not be taken the next time. If the branch is both taken once and not taken once in the prior two instances, then the prediction for the next instance is the same as the last time. Generally, if the branch is used for loop or exception handling, 2-bit dynamic prediction using a BHT is better than 1-bit prediction because the branch direction deviates only once per loop or exception. Two-bit prediction tends to be more accurate but has a greater hardware cost; therefore, if the branch outcome does not change frequently, 1-bit prediction may be sufficient for many purposes.
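One common hardware realization of 2-bit dynamic prediction is a saturating two-bit counter per BHT entry. The sketch below uses that realization as an illustrative assumption; as before, table indexing is simplified to a dictionary:

```python
# Illustrative 2-bit saturating-counter predictor (one common realization
# of 2-bit dynamic prediction; BHT indexing simplified to a dict).
class TwoBitPredictor:
    def __init__(self):
        self.counters = {}  # branch address -> counter state in 0..3

    def predict(self, addr):
        # States 2 and 3 predict taken; states 0 and 1 predict not taken.
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        # Saturate at the ends so a single anomalous outcome does not
        # immediately flip a strongly held prediction.
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)
```

For the loop branch discussed above, the single not-taken exit only moves the counter from the strongly taken state to the weakly taken state, so the branch is still predicted taken on the next pass through the loop; the 1-bit scheme, by contrast, mispredicts twice per pass.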
A BHT uses a significant amount of processor hardware resources, and may also result in significant branch penalties.
In a software approach, an instruction is provided that causes a pre-fetch of instructions starting at a specified address. In one implementation, this approach takes the form of a HINT(ADDRESS) instruction, wherein the processor automatically pre-fetches the instructions beginning at ADDRESS as soon as the HINT instruction is encountered. By placing the HINT several instructions before the actual branching instruction, the programmer can reduce the number of clock cycles during which the processor is awaiting the fetch of the next instruction.
One drawback to the use of the HINT(ADDRESS) instruction is that it unconditionally causes the pre-fetching of instructions. Thus, whether or not the branch is taken, the instructions at the branch address will be pre-fetched. The programmer must therefore decide where to place the HINT(ADDRESS) instruction, and the programmer may not always make the best predictions as to when the branch will be taken. If the programmer is incorrect, i.e., if the HINT(ADDRESS) instruction is given and the branch is not taken, the pre-fetch is a wasted effort. A significant time penalty is thus incurred from having to squash the erroneous instruction, flush the pipeline and re-load the correct instruction sequence. Depending on the size of the pipeline, this penalty can be quite large.
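Whether a HINT-style pre-fetch pays off can be framed as an expected-cost comparison. The following is a hypothetical model for illustration only; the function name and the penalty figures are assumptions, not features of any particular instruction set:

```python
# Hypothetical expected-cost model for a HINT-style pre-fetch decision.
# All figures and names here are illustrative assumptions.
def expected_cycles(p_taken, miss_penalty, flush_penalty):
    # Without HINT: the full miss penalty is paid whenever the branch
    # is taken and the target is not in the instruction buffer.
    without_hint = p_taken * miss_penalty
    # With HINT: the taken path is pre-fetched, but a not-taken branch
    # costs squashing, flushing the pipeline, and re-loading the
    # sequential instruction path.
    with_hint = (1 - p_taken) * flush_penalty
    return without_hint, with_hint

# A branch taken 90% of the time favors placing the HINT...
without_hint, with_hint = expected_cycles(0.9, 16, 10)
print(with_hint < without_hint)  # True: the pre-fetch pays off here

# ...while a branch taken only 10% of the time does not.
without_hint, with_hint = expected_cycles(0.1, 16, 10)
print(with_hint < without_hint)  # False: the wasted pre-fetch dominates
```

This is the tradeoff the programmer must weigh when deciding where, and whether, to place the HINT instruction.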
The hardware and software approaches to minimizing or eliminating branch penalties also do not account for the fact that the probability of a branch condition being satisfied varies throughout the execution of a program.