Microprocessors typically have the capacity to make instruction prefetch requests before the instruction being prefetched is actually needed by the microprocessor. These requests are held pending until the instruction can be obtained, typically from external memory. Once an instruction is obtained, it is held in an instruction pipe to await execution. Microprocessors will sometimes execute a change of flow, also called a branch. A typical non-branching instruction assumes that the next instruction to be executed is located at the next sequential address following it. A branch instruction, on the other hand, specifies the address of the next instruction to be executed, which is typically not the next sequential address following the branch. One type of branch instruction, the conditional branch, branches based upon a specified condition. If the condition is true, the address of the next instruction to be executed is that specified by the conditional branch instruction. If the condition is false, the next instruction to be executed is assumed to be located at the next sequential address following the conditional branch instruction. The address specified by the branch instruction is referred to as the "branch destination". In a conditional branch, when the specified condition is true and the branch occurs, the outcome is referred to as a "branch taken". When the condition is false and the branch is not taken, the outcome is referred to as a "fall through", since instruction execution simply "falls through" to the instruction at the next sequential address following the branch instruction.
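The address selection described above can be sketched as follows. This is an illustrative Python model only, not a description of any particular microprocessor; the function name `next_address`, its parameters, and the fixed instruction size are assumptions introduced here for the example.

```python
def next_address(pc, size=4, is_branch=False, condition=False, destination=None):
    """Return the address of the next instruction to execute.

    pc          -- address of the current instruction
    size        -- instruction length in bytes (hypothetical fixed size)
    is_branch   -- True if the current instruction is a branch
    condition   -- outcome of the branch condition; an unconditional
                   branch can be modeled by always passing True
    destination -- the branch destination address
    """
    if is_branch and condition:
        return destination   # "branch taken"
    return pc + size         # sequential execution, or "fall through"
```

In this sketch a non-branching instruction and a conditional branch whose condition is false both yield the next sequential address, while a taken branch yields the branch destination.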
Modern processors are also typically pipelined so that the steps associated with performing an instruction, such as prefetch, decode, and execution, overlap in time with the adjacent steps of preceding instructions. For example, a sequence of instructions A, B, C might pipeline within a processor such that while A is executing, B is being decoded and C is being prefetched. As a result of the overlap, the effective execution time of an instruction is less than the sum of its prefetch, decode, and execute times. In most pipelined processors, prefetches are performed assuming that no branch will occur. This assumption allows instructions to be easily prefetched and decoded in advance of when they are needed, since the next address to prefetch is simply an increment of the previously prefetched address. When a branch occurs in a pipelined processor, any prefetched instructions are invalidated (or flushed) and prefetching restarts at the branch destination address. Since the pipeline has been flushed, the first instruction following the branch cannot overlap its prefetch and decode time with that of any previous instructions, and thus there is a delay associated with executing that instruction. This delay associated with pipelining is often referred to as the "branch penalty" and affects both conditional and unconditional branches.
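The effect of overlap and of the branch penalty can be illustrated with a simple cycle count. The sketch below is a deliberately idealized model, not a description of any real processor: it assumes three stages (prefetch, decode, execute), one cycle per stage, and a branch that is resolved only when it executes. The function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, stages=3):
    """Cycles needed when each instruction runs all stages with no overlap."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, taken_branches=0, stages=3):
    """Cycles needed in an idealized pipeline with one cycle per stage.

    Assumption: every taken branch flushes the pipe, so the instruction
    at the branch destination cannot overlap its earlier stages with any
    preceding instruction and costs (stages - 1) extra cycles.
    """
    if n_instructions == 0:
        return 0
    fill = stages                        # first instruction passes every stage
    overlapped = n_instructions - 1      # each later instruction adds one cycle
    penalty = taken_branches * (stages - 1)
    return fill + overlapped + penalty
```

Under these assumptions the sequence A, B, C takes 9 cycles without pipelining and 5 cycles with it; if B is a taken branch and C is the instruction at the branch destination, the count rises to 7, the two extra cycles being the branch penalty.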
In addition to the branch penalty, there can be a delay associated with stopping the prefetching of the fall through instruction path and starting the prefetch of the branch taken path. Typically there is a period of time between detecting a branch instruction and starting the prefetch for the branch destination. If an unnecessary prefetch were to be started and were to take longer than this period of time, it would needlessly delay the prefetch of the branch taken path. For the unconditional branch case, the instruction decode can be used to detect the branch and stop prefetch requests early in the instruction. For the conditional branch case, the correct decision of whether to stop prefetch requests cannot be made until after the specified condition test has been resolved. If the branch resolves to a fall through case, the best decision would have been to allow the prefetch. If the branch resolves to a branch taken case, the best decision would have been to not run the unnecessary prefetch.
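The cost of an unnecessary prefetch can be sketched numerically. The following model is an assumption made for illustration: it supposes that the unnecessary sequential prefetch starts at the moment the branch is detected, that it cannot be aborted, and that the destination prefetch must wait for it to complete. The function and parameter names are hypothetical.

```python
def taken_path_delay(prefetch_time, window):
    """Extra cycles by which the branch destination prefetch is delayed.

    prefetch_time -- cycles the unnecessary sequential prefetch occupies
    window        -- cycles between detecting the branch and the point at
                     which the destination prefetch could otherwise start
    """
    # A prefetch that finishes within the window costs nothing; only the
    # portion that runs past the window delays the branch taken path.
    return max(0, prefetch_time - window)
```

With a window of 3 cycles, for example, a 2-cycle prefetch adds no delay in this model, whereas a 5-cycle prefetch delays the branch taken path by 2 cycles.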
Several schemes have been suggested to help reduce branch delay. These methods involve detecting the branch as early as possible and attempting to predict whether the branch will be taken or will fall through. The schemes have included:
1) explicitly specifying within a field of the branch instruction whether to anticipate a branch taken or the fall through,
2) associating different conditional branch test conditions with their tendency to be executed more often as branch taken or as fall through,
3) prefetching both the branch taken and fall through paths, and
4) associating the branch forward or backward direction with their tendency to be executed more often as branch taken or as fall through.
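Schemes 1) and 4) can be illustrated with a short sketch. The text above does not state which direction is associated with which outcome; the code assumes the commonly used heuristic that backward branches (such as those closing a loop) are predicted taken and forward branches are predicted to fall through. The function name `predict_taken` and the `hint` parameter are hypothetical.

```python
def predict_taken(branch_address, destination, hint=None):
    """Statically predict whether a conditional branch will be taken.

    hint -- optional explicit prediction carried in a field of the branch
            instruction (scheme 1); True means "anticipate branch taken".
    """
    if hint is not None:
        return hint                      # scheme 1: explicit field wins
    # Scheme 4, under the assumed heuristic: a destination below the
    # branch address is a backward branch and is predicted taken.
    return destination < branch_address
```

A prediction of this kind could be used to decide whether to continue sequential prefetching or to begin prefetching at the branch destination before the condition is resolved.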