Central processing units (CPUs) normally predict the direction and target of branch instructions early in the processing pipeline in order to boost performance. Information about the type, location, and target of a branch instruction is typically cached in a branch target buffer (BTB), which is accessed using an instruction fetch address. The BTB uses either a content addressable memory (CAM) or a set associative structure to detect whether it contains a branch that maps to the current fetch window. A conventional BTB is typically a large structure and, when combined with a branch direction predictor, results in at least a one-cycle penalty (i.e., bubble) for a predicted-taken branch. In some cases, the conventional BTB may even incur a penalty for a predicted not-taken branch.
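For illustration only, the set associative lookup described above can be sketched in software as follows. This is a minimal model under stated assumptions, not a description of any particular hardware design: the fetch window size, number of sets, associativity, FIFO replacement policy, and all names (`BTBEntry`, `SetAssociativeBTB`, `insert`, `lookup`) are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed parameters; real designs choose their own values.
FETCH_WINDOW_BYTES = 32
NUM_SETS = 64
NUM_WAYS = 4


@dataclass
class BTBEntry:
    tag: int      # upper fetch-address bits identifying the fetch window
    offset: int   # byte offset of the branch inside the fetch window
    kind: str     # branch type, e.g. "cond", "jump", "call", "ret"
    target: int   # cached branch target address


class SetAssociativeBTB:
    def __init__(self) -> None:
        # Each set holds at most NUM_WAYS entries.
        self.sets: List[List[BTBEntry]] = [[] for _ in range(NUM_SETS)]

    @staticmethod
    def _index_and_tag(addr: int) -> Tuple[int, int]:
        # Split the fetch-window number into a set index and a tag.
        window = addr // FETCH_WINDOW_BYTES
        return window % NUM_SETS, window // NUM_SETS

    def insert(self, branch_addr: int, kind: str, target: int) -> None:
        index, tag = self._index_and_tag(branch_addr)
        offset = branch_addr % FETCH_WINDOW_BYTES
        ways = self.sets[index]
        # Drop any stale entry for the same branch before re-inserting it.
        ways[:] = [e for e in ways if not (e.tag == tag and e.offset == offset)]
        if len(ways) == NUM_WAYS:
            ways.pop(0)  # assumed FIFO replacement
        ways.append(BTBEntry(tag, offset, kind, target))

    def lookup(self, fetch_addr: int) -> Optional[BTBEntry]:
        # Return the first branch at or after fetch_addr in its fetch window,
        # or None if the BTB holds no such branch.
        index, tag = self._index_and_tag(fetch_addr)
        start = fetch_addr % FETCH_WINDOW_BYTES
        hits = [e for e in self.sets[index]
                if e.tag == tag and e.offset >= start]
        return min(hits, key=lambda e: e.offset, default=None)
```

In this sketch, a lookup compares the tag of every way in the indexed set, which loosely mirrors the parallel comparisons a hardware BTB performs for each fetch address. The model does not capture timing, so it says nothing about the bubble itself.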
Some attempts have been made to address the penalty by using a loop buffer or similar structure to hide the predicted-taken branch bubble, but these approaches have limitations. A loop buffer requires that all of the instructions in the loop, not just the branch instructions, fit within the loop buffer. Smaller and simpler BTBs that do not incorporate a conditional branch predictor cannot accurately predict branches with dynamic outcomes, which wastes both performance and energy. Furthermore, smaller and simpler BTBs that do not employ links waste energy on CAM operations.