Microprocessors perform computational tasks in a wide variety of applications. Improved processor performance is almost always desirable, to allow for faster operation and/or increased functionality through software changes. In many embedded applications, such as portable electronic devices, conserving power and achieving faster throughput are also goals in processor design and implementation.
Many modern processors employ a pipelined architecture, where sequential instructions, each having multiple execution steps, are overlapped in execution. For improved performance, the instructions should flow continuously through the pipeline. Any situation that causes instructions to stall in the pipeline can detrimentally influence performance. If instructions are flushed from the pipeline and subsequently re-fetched, both performance and power consumption suffer.
Most programs include indirect branch instructions where the actual branching behavior is not known until the indirect branch instruction is evaluated deep in the pipeline. To avoid the stall that would result from waiting for actual evaluation of the indirect branch instruction, modern processors may employ some form of branch prediction, whereby the branching behavior of indirect branch instructions is predicted early in the pipeline. Based on the predicted branch evaluation, the processor speculatively fetches (prefetches) and processes instructions from a predicted address: either the branch target address (if the branch is predicted to be taken) or the next sequential address after the branch instruction (if the branch is predicted not to be taken). Determining whether an indirect branch instruction is to be taken or not taken is referred to as determining the direction of the branch.
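The choice of predicted address described above can be sketched as follows. This is a minimal illustration rather than a processor implementation; the function name `next_fetch_address`, the fixed 4-byte instruction size, and the parameter names are assumptions made for the example.

```python
INSTRUCTION_SIZE = 4  # assumed fixed instruction width in bytes (hypothetical)


def next_fetch_address(branch_address: int,
                       predicted_taken: bool,
                       predicted_target: int) -> int:
    """Return the address from which instructions are speculatively fetched."""
    if predicted_taken:
        # Predicted taken: fetch from the branch target address.
        return predicted_target
    # Predicted not taken: fetch the next sequential address after the branch.
    return branch_address + INSTRUCTION_SIZE
```

If the prediction later proves wrong, the instructions fetched from this address are the ones that must be flushed and re-fetched.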
Conventional branch prediction techniques include a branch target access cache (BTAC) positioned in a fetch stage of a processor pipeline and branch prediction logic. The BTAC stores the target address of a previously fetched instruction and is indexed by the instruction's address. Instruction caches (I-caches) are conventionally populated with instructions of various instruction types which were retrieved from a higher-order cache or memory. BTACs are conventionally populated after an indirect branch instruction is actually resolved further down in the processor pipeline.
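A BTAC of the kind described above can be modeled, for illustration only, as a small table indexed by the instruction address. The sketch below assumes a direct-mapped organization, a tag check, and a 4-byte instruction size; the class name `DirectMappedBTAC` and its methods are hypothetical and are not taken from any particular design.

```python
from typing import List, Optional, Tuple


class DirectMappedBTAC:
    """Toy direct-mapped BTAC holding one (tag, target) entry per index."""

    def __init__(self, num_entries: int = 8, instruction_size: int = 4) -> None:
        self.num_entries = num_entries
        self.instruction_size = instruction_size
        self.entries: List[Optional[Tuple[int, int]]] = [None] * num_entries

    def _index_and_tag(self, address: int) -> Tuple[int, int]:
        # Low-order instruction-address bits select the entry; the rest form the tag.
        word = address // self.instruction_size
        return word % self.num_entries, word // self.num_entries

    def update(self, branch_address: int, target_address: int) -> None:
        """Populate an entry once the branch has actually been resolved."""
        index, tag = self._index_and_tag(branch_address)
        self.entries[index] = (tag, target_address)

    def lookup(self, fetch_address: int) -> Optional[int]:
        """Return the stored target address on a hit, or None on a miss."""
        index, tag = self._index_and_tag(fetch_address)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None
```

In this sketch, `update` stands in for the population step that occurs after resolution further down the pipeline, and `lookup` stands in for the access made in the fetch stage.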
In operation, conventional branch prediction techniques perform address lookups on prefetched instructions in both a BTAC and an I-cache in parallel. If there is a miss in the BTAC, these conventional branch prediction techniques have consumed power in the BTAC lookup without finding a match. If there is a hit in the BTAC, the instruction at the looked-up address may be considered to be an indirect branch instruction. After the BTAC lookup, conventional techniques invoke the branch prediction logic to determine whether the branch whose target address was retrieved from the BTAC should be predicted taken or not taken. If the branch prediction logic predicts taken, the branch prediction logic redirects instruction flow by retrieving instructions beginning from the branch target address.
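The sequence described above (BTAC lookup, then branch prediction, then possible redirection) can be sketched as a single fetch step. The sketch makes several assumptions that are not stated in the text: the BTAC is reduced to a dictionary from instruction address to target address, the branch prediction logic is represented by two-bit saturating counters (values 0 to 3, with 2 or more meaning predicted taken), and instructions are 4 bytes wide. The name `fetch_step` is hypothetical.

```python
from typing import Dict, Tuple

INSTRUCTION_SIZE = 4  # assumed fixed instruction width in bytes (hypothetical)


def fetch_step(fetch_address: int,
               btac: Dict[int, int],
               counters: Dict[int, int]) -> Tuple[int, bool]:
    """Return (next_fetch_address, redirected) for one fetch address."""
    # BTAC lookup; in hardware this proceeds in parallel with the I-cache lookup.
    target = btac.get(fetch_address)
    if target is None:
        # BTAC miss: the lookup cost was paid, but fetch simply continues sequentially.
        return fetch_address + INSTRUCTION_SIZE, False
    # BTAC hit: treat the instruction as a branch and consult the prediction logic.
    predicted_taken = counters.get(fetch_address, 0) >= 2
    if predicted_taken:
        # Redirect instruction flow to the branch target address.
        return target, True
    return fetch_address + INSTRUCTION_SIZE, False
```

For example, with `btac = {0x1000: 0x2000}` and a counter value of 3 for address `0x1000`, the step redirects to `0x2000`; with a counter value of 1, fetch continues at `0x1004`.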
Any sequential instructions which entered the processor pipeline since the branch instruction are typically flushed from the pipeline. The path defined by the BTAC lookup and subsequent branch prediction is typically a critical speed path because the shorter the timing of this path, the fewer the instructions that need to be flushed from the processor pipeline before redirecting the instruction flow. Consequently, it is desirable for this path to be as short as possible to minimize the power expended in flushing instructions.
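The relationship between path length and flushed instructions can be illustrated with a deliberately simplified model. It assumes that the pipeline keeps fetching sequential instructions at a constant rate while the BTAC lookup and branch prediction are in progress, so that every cycle of redirect latency adds one fetch group to the flush. The function name and both parameters are hypothetical.

```python
def instructions_flushed(redirect_latency_cycles: int, fetch_width: int = 1) -> int:
    """Estimate the sequential instructions flushed on a predicted-taken redirect.

    redirect_latency_cycles: cycles between fetching the branch and redirecting
        instruction flow (the lookup-plus-prediction path), an assumed parameter.
    fetch_width: instructions fetched per cycle, an assumed parameter.
    """
    if redirect_latency_cycles < 0 or fetch_width < 1:
        raise ValueError("latency must be >= 0 and fetch width must be >= 1")
    # Each cycle spent on the path lets one more fetch group enter the pipeline.
    return redirect_latency_cycles * fetch_width
```

Under this model, shortening the path from three cycles to two on a processor that fetches two instructions per cycle reduces the flush from six instructions to four.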
Conventional techniques for reducing the time of the critical path include reducing the size of the BTAC and/or organizing the BTAC in a multi-way fashion. However, reducing the size of the BTAC reduces the number of potential hits and, thus, the probability of finding a branch target address in the BTAC, lowering the effectiveness of the BTAC as a whole. Furthermore, when the BTAC is organized in a multi-way fashion, indexing into the BTAC may become quicker, but the time spent comparing may be increased. In these situations, the BTAC may be slower than the I-cache, thus becoming the limiting factor in the parallel lookup portion of the critical path. Therefore, it is recognized that apparatus and methods are needed to reduce the time for redirecting instruction flow when an indirect branch instruction is found in a processor pipeline without decreasing the effectiveness of branch prediction.
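The multi-way trade-off mentioned above can be illustrated with a toy set-associative BTAC. With fewer sets, fewer address bits are needed for indexing, but every occupied way in the selected set must be compared against the tag. The sketch assumes first-in, first-out replacement and a 4-byte instruction size; the class name `SetAssociativeBTAC` and the `tag_comparisons` counter are hypothetical, and real hardware would perform the way comparisons in parallel rather than in a loop.

```python
from typing import List, Optional, Tuple


class SetAssociativeBTAC:
    """Toy multi-way BTAC that counts the tag comparisons made by lookups."""

    def __init__(self, num_sets: int = 4, num_ways: int = 2,
                 instruction_size: int = 4) -> None:
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.instruction_size = instruction_size
        self.sets: List[List[Tuple[int, int]]] = [[] for _ in range(num_sets)]
        self.tag_comparisons = 0

    def _index_and_tag(self, address: int) -> Tuple[int, int]:
        word = address // self.instruction_size
        return word % self.num_sets, word // self.num_sets

    def update(self, branch_address: int, target_address: int) -> None:
        index, tag = self._index_and_tag(branch_address)
        ways = self.sets[index]
        # Drop any stale entry for the same branch before re-inserting it.
        ways[:] = [entry for entry in ways if entry[0] != tag]
        if len(ways) == self.num_ways:
            ways.pop(0)  # evict the oldest entry (assumed FIFO policy)
        ways.append((tag, target_address))

    def lookup(self, fetch_address: int) -> Optional[int]:
        index, tag = self._index_and_tag(fetch_address)
        result = None
        for entry_tag, target in self.sets[index]:
            self.tag_comparisons += 1  # every occupied way is compared
            if entry_tag == tag:
                result = target
        return result
```

Two branches that map to the same set can both be held, which a direct-mapped table of the same number of sets could not do, at the cost of one comparison per occupied way on each lookup.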