As computer designers have pursued ever higher performance implementations of various computer architectures, several classes of techniques have been developed to achieve these gains. Broadly speaking, many of these techniques can be categorized as forms of pipelining, caching, and hardware parallelism.
For most architectures, branching or (sequential) control transfer instructions are an important class of instructions. This is particularly true for high performance implementations, because branches, jumps, calls, and returns break the sequential fetching of instructions. Such implementations attempt to maximize the pipelined, parallel processing of multiple instructions during each clock cycle, and therefore need to fetch subsequent instructions into the CPU pipeline at a rate approaching one per clock cycle.
While processing physically sequential instructions (in main memory), this fetch rate is relatively easily attained, since fetch address generation and instruction fetch latency can be appropriately pipelined. When a control transfer instruction is encountered, though, the instruction fetching process must be redirected to the target address of the instruction and fetching restarted. This redirection inherently prevents the correct next instruction from being fetched without delay, and the delay is often further exacerbated by the need to first calculate the target address.
One technique that can be used to eliminate or hide the negative performance impact of control transfer instructions is to utilize a branch target cache (BTC). This structure functions as a specialized form of instruction cache which holds only the first several target instructions of a control transfer. By associating each BTC entry with the address or program counter of the control transfer instruction, the BTC can be accessed based on the fetch address of a control transfer instruction. By doing this in parallel with the decoding of what turns out to be a control transfer, the first target instruction can be "fetched" out of the BTC immediately after fetching of the control transfer instruction and substituted for the sequentially fetched instruction which is no longer desired.
The contents of the BTC entry are "fetched" or transferred into an instruction queue from which the CPU instruction decoder's instruction register is loaded. Depending on the typical instruction fetch latencies within an implementation, this queue may have the capacity to hold several words of fetched instructions. The capacity of a BTC entry may similarly be several words in size so as to hide the latency involved in restarting instruction fetching after a control transfer.
In any case, all of the instruction words in a BTC entry must be transferred to the instruction queue, and within a sufficiently short period of time so as not to hold up the decoding of target instructions.
When a control transfer instruction is encountered without an associated BTC entry, instructions are, of course, fetched into the instruction queue directly from memory. In addition, a new BTC entry will be set up so that future encounters of the control transfer will find target instructions in the BTC. As instruction words are received from memory during this first encounter, they are also loaded into this new BTC entry.
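The hit and miss behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of any particular hardware: the names `BranchTargetCache`, `BTCEntry`, and `take_control_transfer` are hypothetical, the entry capacity of four words is an assumed value (the text says only "several words"), and memory is modeled as a simple mapping from word address to instruction word.

```python
from dataclasses import dataclass, field

WORDS_PER_ENTRY = 4  # assumed capacity; the text says only "several words"


@dataclass
class BTCEntry:
    # First few instruction words found at the target of a control transfer.
    words: list = field(default_factory=list)


class BranchTargetCache:
    """Toy BTC keyed by the address of the control transfer instruction."""

    def __init__(self):
        self.entries = {}

    def lookup(self, branch_pc):
        # Returns the entry for this control transfer, or None on a miss.
        return self.entries.get(branch_pc)

    def allocate(self, branch_pc):
        # Sets up a new, empty entry for a control transfer seen for the first time.
        entry = BTCEntry()
        self.entries[branch_pc] = entry
        return entry


def take_control_transfer(btc, branch_pc, target_addr, memory, queue):
    """Redirect fetching at a taken control transfer; return True on a BTC hit."""
    entry = btc.lookup(branch_pc)
    if entry is not None:
        # Hit: all words of the entry are transferred into the instruction queue.
        queue.extend(entry.words)
        return True
    # Miss: fetch target words from memory into the queue and, as each word
    # arrives, also load it into the newly allocated BTC entry.
    entry = btc.allocate(branch_pc)
    for offset in range(WORDS_PER_ENTRY):
        word = memory[target_addr + offset]
        queue.append(word)
        entry.words.append(word)
    return False
```

In this model the first encounter of a control transfer misses and fills the entry from memory, while a later encounter of the same control transfer supplies the same target words from the BTC without touching memory.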
While many CPU implementations employ a single instruction queue, higher performance designs will employ multiple queues in conjunction with parallel or interleaved fetching down multiple instruction streams. For example, the CPU will initially be processing one sequential instruction stream. When a conditional control transfer is encountered, a new stream is created starting at the target address. While the direction of the control transfer remains uncertain, fetching down both the sequential and target streams will be performed in conjunction with further pipeline processing down the stream predicted as more likely to be the correct branch direction. (Note that in the case of unconditional control transfers, the original queue can be immediately reused for the new fetch stream.)
Additional queues may also be present to support similar handling of further conditional control transfer instructions encountered on the predicted instruction stream while the first conditional control transfer remains unresolved. While a diminishing returns effect quickly sets in, usage of three instruction queues (and support for up to two unresolved conditional control transfers at a time) can be justified in very high performance implementations utilizing deep pipelining.
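The bookkeeping implied by three instruction queues can be sketched as follows. This is a simplified model under assumptions, not a definitive design: the class `StreamTracker` and its method names are hypothetical, each stream is represented only by its start address, and the model tracks which stream is being decoded and which not-predicted streams are still held in other queues.

```python
MAX_QUEUES = 3  # three queues allow up to two unresolved conditional transfers


class StreamTracker:
    """Toy model of instruction streams held in a fixed number of queues."""

    def __init__(self, start_addr):
        self.active = start_addr  # start address of the stream being decoded
        self.pending = []         # not-predicted streams, oldest transfer first

    def queues_in_use(self):
        return 1 + len(self.pending)

    def conditional(self, fallthrough, target, predict_taken):
        """Open a second stream at a conditional transfer.

        Returns False when no queue is free, in which case a real design
        would have to stall or stop fetching down the additional stream.
        """
        if self.queues_in_use() == MAX_QUEUES:
            return False
        if predict_taken:
            predicted, alternate = target, fallthrough
        else:
            predicted, alternate = fallthrough, target
        self.pending.append(alternate)  # keep fetching the other direction
        self.active = predicted         # decode continues down the prediction
        return True

    def unconditional(self, target):
        # No uncertainty: the current queue is simply reused for the new stream.
        self.active = target

    def resolve_oldest(self, mispredicted):
        """Resolve the oldest unresolved conditional transfer."""
        alternate = self.pending.pop(0)
        if mispredicted:
            # Switch to the other direction; streams opened by younger
            # transfers were on the wrong path and are discarded.
            self.active = alternate
            self.pending.clear()
```

For example, after two unresolved conditional transfers all three queues are occupied, so a third conditional transfer cannot open another stream until one of the earlier transfers is resolved.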
With CPU implementations such as this, though, substantial hardware costs can be incurred for the instruction queues themselves (each of which may be up to 32 bytes in size), for support circuitry for managing the queues, and for routing instruction words between various elements. As conditional control transfers are encountered, and as they are resolved as having been mispredicted, instruction processing must switch between the appropriate instruction queues. In parallel with this, interleaved fetching of instruction words into multiple queues and loading into associated new BTC entries must take place. As for routing, the instruction register must be loadable from each of the queues; each queue must be loadable from the BTC and from memory; and the BTC must be loadable from memory in parallel with the loading of a queue.
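To give a rough sense of how the routing requirement grows with the number of queues, the data paths listed above can be enumerated. This is only an illustrative count under the assumption that each source and destination pair needs its own path; the function name `routing_paths` and the element labels are hypothetical.

```python
def routing_paths(num_queues):
    """List the (source, destination) paths named in the routing requirement."""
    paths = []
    for q in range(num_queues):
        # The instruction register must be loadable from each queue.
        paths.append(("queue%d" % q, "instruction_register"))
        # Each queue must be loadable from the BTC and from memory.
        paths.append(("btc", "queue%d" % q))
        paths.append(("memory", "queue%d" % q))
    # The BTC must be loadable from memory in parallel with loading a queue.
    paths.append(("memory", "btc"))
    return paths


single = len(routing_paths(1))  # paths needed with one queue
triple = len(routing_paths(3))  # paths needed with three queues
```

Under this counting the number of paths is three per queue plus one, so moving from a single queue to three queues raises the count from 4 to 10.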