The present invention relates generally to the field of processors and in particular to a block-based branch target address cache having a sliding window organization.
Microprocessors perform computational tasks in a wide variety of applications. Improving processor performance is a perennial design goal, to drive product improvement by realizing faster operation and/or increased functionality through enhanced software. In many embedded applications, such as portable electronic devices, conserving power and reducing chip size are also important goals in processor design and implementation.
Most modern processors employ a pipelined architecture, where sequential instructions, each having multiple execution steps, are overlapped in execution. This ability to exploit parallelism among instructions in a sequential instruction stream contributes significantly to improved processor performance. Under ideal conditions and in a processor that completes each pipe stage in one cycle, following the brief initial process of filling the pipeline, an instruction may complete execution every cycle.
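As a rough illustration of the ideal case described above, the following sketch counts cycles for a hazard-free pipeline in which every stage takes one cycle. The function name and the example figures (100 instructions, 5 stages) are assumptions chosen for illustration only.

```python
def ideal_pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles to complete a stream of instructions with no hazards or stalls.

    The first instruction needs num_stages cycles to fill the pipeline;
    each later instruction then completes one cycle after the previous one.
    """
    if num_instructions <= 0:
        return 0
    return num_stages + (num_instructions - 1)


# Hypothetical figures: 100 instructions through a 5-stage pipeline.
pipelined = ideal_pipeline_cycles(100, 5)
# Without any overlap, each instruction would occupy all 5 stages alone.
unpipelined = 100 * 5
```

Under these assumed figures the overlapped execution approaches one completed instruction per cycle once the pipeline is full, which is the behavior the hazards discussed next prevent in practice.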
Such ideal conditions are never realized in practice, due to a variety of factors including data dependencies among instructions (data hazards), control dependencies such as branches (control hazards), processor resource allocation conflicts (structural hazards), interrupts, cache misses, and the like. A major goal of processor design is to avoid these hazards, and keep the pipeline “full.”
All real-world programs include branch instructions, which may comprise unconditional or conditional branch instructions. The actual branching behavior of branch instructions is often not known until the instruction is evaluated deep in the pipeline. This generates a control hazard that stalls the pipeline, as the processor does not know which instructions to fetch following the branch instruction, and will not know until the branch instruction evaluates. Most modern processors employ various forms of branch prediction, whereby the branching behavior of conditional branch instructions and branch target addresses are predicted early in the pipeline, and the processor speculatively fetches and executes instructions, based on the branch prediction, thus keeping the pipeline full. If the prediction is correct, performance is maximized and power consumption minimized. When the branch instruction is actually evaluated, if the branch was mispredicted, the speculatively fetched instructions must be flushed from the pipeline, and new instructions fetched from the correct branch target address. Mispredicted branches adversely impact processor performance and power consumption.
There are two components to a branch prediction: a condition evaluation and a branch target address. The condition evaluation (relevant only to conditional branch instructions, of course) is a binary decision: the branch is either taken, causing execution to jump to a different code sequence, or not taken, in which case the processor executes the next sequential instruction following the conditional branch instruction. The branch target address (BTA) is the address to which control branches for either an unconditional branch instruction or a conditional branch instruction that evaluates as taken. Some common branch instructions include the BTA in the instruction op-code, or include an offset whereby the BTA can be easily calculated. For other branch instructions, the BTA is not calculated until deep in the pipeline, and thus must be predicted.
One known technique of BTA prediction utilizes a Branch Target Address Cache (BTAC). A BTAC as known in the prior art is a fully associative cache, indexed by a branch instruction address (BIA), with each data location (or cache “line”) containing a single BTA. When a branch instruction evaluates in the pipeline as taken and its actual BTA is calculated, the BIA and BTA are written to the BTAC (e.g., during a write-back pipeline stage). When fetching new instructions, the BTAC is accessed in parallel with an instruction cache (or I-cache). If the instruction address hits in the BTAC, the processor knows that the instruction is a branch instruction (this is prior to the instruction fetched from the I-cache being decoded) and a predicted BTA is provided, which is the actual BTA of the branch instruction's previous execution. If a branch prediction circuit predicts the branch to be taken, instruction fetching begins at the predicted BTA. If the branch is predicted not taken, instruction fetching continues sequentially.
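The lookup and update behavior of such a BTAC can be sketched in software as follows. This is a minimal model, not a hardware description: the class and function names, the unbounded capacity, and the fixed 4-byte instruction size are all assumptions made for illustration.

```python
from typing import Dict, Optional


class BTAC:
    """Toy model of a fully associative BTAC: one BTA per entry, keyed by BIA."""

    def __init__(self) -> None:
        # Assumption: unbounded capacity. A real BTAC has a fixed number of
        # entries and a replacement policy, neither of which is modelled here.
        self._entries: Dict[int, int] = {}

    def update(self, bia: int, bta: int) -> None:
        """Record the actual BTA once a branch has evaluated as taken."""
        self._entries[bia] = bta

    def lookup(self, bia: int) -> Optional[int]:
        """Return the predicted BTA on a hit, or None on a miss."""
        return self._entries.get(bia)


def next_fetch_address(btac: BTAC, fetch_addr: int, predict_taken: bool,
                       instr_size: int = 4) -> int:
    """Pick the next fetch address from the BTAC result and the direction
    prediction (predict_taken stands in for the branch prediction circuit)."""
    predicted_bta = btac.lookup(fetch_addr)
    if predicted_bta is not None and predict_taken:
        return predicted_bta          # hit, predicted taken: redirect fetch
    return fetch_addr + instr_size    # miss or predicted not taken: sequential
```

For example, after `btac.update(0x1000, 0x2000)`, a fetch at `0x1000` that is predicted taken is redirected to `0x2000`, while the same fetch predicted not taken continues at `0x1004`.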
Note that the term BTAC is also used in the art to denote a cache that associates a saturation counter with a BIA, thus providing only a condition evaluation prediction (i.e., taken or not taken). That is not the meaning of this term as used herein.
High performance processors may fetch more than one instruction at a time from the I-cache. For example, an entire cache line, which may comprise, e.g., four instructions, may be fetched into an instruction fetch buffer, which sequentially feeds them into the pipeline. Patent application Ser. No. 11/089,072, assigned to the assignee of the present application and incorporated herein by reference, discloses a BTAC storing two or more BTAs in each cache line, and indexing a Branch Prediction Offset Table (BPOT) to determine which of the BTAs is taken as the predicted BTA on a BTAC hit. The BPOT avoids the costly hardware structure of a BTAC with multiple read ports, which would be necessary to access the multiple BTAs in parallel.
Patent application Ser. No. 11/382,527, “Block-Based Branch Target Address Cache,” assigned to the assignee of the present application and incorporated herein by reference, discloses a block-based BTAC storing a plurality of entries, each entry associated with a block of instructions, where one or more of the instructions in the block is a branch instruction that has been evaluated taken. The BTAC entry includes an indicator of which instruction within the associated block is a taken branch instruction, and the BTA of the taken branch. The BTAC entries are indexed by the address bits common to all instructions in a block (i.e., by truncating the lower-order address bits that select an instruction within the block). Both the block size and the relative block borders are thus fixed.
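The block-based indexing described above can be sketched as follows. The block geometry (four 4-byte instructions per block) and all names are assumptions for illustration; the referenced application is not quoted here and may differ in detail.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

INSTR_BYTES = 4        # assumption: fixed 4-byte instructions
INSTRS_PER_BLOCK = 4   # assumption: four instructions per block
BLOCK_BYTES = INSTR_BYTES * INSTRS_PER_BLOCK


@dataclass
class BlockEntry:
    branch_slot: int   # which instruction within the block is the taken branch
    bta: int           # branch target address of that taken branch


def split_address(addr: int) -> Tuple[int, int]:
    """Split an instruction address into (block index, slot within block).

    The block index is the address with its low-order, instruction-selecting
    bits truncated, so every instruction in a block shares the same index.
    """
    return addr // BLOCK_BYTES, (addr % BLOCK_BYTES) // INSTR_BYTES


class BlockBTAC:
    """Toy block-based BTAC: one entry per fixed, aligned instruction block."""

    def __init__(self) -> None:
        self._entries: Dict[int, BlockEntry] = {}

    def update(self, bia: int, bta: int) -> None:
        """Record a taken branch at address bia with target bta."""
        block, slot = split_address(bia)
        self._entries[block] = BlockEntry(branch_slot=slot, bta=bta)

    def lookup(self, fetch_addr: int) -> Optional[BlockEntry]:
        """Return the entry for the block containing fetch_addr, if any."""
        block, _ = split_address(fetch_addr)
        return self._entries.get(block)
```

With these assumed sizes, a taken branch at `0x1008` is recorded under the block starting at `0x1000` with `branch_slot` 2, and a fetch of any address from `0x1000` to `0x100C` finds that entry.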
The block-based BTAC works well where each block includes only one taken branch instruction. When two or more branch instructions within a block evaluate as taken, a decision must be made to store one branch instruction's BTA and not another's, leading to performance and power degradation when the other branch evaluates taken. Multiple BTAs could be stored in each BTAC entry; however, this wastes valuable silicon area in the usual case where instruction blocks do not include as many taken branch instructions as there are BTA storage locations in the BTAC entry.
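The conflict can be illustrated with a small self-contained sketch. It assumes 16-byte blocks and a "most recent taken branch wins" storage policy; both are hypothetical choices, since the text states only that some decision must be made.

```python
from typing import Dict, Optional, Tuple

BLOCK_BYTES = 16  # assumption: four 4-byte instructions per block

# One entry per block: (address of the stored taken branch, its BTA).
btac: Dict[int, Tuple[int, int]] = {}


def record_taken(bia: int, bta: int) -> None:
    """Store a taken branch, displacing any other branch in the same block
    (assumed policy: the most recently taken branch wins)."""
    btac[bia // BLOCK_BYTES] = (bia, bta)


def predict(bia: int) -> Optional[int]:
    """Return the stored BTA only if the block's entry belongs to this branch."""
    entry = btac.get(bia // BLOCK_BYTES)
    if entry is not None and entry[0] == bia:
        return entry[1]
    return None


# Two taken branches fall in the same block (0x1000 to 0x100F).
record_taken(0x1004, 0x2000)
record_taken(0x100C, 0x3000)   # overwrites the entry for the branch at 0x1004

first_branch_prediction = predict(0x1004)    # no BTA available any more
second_branch_prediction = predict(0x100C)   # BTA of the surviving branch
```

Under this assumed policy, the branch at `0x1004` loses its prediction as soon as the branch at `0x100C` is recorded, so its next taken execution incurs the misprediction cost described above.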