1. Technical Field
The present invention relates in general to data processing and, in particular, to branch prediction. Still more particularly, the present invention relates to a data processing system, processor and method of data processing with an improved branch target address cache (BTAC).
2. Description of the Related Art
A state-of-the-art microprocessor can comprise, for example, a cache for storing instructions and data, an instruction sequencing unit for fetching instructions from the cache, ordering the fetched instructions, and dispatching the fetched instructions for execution, one or more sequential instruction execution units for processing sequential instructions, and a branch processing unit (BPU) for processing branch instructions.
Branch instructions processed by the BPU can be classified as either conditional or unconditional branch instructions. Unconditional branch instructions are branch instructions that change the flow of program execution from a sequential execution path to a specified target execution path and that do not depend upon a condition supplied by the occurrence of an event. Thus, the branch specified by an unconditional branch instruction is always taken. In contrast, conditional branch instructions are branch instructions for which the indicated branch in program flow may be taken or not taken depending upon a condition within the processor, for example, the state of specified condition register bit(s) or the value of a counter.
Conditional branch instructions can be further classified as either resolved or unresolved based upon whether or not the condition upon which the branch depends is available when the conditional branch instruction is evaluated by the BPU. Because the condition upon which a resolved conditional branch instruction depends is known prior to execution, resolved conditional branch instructions can typically be executed and instructions within the target execution path fetched with little or no delay in the execution of sequential instructions. Unresolved conditional branches, on the other hand, can create significant performance penalties if fetching of sequential instructions is delayed until the condition upon which the branch depends becomes available and the branch is resolved.
Therefore, in order to minimize execution stalls, some processors speculatively predict the outcomes of unresolved branch instructions as taken or not taken. Utilizing the result of the prediction, the instruction sequencing unit is then able to fetch instructions within the speculative execution path prior to the resolution of the branch, thereby avoiding a stall in the execution pipeline in cases in which the branch is subsequently resolved as correctly predicted. Conventionally, prediction of unresolved conditional branch instructions has been accomplished utilizing static branch prediction, which predicts resolutions of branch instructions based upon criteria determined prior to program execution, or utilizing dynamic branch prediction, which predicts resolutions of branch instructions by reference to branch history accumulated on a per-address basis within a branch history table (BHT) and/or branch target address cache (BTAC).
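Dynamic prediction of the kind described above is commonly illustrated with a table of 2-bit saturating counters. The Python sketch below shows one such textbook scheme. It is not the mechanism of any particular processor. The class name `BranchHistoryTable`, the table size, the 4-byte instruction alignment, and the initial counter value are all assumptions chosen for illustration.

```python
class BranchHistoryTable:
    """Hypothetical BHT of 2-bit saturating counters, indexed per branch address."""

    def __init__(self, num_entries: int = 1024) -> None:
        # Assumption: a power-of-two size, so masking the address selects an entry.
        if num_entries <= 0 or num_entries & (num_entries - 1):
            raise ValueError("num_entries must be a positive power of two")
        self.mask = num_entries - 1
        # Counter states 0-1 predict not taken, 2-3 predict taken.
        # Assumption: every entry starts in the weakly-not-taken state (1).
        self.counters = [1] * num_entries

    def _index(self, branch_addr: int) -> int:
        # Assumption: 4-byte aligned instructions, so the two low bits are dropped.
        return (branch_addr >> 2) & self.mask

    def predict(self, branch_addr: int) -> bool:
        """Return True if the branch at branch_addr is predicted taken."""
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        """Record the resolved outcome, saturating the counter at 0 or 3."""
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


bht = BranchHistoryTable()
print(bht.predict(0x1000))  # False: entry starts weakly not taken
bht.update(0x1000, taken=True)
print(bht.predict(0x1000))  # True after one taken resolution
```

The two-bit hysteresis means that a single anomalous outcome, such as a loop exit, does not immediately flip a strongly biased prediction.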
Modern microprocessors require multiple cycles to fetch instructions from the instruction cache, scan the fetched instructions for branches, and predict the outcome of unresolved conditional branch instructions. If any branch is predicted as taken, instruction fetch is redirected to the new, predicted address. This process of changing which instructions are being fetched is called “instruction fetch redirect”. During the several cycles required for the instruction fetch, branch scan, and instruction fetch redirect, instructions continue to be fetched along the not-taken path. In the case of a predicted-taken branch, the instructions fetched along the not-taken path are discarded, resulting in decreased performance and wasted power dissipation.
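The cost described above can be put in rough numbers. The sketch below uses a simple hypothetical model: the fetch stage has a fixed width, and each predicted-taken branch is followed by a fixed number of cycles of not-taken-path fetching that is later discarded. The function name and its parameters are illustrative and are not drawn from any specific design.

```python
def discarded_fetch_slots(redirect_latency: int, fetch_width: int,
                          taken_branches: int = 1) -> int:
    """Estimate instruction fetch slots wasted on the not-taken path.

    Hypothetical model: after the cycle in which a predicted-taken branch is
    fetched, fetch continues sequentially for redirect_latency cycles (fetch,
    branch scan, and redirect). Each of those cycles fetches fetch_width
    instructions, and all of them are discarded. Instructions that follow the
    branch inside its own fetch group are ignored.
    """
    if redirect_latency < 0 or fetch_width <= 0 or taken_branches < 0:
        raise ValueError("invalid model parameters")
    return redirect_latency * fetch_width * taken_branches


# A 3-cycle redirect on an 8-wide fetch stage wastes 24 slots per taken branch.
print(discarded_fetch_slots(3, 8))       # 24
print(discarded_fetch_slots(3, 8, 100))  # 2400
```

Each discarded slot still costs an instruction-cache access. This is why the loss appears both as reduced performance and as wasted power.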
Several existing approaches are utilized to reduce or to eliminate the instruction fetch redirect penalty. One commonly used method is the implementation of a BTAC that in each entry caches the branch target address of a taken branch in association with the branch instruction's tag. In operation, the BTAC is accessed in parallel with the instruction cache and is searched for an entry whose instruction tag matches the fetch address transmitted to the instruction cache. If such a BTAC entry exists, instruction fetch is redirected to the branch target address provided in the matching BTAC entry. Because the BTAC access typically takes fewer cycles than the instruction fetch, branch scan, and taken branch redirect sequence, a correct BTAC prediction can improve performance by causing instruction fetch to begin at a new address sooner than if there were no BTAC present.
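A BTAC lookup of the kind described above can be sketched as a small direct-mapped table. It is consulted with the same fetch address that is sent to the instruction cache. This is a minimal illustration under stated assumptions. The class names `Btac` and `BtacEntry`, the direct-mapped organization, and the 32-byte fetch block are all hypothetical choices. Tagging entries by fetch block, rather than by individual branch instruction, is also an assumption. A hardware BTAC may additionally use set associativity, partial tags, and a replacement policy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BtacEntry:
    tag: int     # high-order fetch-block bits identifying the cached branch
    target: int  # branch target address of a previously taken branch


class Btac:
    """Hypothetical direct-mapped BTAC looked up with the fetch address."""

    def __init__(self, num_entries: int = 256, fetch_block_bytes: int = 32) -> None:
        if num_entries <= 0 or fetch_block_bytes <= 0:
            raise ValueError("sizes must be positive")
        self.num_entries = num_entries
        self.block_bytes = fetch_block_bytes
        self.entries: List[Optional[BtacEntry]] = [None] * num_entries

    def _split(self, fetch_addr: int) -> Tuple[int, int]:
        # Index with the low fetch-block bits and tag with the remaining high bits.
        block = fetch_addr // self.block_bytes
        return block % self.num_entries, block // self.num_entries

    def lookup(self, fetch_addr: int) -> Optional[int]:
        """Return the cached target on a tag match, else None (BTAC miss)."""
        index, tag = self._split(fetch_addr)
        entry = self.entries[index]
        if entry is not None and entry.tag == tag:
            return entry.target
        return None

    def install(self, fetch_addr: int, target: int) -> None:
        """Record a taken branch in this fetch block (overwrites any alias)."""
        index, tag = self._split(fetch_addr)
        self.entries[index] = BtacEntry(tag, target)

    def next_fetch_address(self, fetch_addr: int) -> int:
        target = self.lookup(fetch_addr)
        if target is not None:
            return target  # BTAC hit: redirect fetch to the cached target
        # BTAC miss: continue with the next sequential fetch block.
        return (fetch_addr // self.block_bytes + 1) * self.block_bytes


btac = Btac()
btac.install(0x2000, 0x8000)                 # a taken branch in block 0x2000 went to 0x8000
print(hex(btac.next_fetch_address(0x2000)))  # 0x8000 (hit: redirect)
print(hex(btac.next_fetch_address(0x2040)))  # 0x2060 (miss: sequential)
```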
However, in conventional designs, the BTAC access still generally requires multiple cycles, meaning that in the case of a BTAC hit at least one cycle elapses before the taken branch redirect. The interval between the BTAC access and the instruction fetch redirect represents a “bubble” during which no useful work is performed by the instruction fetch pipeline. Unfortunately, this interval tends to grow as processors achieve higher and higher operating frequencies and as BTAC sizes increase in response to the larger total number of instructions (i.e., “instruction footprint”) of newer software applications.
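The bubble can be expressed as a simple cycle count. The sketch below assumes a hypothetical pipeline in which the fetch address is presented to the instruction cache and the BTAC in the same cycle (cycle 0). It further assumes that a BTAC hit can redirect fetch in the cycle immediately after the lookup completes. `RedirectTiming`, `btac_timing`, and the example latencies are illustrative only.

```python
from typing import NamedTuple


class RedirectTiming(NamedTuple):
    bubble_cycles: int  # idle fetch cycles between BTAC access and redirect on a hit
    cycles_saved: int   # how much sooner a BTAC hit redirects fetch than the full sequence


def btac_timing(full_redirect_cycles: int, btac_access_cycles: int) -> RedirectTiming:
    """Hypothetical cycle model of a BTAC hit.

    The lookup result is usable at the end of cycle btac_access_cycles - 1, so
    the redirected fetch starts in cycle btac_access_cycles. Cycles 1 through
    btac_access_cycles - 1 form the bubble. Without a BTAC, the redirected
    fetch instead starts in cycle full_redirect_cycles, after the full fetch,
    branch scan, and redirect sequence.
    """
    if btac_access_cycles < 1 or full_redirect_cycles < btac_access_cycles:
        raise ValueError("need 1 <= btac_access_cycles <= full_redirect_cycles")
    return RedirectTiming(
        bubble_cycles=btac_access_cycles - 1,
        cycles_saved=full_redirect_cycles - btac_access_cycles,
    )


print(btac_timing(4, 2))  # RedirectTiming(bubble_cycles=1, cycles_saved=2)
print(btac_timing(4, 3))  # RedirectTiming(bubble_cycles=2, cycles_saved=1)
```

Under this model, any multi-cycle BTAC access leaves at least one bubble cycle. As the access latency rises, for example because a larger BTAC or a higher clock frequency requires more pipeline stages for the lookup, the bubble grows and the advantage over the full redirect sequence shrinks.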