1. Field of the Invention
The present invention relates generally to processors having a branch prediction unit, and more particularly to a branch target buffer (BTB) within the branch prediction unit. The invention is adapted for use with processors having pipelined architectures, including both single and superscalar pipelined architectures.
2. Description of the Related Art
Early computers were designed to finish processing one instruction before beginning the next instruction in a sequence of instructions. Over the years, major architectural advances have been made to these early designs that dramatically improve performance. Pipelined and superscalar architectures are examples of these advances.
Processor performance has also been enhanced by the use of caches. Caches are memory elements used to store and provide frequently used information. The term “information” broadly includes both data and instructions. Caches typically provide information in a single clock cycle, as compared with conventional memory access operations which may require several cycles. The so-called Branch Target Buffer (“BTB”) is one example of a processor cache.
The utility of a BTB is readily understood in the context of a pipelined architecture. “Pipelining” (which, together with branch prediction, underlies so-called “speculative execution”) is a term generally referring to an operating approach in which a sequence of instructions is processed using a series of functional steps or processing stages. Each processing stage usually accomplishes its constituent operation(s) within a single clock cycle.
Unlike a non-pipelined processor that processes each instruction to completion before beginning the next instruction, a pipelined processor simultaneously processes several instructions in different processing stages of the pipeline. Pipeline stages may be arbitrarily designated by a designer, but generally include instruction fetching, instruction decoding, instruction execution, and execution resolution stages.
An instruction fetch stage retrieves an instruction from wherever it is currently stored (e.g., a main system memory or an instruction queue). Once fetched, the instruction is passed to a decoder stage that typically determines an instruction address and/or instruction operand. From the decoder stage, the instruction passes to an execution stage that executes one or more operations indicated by the instruction. The execution resolution stage generally involves writing back the results (e.g., results data) generated by execution of the instruction to one or more registers or memories for later use.
An instruction preferably passes from one pipeline stage to the next during a prescribed period. Thus during a first period, the instruction fetch stage fetches a first instruction from storage and aligns it within associated hardware register(s) for decoding. During a second period, the instruction fetch stage fetches a second instruction from storage and aligns it, while the instruction decoder stage decodes the first instruction. During a third period, the first instruction initiates an execution operation (e.g., a logical, mathematical, addressing, or indexing operation) in the execution stage, while the instruction decoder stage decodes the second instruction, and the instruction fetch stage fetches a third instruction. Pipelining continues through execution resolution, and in this manner the overall operating speed of the processor is dramatically improved over non-pipelined architectures.
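By way of illustration only, the period-by-period overlap described above may be sketched as follows. The four stage names follow the text; the function names `pipeline_schedule` and `non_pipelined_periods` are hypothetical, and the sketch assumes an idealized pipeline in which every stage takes exactly one period and no stalls or flushes occur.

```python
# Idealized model: one period per stage, no stalls, no flushes (assumption).
STAGES = ["fetch", "decode", "execute", "resolve"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return a list with one dict per period, mapping stage -> instruction number.

    Instructions are numbered from 1. Instruction i enters the first stage in
    period i and advances one stage per period thereafter.
    """
    total_periods = num_instructions + len(stages) - 1
    schedule = []
    for period in range(1, total_periods + 1):
        occupancy = {}
        for stage_index, stage in enumerate(stages):
            instruction = period - stage_index
            if 1 <= instruction <= num_instructions:
                occupancy[stage] = instruction
        schedule.append(occupancy)
    return schedule


def non_pipelined_periods(num_instructions, stages=STAGES):
    """Periods needed when each instruction completes before the next begins."""
    return num_instructions * len(stages)


if __name__ == "__main__":
    for period, occupancy in enumerate(pipeline_schedule(3), start=1):
        print(period, occupancy)
```

In this model, three instructions complete in six periods when pipelined, as against twelve periods when each instruction is processed to completion before the next begins, and the third period shows the first instruction executing while the second is decoded and the third is fetched.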
In a superscalar architecture, two or more instructions may be processed and/or executed simultaneously. That is, superscalar systems have two or more execution (or decode/execution) paths capable of simultaneously and independently executing a plurality of instructions in parallel. Scalar systems may only execute one instruction per period, whether the instruction emerges from a pipelined sequence of instructions, or is executed in a non-pipelined manner. The simultaneous execution of multiple instructions further increases the performance of a processor.
Pipelining provides unquestioned performance benefits, so long as the sequence of instructions to be processed remains highly linear, or predictable. Unfortunately, most instruction sequences contain numerous instructions capable of introducing non-sequential execution paths. So-called “branch instructions” (including, for example, jump, return, and conditional branch instructions) produce a significant performance penalty in a pipelined processor, unless an effective form of branch prediction is implemented. The performance penalty arises where an unpredicted (or erroneously predicted) branch instruction causes a departure from the sequence of instructions currently pipelined within the processor. Where this occurs, the currently pipelined sequence of instructions must be thrown out or “flushed,” and a new sequence of instructions must be loaded into the pipeline. Pipeline flushes waste numerous clock cycles and generally slow the execution of the processor.
One way to increase execution performance associated with a branch instruction is to predict the outcome of the branch instruction, and insert a predicted instruction into the pipeline immediately following the branch instruction. If such a branch prediction mechanism is successfully implemented in a processor, then the performance penalty associated with pipeline flushes is incurred only if the branch instruction outcome is incorrectly predicted. Fortunately, conventional techniques and analyses have determined that the outcome of many branch instructions can be correctly predicted with a high degree of certainty—approaching 80% for some applications.
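The effect of prediction accuracy on the flush penalty may be estimated with a simple first-order model, offered here only as a sketch. The model assumes a base rate of one cycle per instruction and a fixed flush penalty per mispredicted branch; the function name `effective_cpi` and the example values for branch frequency (20%) and flush penalty (5 cycles) are hypothetical and do not come from any particular processor. Only the 80% accuracy figure is taken from the text.

```python
def effective_cpi(branch_fraction, prediction_accuracy,
                  flush_penalty_cycles, base_cpi=1.0):
    """First-order estimate of average cycles per instruction.

    branch_fraction      -- fraction of instructions that are branches (0..1)
    prediction_accuracy  -- fraction of branches predicted correctly (0..1)
    flush_penalty_cycles -- cycles lost per mispredicted branch (assumed fixed)
    base_cpi             -- cycles per instruction with no flushes (assumed 1.0)
    """
    mispredicted_per_instruction = branch_fraction * (1.0 - prediction_accuracy)
    return base_cpi + mispredicted_per_instruction * flush_penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 5-cycle flush penalty.
    print(effective_cpi(0.2, 0.0, 5))  # every branch flushes the pipeline
    print(effective_cpi(0.2, 0.8, 5))  # 80% of branches predicted correctly
```

Under these assumed values, the model gives about 2.0 cycles per instruction when every branch causes a flush and about 1.2 when 80% of branches are predicted correctly.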
As a result, several conventional types of branch prediction mechanisms have been developed. One type of branch prediction mechanism uses a BTB to store numerous data entries, wherein each data entry is associated with a branch instruction. The BTB thus stores a number of so-called “branch address tags,” each branch address tag serving as an index of sorts for a corresponding branch instruction. In addition to the branch address tag, each BTB entry may further include a target address, an instruction opcode, branch history information, and possibly other data. In a processor utilizing a BTB, the branch prediction mechanism monitors each instruction entering the pipeline. Usually, the instruction address is monitored, and where the instruction address matches an entry in the BTB, the instruction is identified as a branch instruction. From associated branch history information, the branch prediction mechanism determines whether or not the branch is likely to be taken. Branch history information is typically determined by a state machine that monitors each branch instruction indexed in the BTB and defines data stored as branch history information in relation to whether or not the branch has been taken in preceding operations.
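A minimal software sketch of such a BTB is given below for illustration. The text does not specify the state machine; the sketch assumes the commonly used two-bit saturating counter with four states. The class and method names (`BTBEntry`, `BranchTargetBuffer`, `lookup`, `predict`, `update`), the direct-mapped organization, the use of the full instruction address as the branch address tag, and the 4-byte instruction size are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed two-bit saturating counter states.
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


@dataclass
class BTBEntry:
    tag: int                         # branch address tag (full address here)
    target: int                      # target address of the branch
    history: int = WEAKLY_NOT_TAKEN  # branch history information


class BranchTargetBuffer:
    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, address):
        # Direct-mapped; assumes 4-byte instructions, so the low 2 bits are dropped.
        return (address >> 2) % self.num_entries

    def lookup(self, address):
        """Return the entry whose tag matches the instruction address, else None."""
        entry = self.entries[self._index(address)]
        if entry is not None and entry.tag == address:
            return entry
        return None

    def predict(self, address):
        """Return the predicted target address, or None to predict fall-through."""
        entry = self.lookup(address)
        if entry is not None and entry.history >= WEAKLY_TAKEN:
            return entry.target
        return None

    def update(self, address, taken, target):
        """Record the resolved outcome of the branch at `address`."""
        entry = self.lookup(address)
        if entry is None:
            # Allocate an entry (replacing any conflicting one) on first resolution.
            initial = WEAKLY_TAKEN if taken else WEAKLY_NOT_TAKEN
            self.entries[self._index(address)] = BTBEntry(address, target, initial)
        elif taken:
            entry.history = min(entry.history + 1, STRONGLY_TAKEN)
            entry.target = target
        else:
            entry.history = max(entry.history - 1, STRONGLY_NOT_TAKEN)
```

In this sketch a tag match identifies the instruction as a branch, and the stored history alone decides whether the stored target is offered as the prediction. A hardware BTB would typically store only part of the address as the tag and might be set-associative.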
Where the branch history information indicates that the branch instruction is likely to be taken, one or more predicted instruction(s) is inserted in the pipeline. Conventionally, each BTB data entry includes opcode(s) associated with the branch instruction being evaluated in relation to its branch history information. Upon an appropriate indication from the branch prediction mechanism, these opcode(s) may be inserted directly into the pipeline. Also, each BTB data entry includes a “target address” associated with the branch instruction being evaluated. Again, upon an appropriate indication from the branch prediction mechanism, this target address is output by the branch prediction unit as a “predicted address” and used to fetch the next instruction in the instruction sequence.
Processing of the branch instruction and its succeeding instructions proceeds through the pipeline for several periods until the branch instruction has been executed in the execution stage. It is only at this point that accuracy of the branch prediction becomes known. If the outcome of the branch instruction was correctly predicted, the branch target address has already been moved through the pipeline in its proper order, and processor execution may continue without interruption. However, if the outcome of the branch instruction was incorrectly predicted, the pipeline is flushed and a correct instruction or instruction sequence is inserted in the pipeline. In a superscalar processor, which has two or more pipelines processing multiple instruction sequences, the performance penalty caused by an incorrect branch prediction is even greater because, in most cases, at least twice the number of instructions must be flushed.
FIG. 1 illustrates a conventional BTB 1 connected to branch prediction logic 2 and related hardware. BTB 1 generally comprises an instruction address decoder 3, a memory array 4, and a sense amplifier 5. Address decoder 3 receives an instruction address from an instruction fetch unit and selects a word line associated with the decoded instruction address. Word line selection is performed in a conventional manner, and typically includes applying a word line voltage to the selected word line. As is customary, a plurality of word lines extend from address decoder 3 through memory array 4 in a row-wise manner.
Memory array 4 comprises numerous memory cells each storing at least one data bit. Data entries, each comprising a number of data bits, are conveniently stored in rows such that selection of a particular word line essentially accesses a corresponding data entry. Data entries include at least one data field defining a branch address tag and another data field defining a target address. A word line-selected data entry is conventionally output from memory array 4 through sense amplifier 5.
From sense amplifier 5, the branch address tag is communicated to a tag compare register 6 which also receives the instruction address. The target address from sense amplifier 5 is communicated to multiplexer 7, along with an address associated with the non-branching sequence of instructions (e.g., a program counter value +4 for a 32-bit instruction word processor). One of these two multiplexer inputs is selected for communication to the instruction queue (shown here as a PC multiplexer 8) by operation of a logic gate 9 receiving results from tag compare register 6 and a Taken/Not-Taken indication from branch prediction logic 2.
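The selection performed by tag compare register 6, logic gate 9, and multiplexer 7 may be modeled as below. This is only a behavioral sketch: the text does not state the type of logic gate 9, and the sketch assumes it behaves as an AND gate. The function name `next_fetch_address` and its parameter names are hypothetical.

```python
def next_fetch_address(instruction_address, stored_tag, stored_target,
                       predicted_taken, instruction_bytes=4):
    """Behavioral model of the FIG. 1 address selection path.

    stored_tag and stored_target stand for the data entry read out through
    the sense amplifier for the selected word line.
    """
    # Tag compare register 6: does the stored tag match the fetched address?
    tag_hit = stored_tag == instruction_address
    # Logic gate 9 (assumed AND): select the target only on a hit that is
    # also predicted taken.
    select_target = tag_hit and predicted_taken
    # Multiplexer 7: target address, or the next sequential address (PC + 4).
    if select_target:
        return stored_target
    return instruction_address + instruction_bytes
```

For example, with an instruction address of 0x100 whose stored tag matches and whose stored target is 0x200, the function returns 0x200 when the branch is predicted taken and 0x104 otherwise. A tag mismatch also yields 0x104 regardless of the prediction.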
This type of conventional BTB suffers from a number of drawbacks. First, left in the configuration shown in FIG. 1, the memory array of the BTB will be accessed by every branch instruction, without regard to the likely outcome of the instruction. BTB access typically involves executing a conventional READ operation in relation to the word line selected by the address decoder. Each READ operation draws power from a supply in order to energize the plurality of memory cells associated with the selected word line and output data from these memory cells.
In response to this wasteful state of affairs, other conventional BTB designs have integrated an enable line into the memory array design. U.S. Pat. No. 5,740,417 is an example. As shown in this document, the proposed BTB includes an enable line which enables or disables word line drivers associated with the memory array during READ operations. The word line drivers are enabled or disabled on the basis of a Taken/Not-Taken state which predicts whether or not a particular branch instruction is likely to be taken. For example, where a Taken/Not-Taken state for a particular instruction indicates a “strongly not taken” state, the enable line transitions to an inactive level, thereby disabling the memory array word line drivers.
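The enable-line idea may be illustrated with the following sketch, which simply counts how many READ operations actually reach the memory array. It is not a model of the cited patent's circuit. The names `GatedMemoryArray` and `word_line_enable`, the four-state encoding, and the choice to disable the drivers only in the “strongly not taken” state are assumptions drawn from the example given in the text.

```python
# Assumed four-state Taken/Not-Taken encoding.
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


def word_line_enable(taken_state):
    """Enable signal: inactive only for the 'strongly not taken' state."""
    return taken_state != STRONGLY_NOT_TAKEN


class GatedMemoryArray:
    """Memory array whose word line drivers can be disabled by an enable line."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.read_count = 0  # READ operations that actually energized a word line

    def read(self, row, enable=True):
        if not enable:
            # Word line drivers disabled: no cells are energized, no data is output.
            return None
        self.read_count += 1
        return self.rows[row]


if __name__ == "__main__":
    array = GatedMemoryArray([(0x100, 0x200)])  # one (tag, target) data entry
    for state in (STRONGLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_NOT_TAKEN):
        array.read(0, enable=word_line_enable(state))
    print(array.read_count)
```

Note that the sketch takes the Taken/Not-Taken state as an input. Obtaining that state before the array is read is a separate step that the sketch does not model.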
Unfortunately, this conventional approach to saving power during BTB access operations comes with high overhead. Namely, the branch prediction mechanism generating the enable signal requires both time and resources to “pre-decode” the instruction, determine its branch history data and Taken/Not-Taken state, and thereafter change, as necessary, the level of the enable signal.
As the rate of instruction execution and the depth of instruction pipelines increase, the accuracy and speed of branch predictions become increasingly important. Recognizing this importance, many conventional processors incorporate extensive pre-decoding schemes whereby all instructions are evaluated, branch instructions are identified, and branch history information is either retrieved or dynamically calculated in relation to a branch instruction currently being evaluated and pre-decoded. Needless to say, such approaches to predicting branch instruction behavior take considerable time and require significant additional resources. Additional delay and complexity in the processing of instruction sequences are not desirable attributes in a branch prediction mechanism. Yet, this is exactly what many conventional approaches provide.
The issue of power consumption further complicates the design of competent branch prediction mechanisms. Not surprisingly, contemporary processors are finding application in a range of devices characterized by serious constraints upon power consumption. Laptop computers and mobile devices, such as handsets and PDAs, are ready examples of devices incorporating processors preferably consuming a minimum of power.
As noted above, the BTB is a cache type memory storing a potentially large number of data entries. Thus, the BTB has at its core a memory array, and preferably a volatile memory array. Such memory arrays, and particularly ones sufficiently large to store numerous data entries, are notoriously power hungry. Each access to the BTB by the branch prediction mechanism implicates a “READ” operation to the BTB memory array. It is generally agreed that BTB access operations are increasing, and some estimates suggest that READ operations to the BTB memory array account for as much as 10% of the overall power consumption in conventional processors.
Clearly, a better approach to the implementation of branch prediction mechanisms within emerging processors is called for. Conventional approaches requiring lengthy, real-time evaluation of branch instructions and/or the dynamic retrieval or calculation of branch history information are too complex and slow. Further, the power consumed by the incessant BTB memory array accesses that many conventional approaches require is simply wasteful.