Many modern computing systems utilize a processor having a pipelined architecture to increase instruction throughput. In theory, pipelined processors can execute one instruction per machine cycle when a well-ordered, sequential instruction stream is being executed. This is accomplished even though the instruction itself may implicate or require a number of separate micro-instructions to be effectuated. Pipelined processors operate by breaking up the execution of an instruction into several stages that each require one machine cycle to complete. For example, in a typical system, an instruction could require many machine cycles to complete (fetch, decode, ALU operations, etc.). Latency is reduced in pipelined processors by initiating the processing of a second instruction before the actual execution of the first instruction is completed. In the above example, in fact, multiple instructions can be in various stages of processing at any given time. Thus, the overall instruction execution latency of the system (which, in general, can be thought of as the delay between the time a sequence of instructions is initiated, and the time it is finished executing) can be significantly reduced.
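The effect of overlapping instructions can be illustrated with a short calculation. The following is a minimal sketch only, assuming an idealized pipeline in which every stage takes exactly one machine cycle and no stalls occur; the function names and the five-stage, 100-instruction figures are hypothetical and are not taken from any particular processor.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed in an idealized, stall-free pipeline.

    The first instruction needs num_stages cycles to pass through the pipeline;
    each later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    # Hypothetical example: 100 sequential instructions on a 5-stage pipeline.
    print(unpipelined_cycles(100, 5))
    print(pipelined_cycles(100, 5))
```

Under these assumptions, the cycle count for a long sequential stream approaches one cycle per instruction, which is the theoretical figure mentioned above.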
The above architecture works well when program execution follows a sequential flow path. In other words, this model is premised on a sequential model of program execution, where each instruction to be executed is usually the one that immediately follows, in memory, the one just executed. A critical requirement and feature of programs, however, is the ability to “branch” or re-direct program execution flow to another set of instructions; using branch instructions, a conditional transfer of control can be made to some other path in the executing program different from the current one. This path may or may not coincide with the next immediate set of instructions following the instruction that was just executed.
In general, prior art processors have a single address register for instructions that are to be executed, including a branch target address. The branch target address is an address indicating the destination address of the branch instruction. The branch instruction is executed quickly by the processor if the correct target address for the branch instruction is already stored in the address register. However, branch instructions can occur arbitrarily within any particular program, and it is not possible to predict with certainty ahead of time whether program flow will be re-directed. Various techniques are known in the art for guessing about the outcome of a branch instruction, so that, if flow is to be directed to another set of instructions, the correct target address can be pre-calculated, and a corresponding set of instructions can be prefetched and loaded in advance from memory to reduce memory access latencies. In general, since memory accesses are much slower than pipeline operations, execution can be delayed pending retrieval of the next instruction.
Sometimes, however, the guess about the branch outcome is incorrect, and this can cause a “bubble”, or a pipeline stall. A bubble or stall occurs, in general, when the pipeline contains instructions that do not represent the desired program flow (e.g., as a result of an incorrectly predicted branch outcome). A significant time penalty is thus incurred from having to squash the erroneous instructions, flush the pipeline and re-load it with the correct instruction sequence. Depending on the size of the pipeline, this penalty can be quite large; to a significant degree, therefore, the desire for long pipeline designs (to increase effective instruction throughput) is counterbalanced by the stall penalty that occurs when such a pipeline has to be flushed and re-loaded. Thus, significant effort has been expended in researching, designing and implementing intelligent mechanisms for reducing branch instruction latency.
To analyze branch instruction latency, it is helpful to think of a branch instruction as consisting of three operational steps:
    (1) deciding the branch outcome;
    (2) calculating the branch target address (i.e., the location of the instruction that needs to be loaded);
    (3) transferring control so that the correct instruction is executed next.
In most systems, steps (1) and (2) must be resolved in this order by a branch instruction. Branch instructions also fall generally into two classes: conditional, and unconditional. When the branch is always taken it is referred to as an unconditional branch, and the above three operational steps are not required. A conditional branch is taken depending on the result of step (1) above. If the branch is not taken, the next sequential instruction is fetched and executed. If the branch is taken, the branch target address is calculated at step (2), and then control is transferred to such path at step (3). A good description of the state of the art in branch prediction can be found generally in section 4.3 of a textbook entitled Computer Architecture: A Quantitative Approach, 2nd edition, by Patterson and Hennessy; pages 262–278 are incorporated by reference herein.
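The three operational steps, and the difference between conditional and unconditional branches, can be sketched as follows. This is an illustrative sketch only: the function name `resolve_branch`, the PC-relative displacement, and the 4-byte instruction size are assumptions made for the example and are not drawn from any of the references discussed herein.

```python
from typing import Optional


def resolve_branch(pc: int, displacement: int, condition: Optional[bool],
                   instr_size: int = 4) -> int:
    """Return the address of the next instruction to execute after a branch at pc.

    condition is None for an unconditional branch (always taken); otherwise it
    is the already-evaluated branch condition.
    """
    # Step (1): decide the branch outcome.
    taken = True if condition is None else condition
    if not taken:
        # Branch not taken: fall through to the next sequential instruction.
        return pc + instr_size

    # Step (2): calculate the branch target address
    # (PC-relative addressing is assumed here for illustration).
    target = pc + displacement

    # Step (3): transfer control by making the target the next address.
    return target


if __name__ == "__main__":
    print(hex(resolve_branch(0x100, 0x40, True)))   # conditional, taken
    print(hex(resolve_branch(0x100, 0x40, False)))  # conditional, not taken
    print(hex(resolve_branch(0x100, 0x40, None)))   # unconditional
```

In this sketch the target address is only computed once the outcome is known, mirroring the ordering of steps (1) and (2) described above.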
In general, the number of penalty cycles associated with a branch instruction can be broken down into two categories: (1) fetch latency of the target instruction from decode of branch; this generally refers to the time required to fetch and place the target instruction of the branch into the pipeline after it has been identified; (2) latency of the branch condition generation; this refers generally to the process by which it is determined if the branch is actually taken or not-taken. Within a particular system it is usually more important to reduce category (1) penalties since they affect both conditional and unconditional branches, while the category (2) penalties are only associated with conditional branches. Moreover, category (2) penalties can be ameliorated to some extent by well-known techniques, including branch prediction. For example, in U.S. Pat. No. 5,742,804 to Yeh et al., also incorporated by reference herein, a compiler inserts a “branch prediction instruction” sometime before an actual branch instruction. This prediction instruction also specifies the target address of the branch, to further save execution time. Instructions are pre-fetched in accordance with the hint provided by the prediction instruction, so that they will be ready for execution when control is transferred. The prediction itself on the branch outcome is made based on information acquired by the compiler at run time. Yeh does not appear to handle mis-predictions efficiently, however, and these “misses” can be costly from a branch penalty perspective. Accordingly, the approach shown there also appears to have serious limitations.
Looking more specifically at the breakdown of the category (1) time penalty within a particular pipelined computing system, it can be seen to consist of the following: reading the branch operand (0 to 1 cycles); calculating the branch target address (1–2 cycles); and accessing the instruction cache and putting the target instruction into the decode stage of the pipeline (1–2 cycles). Thus, in a worst case scenario, a branch instruction latency of 5 cycles can be incurred. In some types of programs where branch instructions are executed with some regularity (e.g., 20% of the time) it is apparent that the average branch instruction penalty can be quite high (up to an average of 1 cycle per instruction, i.e., 20% of 5 cycles).
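The arithmetic behind these figures can be checked with a short calculation. The sketch below uses only the cycle ranges and the 20% branch frequency quoted above; the dictionary and function names are hypothetical and are introduced solely for illustration.

```python
# (minimum, maximum) cycles for each component of the category (1) penalty,
# as quoted in the text above.
CATEGORY_1_CYCLES = {
    "read branch operand": (0, 1),
    "calculate branch target address": (1, 2),
    "access instruction cache and load decode stage": (1, 2),
}


def best_case_latency(components: dict) -> int:
    """Sum of the minimum cycle counts of all components."""
    return sum(low for low, _high in components.values())


def worst_case_latency(components: dict) -> int:
    """Sum of the maximum cycle counts of all components."""
    return sum(high for _low, high in components.values())


def average_penalty_per_instruction(branch_fraction: float,
                                    cycles_per_branch: float) -> float:
    """Average penalty cycles per instruction, given the fraction of branches."""
    return branch_fraction * cycles_per_branch


if __name__ == "__main__":
    worst = worst_case_latency(CATEGORY_1_CYCLES)
    print(worst)
    print(average_penalty_per_instruction(0.20, worst))
```

With the worst-case latency of 5 cycles and a branch on every fifth instruction, the average penalty works out to about one cycle per instruction, consistent with the figure stated above.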
Various mechanisms have been proposed for minimizing the actual execution time latency for branch instructions. For instance, one approach used in the prior art is to compute the branch address while the branch instruction is decoded. This can reduce the average number of cycles per branch instruction, but comes at the cost of an additional address adder; this consumes area and power that are preferably used for other functions.
Another approach used in the prior art consists of a target instruction history buffer. An example of this is shown in U.S. Pat. Nos. 4,725,947, 4,763,245 and 5,794,027 incorporated by reference. In this type of system, each target instruction entry in a history buffer is associated with a program counter of a branch instruction executed in the past. When a branch is executed, an entry is filled by the appropriate target instruction. The next time the branch is in the decoding stage, the branch target instruction can be prepared by matching the program counter to such entry in the history buffer. To increase the useful hit ratio of this approach, a large number of entries must be kept around, and for a long time. This, too, requires an undesirable amount of silicon area and power. Moreover, the matching mechanism itself can be a potential source of delay if there are a large number of entries to compare against.
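The general behavior of such a history buffer can be sketched in software as follows. This is an illustrative model only and does not reproduce the design of any of the cited patents: the class name, the fixed capacity, the oldest-entry eviction policy, and the representation of instructions as strings are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Optional


class TargetInstructionHistoryBuffer:
    """Maps the program counter of a past branch to its target instruction."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._entries: "OrderedDict[int, str]" = OrderedDict()

    def record(self, branch_pc: int, target_instruction: str) -> None:
        """Fill an entry when a branch is executed."""
        if branch_pc in self._entries:
            # Refresh an existing entry so it becomes the newest one.
            self._entries.pop(branch_pc)
        elif len(self._entries) >= self._capacity:
            # Assumed policy: evict the oldest entry when the buffer is full.
            self._entries.popitem(last=False)
        self._entries[branch_pc] = target_instruction

    def lookup(self, branch_pc: int) -> Optional[str]:
        """At decode time, match the program counter against stored entries."""
        return self._entries.get(branch_pc)


if __name__ == "__main__":
    buf = TargetInstructionHistoryBuffer(capacity=2)
    buf.record(0x100, "add r1, r2, r3")
    buf.record(0x200, "sub r4, r5, r6")
    print(buf.lookup(0x100))   # hit: entry recorded earlier
    buf.record(0x300, "mul r7, r8, r9")
    print(buf.lookup(0x100))   # miss: entry was evicted to make room
```

The small capacity in the usage example shows why the hit ratio depends on the number of entries retained: once an entry is evicted, the next occurrence of that branch cannot be serviced from the buffer.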
Yet another approach is discussed in the following: (1) an article titled “Implementation of the PIPE Processor,” by Farrens and Pleszkun on pages 65–70 of the January 1991 edition of the journal Computer; and (2) an article titled “A Simulation Study of Architectural Data Queues and Prepare-To-Branch Instruction,” by Young and Goodman on pages 544–549 of the October 1984 IEEE International Conference on Computer Design: VLSI in Computers, both of which are hereby incorporated by reference. In the scheme described in these references, a form of delayed branch is proposed by using a prepare-to-branch (PTB) instruction. The PTB instruction is inserted before the branch instruction, decides the branch outcome, and then specifies a delay before transfer of control. By ensuring that the delay is sufficiently large to guarantee the branch condition will have been evaluated when the instruction is completed, the pipeline is kept full. A problem with this approach, however, lies in the fact that the latency caused by the target address calculation (step (2)) cannot be entirely accommodated, because it can be quite large. U.S. Pat. No. 5,615,386 to Amerson et al., also incorporated by reference herein, also specifies the use of a PTB instruction. This reference also mentions that branch execution can be improved by separating the target address calculation (step (2)) from the comparison operation (step (1)). By computing the branch address out of order, latencies associated with branches can be further reduced.
This reference discusses a number of common approaches, but is limited by the fact that: (1) it does not use a folded compare approach; thus separate compare and branch instructions are required, and this increases code size, dynamic execution time, etc.; (2) the compare result must be recognized by way of an internal flag, instead of a register, and this reduces flexibility; (3) without using a register, such as a link register, execution of function subroutines is more challenging because it is more difficult to save/switch contexts; (4) the disclosure also relies on a kind of complex nomination process, whereby the execution of a loop affects the prediction weighting for a subsequent related loop.
A related problem in the art arises from the fact that there are often multiple branches included in the program flow. In such case, it is necessary to update the target address in the address register for each branch instruction. This updating requires additional time and thus slows down program execution.