The processing of a single instruction by a computer system is performed in a number of different stages, such as instruction cache fetch, instruction decode and instruction execution. Many modern computer systems utilize an instruction pipeline to increase the rate at which instructions are processed. In a pipelined computer design, the various stages of instruction processing are arranged in series so that each one of the stages can process an instruction independently of the other stages during each cycle of pipeline operation and transmit its processing results to a next succeeding stage in the series for processing in a subsequent cycle. Each stage thus receives as an input the output of a preceding stage of the series.
In this manner, a computer system does not have to wait for an instruction to be completely processed before fetching and processing a next instruction. For example, if the instruction pipeline comprises three stages, a first instruction that has been processed in previous cycles by each of the first and second stages will be processed in a current cycle by the third stage. During the same cycle, the second stage can process a second instruction that has already been processed by the first stage, and the first stage can process a third instruction, and so on. Pipelining of instructions is a much more efficient method of processing instructions in comparison with waiting for a single instruction to be completely processed before beginning the processing of a second instruction.
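The three-stage example above can be illustrated with a minimal sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls occur; the function names `pipeline_schedule` and `unpipelined_cycles` are hypothetical and chosen only for this illustration.

```python
def pipeline_schedule(num_instructions, num_stages=3):
    """Return, for each cycle, the instruction index held by each stage.

    Assumes an idealized pipeline: instruction i enters stage 0 in cycle i
    and advances one stage per cycle. An empty stage is shown as None.
    """
    total_cycles = num_instructions + num_stages - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(num_stages):
            instr = cycle - stage  # instruction occupying this stage now
            row.append(instr if 0 <= instr < num_instructions else None)
        schedule.append(row)
    return schedule


def unpipelined_cycles(num_instructions, num_stages=3):
    """Cycles needed when each instruction must finish before the next starts."""
    return num_instructions * num_stages


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(3, 3)):
        print(cycle, row)
    print("unpipelined:", unpipelined_cycles(3, 3))
```

Under these assumptions, the cycle in which the third stage processes the first instruction is the same cycle in which the second stage holds the second instruction and the first stage holds the third instruction.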
In a normal flow of a computer program, it is easy to know which instruction is to enter the pipeline next. In most instances, it is the next sequentially numbered instruction in the program that is to be processed so that, for example, instruction 101 will enter the pipeline in the cycle after instruction 100. An exception to this normal flow of control within a computer program is a branch instruction that instructs the computer system to fetch a next instruction that is out of the normal sequence of the numbered instructions.
For example, instruction 101 may be a conditional branch instruction that instructs the computer system to process instruction 200 if a certain condition is satisfied and to process instruction 102 if the condition is not satisfied. Accordingly, the next instruction to enter the pipeline will not be known until instruction 101 is processed by the execution stage of the pipeline to determine the status of the condition for selection of the next instruction. This results in a "bubble" in the pipeline behind the branch instruction since additional instructions cannot be entered into the pipeline during subsequent cycles until the branch instruction has flowed to the execution stage, which is typically at the end of the pipeline, and the next instruction, 200 or 102, that is to enter the pipeline becomes known.
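The size of such a bubble can be sketched as follows. The sketch assumes that no prediction is made, that fetch stalls behind every branch, and that the next instruction becomes known at the end of the cycle in which the branch occupies the execution stage. The function name `fetch_cycles` and the parameter `execute_stage_index` (the zero-based position of the execution stage) are hypothetical.

```python
def fetch_cycles(is_branch_flags, execute_stage_index):
    """Return the cycle in which each instruction enters the pipeline.

    is_branch_flags: one boolean per instruction, True for a branch.
    A branch fetched in cycle c reaches the execution stage in cycle
    c + execute_stage_index, so the following instruction cannot be
    fetched before cycle c + execute_stage_index + 1.
    """
    cycles = []
    next_fetch = 0
    for is_branch in is_branch_flags:
        cycles.append(next_fetch)
        stall = execute_stage_index if is_branch else 0
        next_fetch += 1 + stall
    return cycles


if __name__ == "__main__":
    # Instructions 100 (non-branch), 101 (branch), then 200 or 102,
    # in a three-stage pipeline whose last stage is the execution stage.
    print(fetch_cycles([False, True, False], execute_stage_index=2))
```

With the execution stage as the last of three stages, the instruction following the branch enters two cycles later than it would in sequential flow; those two empty cycles are the bubble.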
To minimize bubbles, the prior art has provided branch prediction mechanisms to predict, early in the pipeline, for example at the instruction decode stage, whether a branch will be taken and to fetch the predicted instruction from the instruction cache. Typically, the execution stage includes a device, such as a comparator, to compare each instruction input to the execution stage to the instruction that should be executed. Thus, if the branch prediction mechanism mispredicts the branch, the execution stage comparison will detect the wrong instruction at its input and issue a signal to the branch prediction mechanism to fetch the proper instruction. The pipeline is then backed up to the branch instruction, so that the proper branched-to instruction follows the branch instruction into the pipeline.
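The comparison performed at the execution stage can be sketched in simplified form. The sketch assumes that instructions are identified by their sequence numbers, as in the instruction 101 example above; the function names `actual_next` and `check_at_execute` are hypothetical and do not describe any particular comparator circuit.

```python
def actual_next(branch_number, target_number, condition_satisfied):
    """Instruction that should follow a conditional branch once it is resolved."""
    return target_number if condition_satisfied else branch_number + 1


def check_at_execute(predicted_next, branch_number, target_number,
                     condition_satisfied):
    """Compare the predicted instruction with the one that should execute.

    Returns (correct_next, must_back_up). must_back_up is True when the
    prediction was wrong and the pipeline has to be backed up to the branch.
    """
    correct = actual_next(branch_number, target_number, condition_satisfied)
    return correct, predicted_next != correct


if __name__ == "__main__":
    # Branch 101 jumps to 200 when its condition is satisfied.
    print(check_at_execute(102, 101, 200, condition_satisfied=True))
    print(check_at_execute(200, 101, 200, condition_satisfied=True))
```

In the first call the flow through instruction 102 was predicted but the condition is satisfied, so the comparison reports a misprediction and supplies instruction 200 as the proper instruction to fetch.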
As should be understood, the speedup in the operation of the pipeline accomplished by the use of the branch prediction mechanism will be a function of the accuracy of the branch predictions made by the mechanism. However, despite the elimination of the relatively large bubbles for all correct conditional branch predictions, there is a certain amount of latency introduced into the pipeline by the branch prediction mechanism. More specifically, an index to the instruction cache for a next cycle of operation of the computer system pipeline is not available for input to the address input of the instruction cache until the branch prediction mechanism processes a current instruction to determine whether the current instruction is a branch instruction and thereafter to predict whether the branch is to be taken. This can take several cycles of pipeline operation for instruction cache fetch, instruction decode, and branch prediction, before an index for the next instruction is available for input to the address input of the instruction cache to continue pipeline operation.
The reduction in instruction cache bandwidth caused by the latency of the branch prediction mechanism can reduce the speed of operation of the execution stage. In other words, the pipeline might still not be able to deliver instructions to the execution stage as fast as the execution stage is able to process instructions since a small bubble will be introduced into the pipeline after each instruction fetch due to the latency of the branch prediction mechanism. The instruction cache itself can also introduce a latency into the pipeline since an advantageous size for the instruction cache may result in the need for several cycles of pipeline operation just to fetch an instruction.
Ideally, the pipeline should operate to deliver instructions to the execution stage at a rate that enables the execution stage to operate at its maximum speed. The total latency introduced into the pipeline by the instruction cache fetch and branch prediction has become a serious problem as the speed of instruction execution that can be achieved in an execution stage has increased. The execution stage will sit idle during each cycle that an instruction is not available for execution, resulting in a waste of computer resources. For example, if the total latency of instruction cache fetch and branch prediction is six nsec. and the execution stage can execute an instruction in two nsec., the execution stage will sit idle for four nsec. between the delivery of successive instructions.
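The arithmetic of the example above can be written out as a short sketch. It assumes that one instruction is delivered per fetch-and-predict interval and that the execution stage can begin as soon as an instruction arrives; the function names `idle_time_ns` and `utilization` are hypothetical.

```python
def idle_time_ns(fetch_predict_latency_ns, execute_time_ns):
    """Time the execution stage waits between successive instructions."""
    return max(0, fetch_predict_latency_ns - execute_time_ns)


def utilization(fetch_predict_latency_ns, execute_time_ns):
    """Fraction of time the execution stage is busy under the same assumptions."""
    interval = max(fetch_predict_latency_ns, execute_time_ns)
    return execute_time_ns / interval


if __name__ == "__main__":
    print(idle_time_ns(6, 2))   # idle time per instruction, in nsec.
    print(utilization(6, 2))    # busy fraction of the execution stage
```

For a six nsec. latency and a two nsec. execution time, the execution stage is idle for four nsec. per instruction and is therefore busy only one third of the time.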
In an attempt to minimize the branch prediction latency, a next instruction prediction approach has been suggested by the prior art as a partial solution. This approach simply assumes a flow through to the next-in-number instruction for each fetch and fetches that instruction prior to completion of processing by the branch prediction mechanism, so that a next instruction is available for input to the pipeline as soon as possible. In some prior art devices, this is implemented by fetching two instructions at a time.
In other words, the prior art approach always assumes that no branch is taken. The branch prediction mechanism would then do a comparison similar to the comparison done by the execution stage, to determine whether the instruction fetched from the instruction cache in each cycle is the instruction that was predicted by the branch prediction mechanism. Again, the pipeline would be backed up to the branch instruction if the next instruction prediction was incorrect. With this approach, the effect of the latency introduced by the branch prediction mechanism can be overcome, at least for each flow through after a branch instruction. While this scheme keeps the pipeline full for all sequential instructions, the instruction fetch stage derives no advantage from the branch prediction mechanism, which operates further downstream at the instruction decode stage.
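The always-flow-through scheme can be sketched as a simple count of back-ups. The sketch assumes that instructions are identified by sequence number and that the list passed in stands for the sequence of instructions selected by the branch prediction mechanism; the function name `sequential_fetch_backups` is hypothetical.

```python
def sequential_fetch_backups(predicted_sequence):
    """Count how often the next-in-number fetch disagrees with the predictor.

    predicted_sequence: instruction numbers in the order chosen by the
    branch prediction mechanism (an assumption made for this sketch).
    For each instruction, the fetch stage guesses number + 1; a back-up is
    counted whenever the predictor selected a different instruction.
    """
    backups = 0
    for current, following in zip(predicted_sequence, predicted_sequence[1:]):
        guessed = current + 1  # always assume no branch is taken
        if guessed != following:
            backups += 1
    return backups


if __name__ == "__main__":
    print(sequential_fetch_backups([100, 101, 102, 103]))  # purely sequential
    print(sequential_fetch_backups([100, 101, 200, 201]))  # 101 branches to 200
```

A purely sequential run incurs no back-ups, whereas every predicted-taken branch costs one, which is why the scheme helps only for flow through cases.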
Accordingly, the prior art has also suggested building a look-up table, typically as an extension to each instruction cache entry, indicating whether that instruction is a branch instruction and, if so, what the branch prediction mechanism predicted the last time the instruction was processed through the pipeline. The look-up information can, for example, comprise a pointer to the next instruction.
For each branch instruction, the pointer points either to a flow through instruction, when the branch prediction mechanism last predicted that the branch was not taken, or to the branched-to instruction, when the branch prediction mechanism last predicted that the branch was taken. For non-branch instructions, the pointer simply points to the next-in-number instruction (flow through). The look-up table is filled by using the branch prediction mechanism output as write data to the look-up table.
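The behavior of such a look-up table can be sketched as follows. The sketch uses a dictionary keyed by instruction number as a stand-in for the per-entry extension of the instruction cache, which is an assumption made for illustration only; the class name `NextPointerTable` and its methods are hypothetical.

```python
class NextPointerTable:
    """Maps an instruction number to the pointer for the next instruction."""

    def __init__(self):
        self._pointer = {}  # instruction number -> last predicted next

    def lookup(self, number):
        """Return the stored pointer, or flow through if none was written."""
        return self._pointer.get(number, number + 1)

    def update(self, number, predicted_next):
        """Write the branch prediction mechanism output into the table."""
        self._pointer[number] = predicted_next


if __name__ == "__main__":
    table = NextPointerTable()
    print(table.lookup(100))   # non-branch: next-in-number instruction
    table.update(101, 200)     # predictor last predicted branch taken
    print(table.lookup(101))
    table.update(101, 102)     # predictor later predicted not taken
    print(table.lookup(101))
```

Each write simply overwrites the previous pointer, so the table always reflects what the branch prediction mechanism predicted the last time the instruction passed through the pipeline.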
In this approach, however, the look-up table comprises an extension of the instruction cache. Thus, there is no speed advantage in the look-up operation and this scheme does not entirely eliminate the latency introduced into the pipeline by the instruction cache fetch and the branch prediction mechanism. In addition, classical branch prediction in the computer art is typically limited to branch taken and flow through predictions for a conditional branch. Thus, the look-up table would not contain information for other types of branches such as a subroutine return instruction. Moreover, the necessity of having an entry in the look-up table corresponding to each instruction in the instruction cache uses an inordinate amount of real estate on the chip or chips used to implement the pipeline.
Accordingly, there is a need for improvement in a scheme for predicting a next instruction index for the instruction cache, prior to completion of branch prediction processing of a previous instruction, so as to obtain an increase in instruction bandwidth sufficient to accommodate the speed of execution of the execution stage.