1. Field of the Invention
This invention is related to the field of microprocessors and, more particularly, to branch prediction mechanisms within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
An important feature of a superscalar microprocessor (and a superpipelined microprocessor as well) is its branch prediction mechanism. The branch prediction mechanism indicates a predicted direction (taken or not-taken) for a branch instruction, allowing subsequent instruction fetching to continue within the predicted instruction stream indicated by the branch prediction. A branch instruction is an instruction which causes subsequent instructions to be fetched from one of at least two addresses: a sequential address identifying an instruction stream beginning with instructions which directly follow the branch instruction; and a target address identifying an instruction stream beginning at an arbitrary location in memory. Unconditional branch instructions always branch to the target address, while conditional branch instructions may select either the sequential or the target address based on the outcome of a prior instruction. Instructions from the predicted instruction stream may be speculatively executed prior to execution of the branch instruction, and in any case are placed into the instruction processing pipeline prior to execution of the branch instruction. If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e. one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted instruction stream are discarded from the instruction processing pipeline and the number of instructions executed per clock cycle is decreased.
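The choice between the sequential address and the target address described above can be illustrated with a minimal sketch. The function name, its parameters, and the byte-length argument are hypothetical and serve only as an illustration; they are not taken from any particular instruction set.

```python
def next_fetch_address(branch_address, branch_length, target_address,
                       conditional, condition_met=False):
    """Return the address of the instruction stream that follows a branch.

    All names here are illustrative assumptions. `branch_length` is the
    size of the branch instruction in bytes.
    """
    # The sequential address identifies the instruction directly after the branch.
    sequential_address = branch_address + branch_length
    # Unconditional branches always go to the target; conditional branches
    # go to the target only when their condition is met.
    if not conditional or condition_met:
        return target_address
    return sequential_address
```

A branch prediction mechanism effectively guesses the value of `condition_met` before the prior instruction that determines it has executed.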
In order to be effective, the branch prediction mechanism must be highly accurate such that the predicted instruction stream is correct as often as possible. Typically, increasing the accuracy of the branch prediction mechanism is achieved by increasing the complexity of the branch prediction mechanism. For example, the history of branch predictions may be represented by one or more bits stored in the branch prediction unit. By recording the history of the behavior of a branch, more accurate predictions may be made as to its likely behavior in the future. Generally, by increasing the number of bits used to track branch prediction history, a more complete history may be recorded and the accuracy of branch predictions may likewise be increased.
Frequently, two bits are used to represent branch prediction history. Using two bits, four prediction states are typically represented: strongly taken, weakly taken, strongly not taken, and weakly not taken. By having four representative states, the relative tendencies of branch behavior may be recorded. If a branch is almost always taken, the predicted state will gravitate toward the strongly taken state. However, if a branch is taken only slightly more often than not, the predicted state will gravitate toward the weakly taken state. Likewise, if a branch is almost always not taken, the predicted state will gravitate toward the strongly not taken state. If the branch is not taken slightly more often than it is taken, the predicted state will gravitate toward the weakly not taken state.
Also common is the use of one bit to represent branch prediction history. Using one bit, two states are typically represented: taken and not taken. Because only one bit is used, the relative tendencies of a branch are not recorded. Either a branch is predicted taken or it is predicted not taken. Consequently, the accuracy of predictions is typically poorer than that of the two-bit mechanism, as will be discussed below.
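The difference in accuracy can be seen on a simple loop branch. The sketch below is illustrative only: the workload (a branch taken nine times and then not taken once, repeated ten times), the initial states, and the saturating-counter update for the two-bit case are all assumptions.

```python
def one_bit_mispredictions(outcomes, last_taken=False):
    """One-bit history: predict whatever the branch did the last time."""
    missed = 0
    for taken in outcomes:
        if last_taken != taken:
            missed += 1
        last_taken = taken
    return missed

def two_bit_mispredictions(outcomes, counter=1):
    """Two-bit saturating counter (0..3); values 2 and 3 predict taken."""
    missed = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            missed += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return missed

# Hypothetical workload: a loop branch taken 9 times, then not taken at
# loop exit, with the whole loop executed 10 times.
loop_branch = ([True] * 9 + [False]) * 10
```

On this assumed workload the one-bit scheme mispredicts 20 of the 100 branches, because each loop exit also causes a misprediction on re-entry, while the two-bit scheme mispredicts 11.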
Tracking branch history may be used in conjunction with a variety of structures. For example, branch history tracking may be used with a branch target buffer in which the target addresses of predicted branches are kept in a high speed cache. By utilizing such a structure, delays associated with calculating branch target addresses may be reduced. Another example would be using branch history tracking with a branch target cache in which target instructions themselves are stored in a high speed cache. This method reduces delays associated with fetching the required instructions from a more remotely located storage device. Other embodiments of branch history tracking are contemplated as well.
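A branch target buffer combined with history tracking might be modeled as follows. This is a toy sketch: the direct-mapped organization, the table size, the allocation policy, and all names are assumptions chosen for brevity, not a description of any particular design.

```python
class BranchTargetBuffer:
    """Toy direct-mapped branch target buffer (sizes and fields are illustrative)."""

    def __init__(self, entries=16):
        self.entries = entries
        # Each slot holds (branch_address, target_address, two_bit_counter) or None.
        self.table = [None] * entries

    def _index(self, address):
        return address % self.entries

    def lookup(self, address):
        """Return the predicted target, or None to continue sequentially."""
        slot = self.table[self._index(address)]
        if slot is None:
            return None
        tag, target, counter = slot
        if tag != address or counter < 2:
            return None
        return target

    def update(self, address, taken, target):
        """Record the resolved outcome of the branch at `address`."""
        index = self._index(address)
        slot = self.table[index]
        if slot is not None and slot[0] == address:
            counter = min(slot[2] + 1, 3) if taken else max(slot[2] - 1, 0)
            # Keep the stored target when the branch was not taken.
            new_target = target if taken else slot[1]
            self.table[index] = (address, new_target, counter)
        elif taken:
            # Assumed policy: allocate on a taken branch, starting weakly taken.
            self.table[index] = (address, target, 2)
```

Because the target address is read directly from the table, a hit avoids the delay of calculating the target before the next fetch.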
One example of a branch prediction mechanism is a cache-line approach in which branch predictions are stored corresponding to a particular cache line of instruction bytes in an instruction cache. A cache line is a number of contiguous bytes that are treated as a unit for allocation and deallocation of storage space within a cache. When the instruction cache line is fetched, the corresponding branch predictions are also fetched. Furthermore, when the particular cache line is discarded, the corresponding branch predictions are discarded as well. The cache line is aligned in memory. A cache-line based branch prediction mechanism may be made more accurate by storing a larger number of branch predictions for each cache line. A given cache line may include multiple branch instructions, each of which is represented by a different branch prediction. Therefore, allocating more branch predictions to a cache line allows more branch instructions to be represented and predicted by the branch prediction mechanism. A branch instruction that cannot be represented within the branch prediction mechanism is not predicted, and subsequently a "misprediction" may be detected if the branch is found to be taken. However, the complexity of the branch prediction mechanism is increased by the need to select between additional branch predictions. As used herein, a "branch prediction" is a value that may be interpreted by the branch prediction mechanism as a prediction of whether a branch instruction is taken or not taken. Furthermore, a branch prediction may include the target address. For cache-line based branch prediction mechanisms, a prediction of a sequential line to the cache line being fetched is a branch prediction when no branch instructions are within the instructions being fetched from the cache line.
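The coupling between cache lines and their branch predictions can be sketched as below. The line size, the number of predictions per line, the refusal to store a branch once the slots are full, and all names are hypothetical simplifications for illustration.

```python
LINE_SIZE = 16            # bytes per cache line (hypothetical)
PREDICTIONS_PER_LINE = 2  # branch predictions stored per line (hypothetical)

class LinePredictionStore:
    """Branch predictions allocated and discarded together with cache lines."""

    def __init__(self):
        # Aligned line address -> list of (offset_in_line, target_address).
        self.lines = {}

    def allocate(self, line_address):
        """Allocate a line; its branch prediction storage starts empty."""
        assert line_address % LINE_SIZE == 0, "cache lines are aligned in memory"
        self.lines[line_address] = []

    def discard(self, line_address):
        """Discarding a line discards its branch predictions as well."""
        self.lines.pop(line_address, None)

    def record_branch(self, branch_address, target):
        """Return True if stored; False if the line is absent or has no free slot."""
        offset = branch_address % LINE_SIZE
        predictions = self.lines.get(branch_address - offset)
        if predictions is None or len(predictions) >= PREDICTIONS_PER_LINE:
            return False  # this branch cannot be represented and is not predicted
        predictions.append((offset, target))
        return True
```

With two predictions per line in this sketch, a third branch in the same line goes unrepresented, which is the situation that storing a larger number of branch predictions per line is meant to reduce.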
A problem related to increasing the complexity of the branch prediction mechanism is that the increased complexity generally requires an increased amount of time to form the branch prediction. For example, selecting among multiple branch predictions may require a substantial amount of time. The offset of the fetch address identifies the first byte being fetched within the cache line; a branch prediction for a branch instruction prior to the offset should not be selected. The offset of the fetch address within the cache line may need to be compared to the offset of the branch instructions represented by the branch predictions stored for the cache line in order to determine which branch prediction to use. The branch prediction that should be selected is the one corresponding to the branch instruction that is subsequent to the fetch address offset and nearer to the fetch address offset than any other subsequent branch instruction. As the number of branch predictions is increased, the complexity of (and time required by) the selection logic increases. The increased time may result in the introduction of one or more "bubbles" into the instruction processing pipeline during clock cycles in which instructions cannot be fetched due to the lack of a branch prediction corresponding to a previous fetch address. The bubble occupies various stages in the instruction processing pipeline during subsequent clock cycles, and no work occurs at the stage including the bubble because no instructions are included in the bubble. Performance of the microprocessor may thereby be decreased.
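The selection rule described above can be written as a short sketch. The function name and the tuple layout are hypothetical, and treating a branch located exactly at the fetch offset as selectable is an assumption, since the text only excludes branches prior to the offset.

```python
def select_prediction(fetch_offset, predictions):
    """Pick the prediction for the nearest branch at or after the fetch offset.

    `predictions` is a list of (branch_offset, target_address) pairs stored
    for one cache line (an assumed layout). Returns None when no stored
    branch qualifies, in which case the sequential line would be predicted.
    """
    # Branches prior to the fetch offset are not part of the fetched bytes.
    candidates = [p for p in predictions if p[0] >= fetch_offset]
    if not candidates:
        return None
    # Among the remaining branches, the one nearest the fetch offset wins.
    return min(candidates, key=lambda p: p[0])
```

In software this is a simple filter and minimum; in hardware, each additional stored prediction adds another offset comparison and a wider selection among the results, which is the source of the added delay.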
As mentioned above, while using more bits to record the history of a branch may increase the accuracy of predictions, the disadvantage of such a technique is that the increased storage required for the additional bits increases the physical size and cost of the branch prediction storage.
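The storage cost scales linearly with the number of history bits, as the following arithmetic sketch shows. The function name and the example sizes in the usage note are hypothetical and chosen only to make the scaling concrete.

```python
def history_storage_bits(cache_lines, predictions_per_line, history_bits):
    """Total bits of branch history storage for a cache-line based scheme."""
    return cache_lines * predictions_per_line * history_bits
```

For an assumed instruction cache of 1024 lines with two branch predictions per line, one history bit per prediction requires 2048 bits, while two history bits per prediction require 4096 bits.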