1. Field of the Invention
The present invention relates to storing branch target addresses in a microprocessor, and more particularly, to methods for caching indirect branch target addresses.
2. Description of the Related Art
Modern microprocessor designers continually attempt to improve the performance of their products. One method to increase the performance of a microprocessor is to increase the clock frequency at which it operates. However, as clock frequencies increase, each functional unit within the microprocessor has less time in which to perform its designated task. Thus, other methods have been employed to further increase performance at a given clock frequency. Parallel execution is one such method. Parallel execution involves executing more than one instruction during each clock cycle. In order to implement parallel execution, many microprocessors use parallel functional units each configured to operate independently and in parallel.
Another method for increasing the performance of a microprocessor at a given clock cycle is out of order execution. Most programs depend upon their instructions executing in a particular order. This order is referred to as "program order". However, certain instructions within the program may execute out of program order as long as the desired functionality is still achieved. Instructions that may be executed out of order are those that do not "depend" upon the instructions around them. Microprocessors that implement out of order execution are able to determine which instructions do not depend on other instructions and therefore may be executed out of order.
There are several potential drawbacks to executing instructions out of order. One drawback in particular arises from branch instructions. Examples of branch instructions include conditional jump instructions (Jcc), unconditional jump instructions (JMP), return instructions (RET), and subroutine call instructions (CALL). These instructions make out of order execution particularly difficult because the microprocessor may be unable to determine whether or not instructions after the branch instruction should be executed.
Another method used to increase a microprocessor's performance at a given clock frequency is pipelining. Pipelining involves dividing the tasks the microprocessor must complete in order to execute an instruction. The tasks are divided into a number of stages that form an instruction processing "pipeline". The stages are typically serial in nature. For example, the first stage may be to "fetch" an instruction stored in an instruction cache, while the second stage may be to align the instruction and convey it to decode units for decoding or functional units for execution. Pipelines allow long streams of instructions to be executed in fewer clock cycles. The advantages of a pipelined microprocessor stem from its ability to process more than one instruction at a time. For example, a first instruction may be decoded at the same time a second instruction is being fetched from the instruction cache.
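By way of illustration only, the benefit of overlapping stages can be sketched with a simple cycle count. The sketch assumes an idealized pipeline in which every stage takes exactly one clock cycle and no stalls occur; the function names, the five-stage depth, and the 100-instruction workload are hypothetical and do not come from any particular microprocessor.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when a new instruction enters the first stage every cycle (ideal case)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles to pass through the pipeline;
    # each later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, stages = 100, 5  # hypothetical workload and pipeline depth
    print(unpipelined_cycles(n, stages))  # 500
    print(pipelined_cycles(n, stages))    # 104
```

Under these assumptions the pipelined case approaches one completed instruction per clock cycle once the pipeline is full.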
However, as with out of order execution, pipelined microprocessors may suffer performance setbacks as the result of branch instructions. Branch instructions may cause the pipeline to "stall" because the initial stages of the pipeline are unable to determine which instructions to fetch until after the branch instruction has been executed. For example, a conditional branch instruction is either "taken" or "not taken". If the branch instruction is not taken, then the instruction immediately following the branch instruction in program order is executed. Alternatively, if the branch instruction is taken, then a different instruction, located at the branch target address, is executed. In pipelined microprocessors, it is difficult for initial or fetch stages of the pipeline to determine whether to fetch the "taken" or "not taken" target instruction until the branch instruction has completed execution. Thus, the microprocessor pipeline may stall or form "bubbles" (i.e., idle clock cycles that propagate through the pipeline) while the initial stages await the results of the branch instruction.
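The cost of such bubbles may be estimated with a rough model. The following sketch assumes that every branch instruction idles the fetch stage for a fixed number of clock cycles until the branch resolves, and that no prediction is attempted. The function name, the stall penalty, and the branch count are hypothetical values chosen only for illustration.

```python
def cycles_with_branch_stalls(num_instructions: int, num_stages: int,
                              num_branches: int, stall_penalty: int) -> int:
    """Estimate total cycles when each branch inserts `stall_penalty` bubble cycles."""
    if num_instructions == 0:
        return 0
    # Ideal pipelined cycle count: fill the pipeline once, then one instruction per cycle.
    ideal = num_stages + (num_instructions - 1)
    # Each branch adds a fixed number of idle cycles while fetch waits for the outcome.
    return ideal + num_branches * stall_penalty


if __name__ == "__main__":
    # Hypothetical: 100 instructions, 5 stages, 25 branches, 3 bubble cycles per branch.
    print(cycles_with_branch_stalls(100, 5, 25, 3))  # 179
```

In this hypothetical case the bubbles add 75 cycles to an ideal count of 104, which suggests why reducing stalls on branches is attractive.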
In order to prevent stalls or bubbles from forming in the pipeline, many microprocessors implement a branch prediction scheme. Branch prediction involves predicting whether each branch instruction is taken or not taken before the branch instruction actually completes execution. While a number of different algorithms may be used, most rely upon storing the taken/not taken history of the particular branch instruction. For example, when a branch instruction is executed and taken, branch prediction hardware will store that information so that the next time the branch instruction is fetched, the pipeline will automatically assume the instruction is taken based upon its previous performance.
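The history-based scheme described above may be sketched as a one-bit "last outcome" predictor. This is a minimal software model under stated assumptions, not a description of any particular hardware: the class name, the dictionary keyed by branch address, and the default "not taken" prediction for unseen branches are all hypothetical choices.

```python
class LastOutcomePredictor:
    """One-bit predictor: assumes a branch repeats its most recent outcome."""

    def __init__(self, default_taken: bool = False):
        # Prediction used for a branch with no recorded history (an assumption).
        self.default_taken = default_taken
        # Maps branch address -> last observed outcome (True means taken).
        self.history = {}

    def predict(self, branch_address: int) -> bool:
        """Return the predicted outcome when the branch is fetched."""
        return self.history.get(branch_address, self.default_taken)

    def update(self, branch_address: int, taken: bool) -> None:
        """Record the actual outcome once the branch has executed."""
        self.history[branch_address] = taken


def count_mispredictions(predictor, branch_address, outcomes):
    """Replay a sequence of actual outcomes and count wrong predictions."""
    mispredictions = 0
    for taken in outcomes:
        if predictor.predict(branch_address) != taken:
            mispredictions += 1
        predictor.update(branch_address, taken)
    return mispredictions


if __name__ == "__main__":
    # Hypothetical loop-closing branch: taken nine times, then not taken on exit.
    outcomes = [True] * 9 + [False]
    print(count_mispredictions(LastOutcomePredictor(), 0x4000, outcomes))  # 2
```

In this example the predictor is wrong only on the first encounter, where no history exists, and on the loop exit, where the stored history no longer matches the actual outcome.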
While storing branch prediction information reduces the probability of a stall or bubble forming, it also has disadvantages. One disadvantage is the complex hardware required to maintain and update the branch history information. A second disadvantage is the space required to store the branch history information. Die space within microprocessors is a relatively scarce resource, thus using large amounts of space to store branch predictions is disadvantageous. As instruction caches grow in size, more branch instructions may be stored therein. This in turn requires more storage space for branch history prediction information. Furthermore, modern software tends to have a high concentration of branch instructions (e.g., one out of every four instructions). This high concentration, coupled with the inability to know exactly how many branches are stored in the instruction cache, typically results in large amounts of die space being devoted to storing branch history information.
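The scaling of history storage with cache size may be illustrated with a back-of-the-envelope calculation. The sketch assumes a hypothetical organization in which one fixed-width history entry is reserved for every fixed-size block of instruction cache bytes. The cache size, entry width, and block sizes below are illustrative values only and are not taken from any actual design.

```python
def history_storage_bits(cache_bytes: int, bytes_per_entry: int, bits_per_entry: int) -> int:
    """Bits of history storage when one entry is reserved per `bytes_per_entry` cache bytes."""
    entries = cache_bytes // bytes_per_entry
    return entries * bits_per_entry


if __name__ == "__main__":
    cache_bytes = 32 * 1024  # hypothetical 32 KiB instruction cache
    # Worst case: a branch may start at any byte, so one 2-bit entry per byte.
    print(history_storage_bits(cache_bytes, 1, 2))   # 65536
    # Coarser provisioning: one 2-bit entry per 16-byte block.
    print(history_storage_bits(cache_bytes, 16, 2))  # 4096
```

Because the number of branches actually present in the cache is not known in advance, a design that provisions for the worst case reserves far more storage than one that shares entries across larger blocks, and doubling the cache size doubles either figure.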
For these reasons, a method for efficiently storing branch prediction and history information is desired. In particular, a method for reducing the amount of storage space needed for branch history and prediction information is desirable.