1. Field of the Invention
The present invention pertains to computer systems. More particularly, this invention relates to synthetic dynamic branch prediction in a computer system that is both accurate and cost-effective.
2. Description of the Related Art
As is known, a computer system typically includes one or more central processing units (CPUs) or processors. The processor typically executes instructions of software programs to perform a variety of tasks in the computer system. The instructions of the software programs are in machine language form (i.e., binary form) because the processor can only understand and interpret machine language. The machine language instructions are referred to as machine code or object code below.
Because machine language is very difficult to write and understand, high level source programming languages (such as C and Fortran) have been developed to code or define the instructions of a software program in a human-readable fashion. A software program written in such a source programming language is referred to as source code. The source code needs to be converted or translated into the machine code by a compiler program before being executed by the processor.
The earlier prior art processors are typically single instruction single data (SISD) processors. A SISD processor typically receives a single instruction stream and a single data stream. The SISD processor sequentially executes each instruction, acting on data in a single storage area. This SISD processor architecture, however, presents an obstacle to achieving high processing throughput.
To increase the processing throughput of a processor, many parallel processing architectures have been developed. One such parallel processing model is known as pipeline processing. In a simple pipelined processor, the pipeline typically includes several stages. These stages, for example, may include a fetch stage, a decode stage, an execute stage, and a write-back stage. In such a pipelined processor, instructions are executed sequentially through these stages of the pipeline in an overlapping fashion.
However, the performance of the pipelined processors depends strongly on the efficiency with which branch operations are handled. The branch operations are referred to as branches below. As is known, branches typically cause lengthy breaks in the pipeline by redirecting instruction flow during program execution. This is typically due to the fact that when a branch instruction is fetched, the address of the instruction that will be executed next is not immediately known. Hence, the fetch stage must stall and wait for the branch target to be calculated. The target of a branch is generally resolved during the execute stage. Therefore, the fetch unit stalls while the branch advances through the decode and execute stages of the pipeline. After the branch instruction has completed execution, the branch direction is known and instruction fetch can safely resume at the correct target address. Stalling the instruction fetch for each branch introduces a large number of empty cycles, referred to as bubbles, into the pipeline. These bubbles severely limit the performance of pipelined processors by restricting the utilization of the processor resources. The performance problem becomes amplified as pipeline depth and instruction issue width of processors are increased. The instruction issue width refers to the number of instructions that the processor can execute per cycle.
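As a rough illustration of the stall cost described above, the following sketch counts cycles for a single-issue pipeline in which the fetch stage waits on every branch until that branch has completed the execute stage. The four stages follow the example stages named earlier; the function names, the single-issue assumption, and the instruction counts are hypothetical and chosen only for illustration.

```python
# Hypothetical model: single-issue pipeline with no branch prediction.
STAGES = ("fetch", "decode", "execute", "write-back")


def stall_cycles_per_branch(stages=STAGES, resolve_stage="execute"):
    """Bubbles inserted per branch when fetch waits for the branch to resolve.

    The branch occupies the fetch stage in cycle 0 and completes the
    resolving stage at cycle index(resolve_stage), so fetch is idle for
    that many cycles minus zero cycles of useful overlap.
    """
    return stages.index(resolve_stage) - stages.index("fetch")


def total_cycles(num_instructions, num_branches, stages=STAGES,
                 resolve_stage="execute"):
    """Cycles to run a program, assuming one instruction fetched per cycle."""
    fill = len(stages) - 1  # cycles needed to fill the pipeline once
    bubbles = num_branches * stall_cycles_per_branch(stages, resolve_stage)
    return fill + num_instructions + bubbles


if __name__ == "__main__":
    # Assumed workload: 100 instructions, 20 of which are branches.
    print(total_cycles(100, 0))   # no branches
    print(total_cycles(100, 20))  # every branch stalls fetch
```

Under these assumptions each branch costs two bubbles in the four-stage pipeline, and the per-branch cost grows as more stages are placed between fetch and the stage that resolves the branch.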
In order to minimize the pipeline breaks caused by branch instructions, branch prediction is employed. Branch prediction is an effective approach for dealing with branches in pipelined processors. Branch prediction guesses the targets of branches in order to speculatively continue executing instructions while a branch target is being calculated. In the cases where the prediction is correct, all the speculative instructions are useful instructions, and pipeline bubbles are completely eliminated. On the other hand, incorrect prediction results in the normal pipeline bubbles while the branch target is resolved, as well as additional delay to remove all instructions that were improperly executed. Clearly, the accuracy of the branch prediction strategy is central to processor performance.
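The dependence on prediction accuracy can be made concrete with a small calculation. The sketch below is an assumed cost model, not one taken from this description: a correct prediction costs no bubbles, while an incorrect prediction costs the normal stall plus an additional flush delay. The parameter names and the example values are hypothetical.

```python
def expected_branch_penalty(accuracy, stall_cycles, flush_cycles):
    """Average bubbles per branch under an assumed two-outcome cost model.

    accuracy     -- fraction of branches predicted correctly (0.0 to 1.0)
    stall_cycles -- bubbles while the branch target is resolved
    flush_cycles -- extra delay to remove improperly executed instructions
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * (stall_cycles + flush_cycles)


if __name__ == "__main__":
    # Assumed values: 2 stall cycles and 1 flush cycle per misprediction.
    for acc in (0.5, 0.9, 0.99):
        print(acc, expected_branch_penalty(acc, 2, 1))
```

In this model the average penalty falls linearly with accuracy, so halving the misprediction rate halves the bubbles attributable to branches.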
Traditionally, branch prediction is accomplished in one of two ways: static prediction at compile-time via compiler analysis, or dynamic prediction at run-time via special hardware structures. Static branch prediction utilizes information available at compile time to make predictions. In general, the compiler is responsible for static branch prediction. The most common static branch prediction approach is to use profile information. FIGS. 1 and 2 show this implementation. FIG. 1 shows the instruction format for a branch instruction and FIG. 2 is a state diagram illustrating the process of the static branch prediction approach using profile information.
As can be seen from FIG. 1, a prediction bit is provided for each branch instruction. To set the prediction bit for a branch instruction, the source code containing the branch instruction is first converted into the machine code by a compiler (see the pre-compilation stage 20 of FIG. 2). At this time, the prediction bit of the branch instruction is not set. Then the machine code is executed with sample input data (see the code execution stage 21 of FIG. 2) to obtain branch statistics data of the branch instruction. The branch statistics data indicates the number of times that the branch is taken (i.e., to branch) and the number of times that the branch is not taken (i.e., not to branch). The source code is then compiled at the compilation stage 22 with the statistics data to set the prediction bit of the branch instruction accordingly. If the statistics data indicates that the branch is taken on the majority of occasions, then the compiler sets the prediction bit of the branch instruction to taken at the compilation stage 22. Otherwise, the compiler sets the prediction bit to not taken. The major advantage of static branch prediction is low cost. Another advantage is that the prediction is realized without requiring hardware. Branch state information is not required during code execution because the prediction is explicitly specified by the program itself.
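The profile-based flow of FIG. 2 can be sketched roughly as follows. This is a minimal illustration, not the compiler's actual implementation: the trace format (pairs of a branch identifier and a taken flag), the function names, and the sample data are all assumptions. Ties fall to not taken, matching the "otherwise" case described above.

```python
from collections import Counter


def collect_branch_statistics(trace):
    """Count taken / not-taken outcomes per branch from a profiling run.

    trace -- iterable of (branch_id, was_taken) pairs; this format is a
             hypothetical stand-in for the output of code execution stage 21.
    """
    taken, not_taken = Counter(), Counter()
    for branch_id, was_taken in trace:
        if was_taken:
            taken[branch_id] += 1
        else:
            not_taken[branch_id] += 1
    return taken, not_taken


def set_prediction_bits(taken, not_taken):
    """Return {branch_id: bit}, with 1 meaning taken and 0 meaning not taken.

    A branch is marked taken only if it was taken in the majority of its
    profiled executions; otherwise it is marked not taken.
    """
    bits = {}
    for branch_id in set(taken) | set(not_taken):
        bits[branch_id] = 1 if taken[branch_id] > not_taken[branch_id] else 0
    return bits


if __name__ == "__main__":
    # Assumed sample profile for three branches.
    sample_trace = [("b1", True), ("b1", True), ("b1", False),
                    ("b2", False),
                    ("b3", True), ("b3", False)]
    print(set_prediction_bits(*collect_branch_statistics(sample_trace)))
```

Because the bits are computed once from the sample run and then fixed in the program, the result is only as representative as the sample input data.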
Disadvantages are, however, associated with static branch prediction. One disadvantage is that the prediction is fixed at compile-time and thus cannot vary during program execution. As a result, the accuracy of static branch prediction is inherently limited for unbiased branches. Another disadvantage is that each branch instruction requires an extra bit for the prediction.
On the other hand, dynamic branch prediction utilizes run-time behavior to make predictions. In general, a hardware structure is provided to maintain branch history. Based on the current history, a prediction is made for each branch encountered in the program. FIG. 3 shows a prior art scheme of dynamic branch prediction. As can be seen from FIG. 3, a 2-bit counter 31 is provided that includes a number of entries. When a branch 30a of a program 30 is executed, the branch 30a is hashed to its corresponding entry 31a of the 2-bit counter 31. The value stored in the entry 31a determines whether the branch 30a is predicted to be taken or not taken. If the branch 30a is taken, the counter value stored in the entry 31a is incremented. If the branch 30a is not taken, the counter value stored in the entry 31a is decremented. The counter saturates: when the counter value of an entry of the counter 31 reaches three, the value remains at three when further incremented, and likewise remains at zero when further decremented. The major advantage of dynamic branch prediction is increased accuracy. The use of run-time information and the ability to predict a branch differently during various phases of execution allow dynamic branch prediction techniques to enjoy significantly higher accuracy than that of static branch prediction.
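A software model of the FIG. 3 scheme might look like the sketch below. Several details are assumptions rather than facts from this description: the hash is a simple modulo of the branch address, entries start at zero, and a counter value of two or three is read as "predict taken" (the conventional threshold for 2-bit saturating counters). The class and method names are hypothetical.

```python
class TwoBitCounterTable:
    """Table of 2-bit saturating counters indexed by a hash of the branch."""

    MAX_VALUE = 3  # largest value a 2-bit counter can hold

    def __init__(self, num_entries=1024):
        # Assumption: all entries start at 0 (strongly not taken).
        self.counters = [0] * num_entries

    def _index(self, branch_address):
        # Assumption: modulo hashing; real hardware may use other bit selections.
        return branch_address % len(self.counters)

    def predict(self, branch_address):
        """Return True to predict taken (assumed threshold: value >= 2)."""
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        """Increment on taken, decrement on not taken, saturating at 3 and 0."""
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(self.MAX_VALUE, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    table = TwoBitCounterTable(num_entries=16)
    address = 0x40  # hypothetical branch address
    for outcome in (True, True, True, False, True):
        print(table.predict(address), outcome)
        table.update(address, outcome)
```

With this threshold, a single not-taken outcome after a long run of taken outcomes lowers the counter from three to two without flipping the prediction, which is one reason a 2-bit counter is often preferred over a single history bit.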
However, dynamic branch prediction is not without disadvantages. One major disadvantage of dynamic branch prediction techniques is cost. This is because dynamic branch prediction techniques typically utilize relatively large amounts of hardware and pose difficult challenges for circuit designers in meeting cycle time goals.