1. Field of the Invention
The present invention relates generally to microprocessors, and more particularly to a circuit and method for preloading a prediction circuitry within the microprocessor.
2. Description of the Relevant Art
Microprocessor based computer systems have become prevalent in today's society. The increasing use of computer systems in large part is related to advances in semiconductor chip technology, which technology is increasing circuit densities so that microprocessors can be implemented on one or a very small number of semiconductor chips. Additionally, speeds within microprocessors are increasing with the use of scalar computation with superscalar technology being the next logical step in the evolution of microprocessor. The term superscalar describes an implementation that improves performance by a concurrent execution of scalar instructions. Scalar instructions are the type of instruction typically found in general purpose microprocessors. Using today's semiconductor processing technology, a single microprocessor chip can incorporate high performance techniques that were once applicable only to large scale scientific processors.
Microprocessors run application programs. An application program comprises a group of instructions. In running application programs, microprocessors receive and execute instructions in some sequence. There are several steps generally performed by the microprocessor in executing a single instruction, including: fetching the instruction, decoding the instruction, assembling the operands required by the instruction, performing the operations specified by the instructions, and writing the results of the instruction to storage. These steps are controlled by a periodic clock signal. The period of the clock signal is the processor cycle time.
The time taken by a microprocessor to complete a program is determined by at least three factors: the number of instructions required to execute the program, the average number of processor cycles required to execute an instruction, and the processor cycle time. Microprocessor performance is improved by reducing the time taken by the microprocessor to complete the program, which dictates reducing one or more of these three factors.
One way to improve the performance of microprocessors is by overlapping the steps of different instructions, using a technique called pipelining. In pipelining, the various steps of instruction execution described above are performed by independent units called pipelined stages. Pipelining reduces the average number of cycles to execute an instruction, though not the total amount of time required to execute an instruction, by overlapping instructions and thus permitting the processor to handle more than one instruction at a time. Pipelining reduces the average number of cycles per instruction by as much as a factor of three. However, when executing a conditional branch instruction, the pipeline may sometimes stall until the result (resolution) of the conditional branch operation is known and the correct next instruction is fetched for execution. This stall is known as branch delay penalty and is a limiting factor in the speed enhancing effects of pipelining.
A typical pipelined scalar microprocessor executes one instruction per processor cycle. A superscalar microprocessor further reduces the average number of cycles per instruction beyond what is possible in a pipelined scalar processor, by concurrent execution of several instructions in different pipelines. While superscalar processors are simple in theory, there is more to achieving increased performance than simply increasing the number of pipelines. Increasing the number of pipelines makes it possible to execute more than one instruction per cycle, but there is no guarantee that any given sequence instructions can take advantage of the capability. Instructions are not always independent of one another, but are often dependent. Instruction dependencies can be either data dependent or control dependent. A control dependency occurs when a control decision, such as for example, a conditional branch decision must be made before subsequent instructions can be executed.
Branch prediction mechanisms are often employed in superscalar microprocessor to predict the outcome of a conditional branch before its resolution. Once a branch prediction is made the microprocessor pursues the likely execution path prior to decode and subsequent execution of the conditional branch instruction. At any point within the path of instruction execution, if the microprocessor determines that a prior prediction was incorrect, the microprocessor backs up in the instruction stream and proceeds down the correct path. There is a penalty from employing branch prediction mechanisms within a microprocessor. The penalty relates to wasted time associated with instructions completed after the conditional branch is predicted but before the branch outcome is actually determined. These completed instructions are discarded after a branch misprediction, and the time the microprocessor spent executing them is wasted.
Branch prediction can be either static or dynamic. In static branch prediction, each branch instruction encountered is assumed to be either "taken" or "not taken." Generally, the "not taken" branches are usually assumed, and prediction is correct more often than not. However, when there is a branch misprediction, the instruction pipeline has to be flushed of the instructions currently under execution. In this case, the execution of instructions is terminated and the pipeline is reloaded with a new set of instructions corresponding to the target branch address of the misdirected branch instruction.
Dynamic branch prediction mechanisms generally include a branch prediction unit. Several different dynamic branch prediction mechanisms have been studied extensively. One mechanism involves a technique referred to as bimodal branch prediction. In bimodal branch prediction, a prediction is made based on the direction the particular branch went the last few times the particular branch was executed. It is possible that more accurate predictions can be made using more history for the branch instruction. Another mechanism considers the history of each branch independently and takes advantage of repetitive patterns. This technique is referred to a local branch prediction. Another technique uses the combined history of all recent branches in making a prediction. This technique is often referred to as global branch prediction.
In the global branch prediction technique, a dedicated N-bit shift register is used to record the resolution of the most recent N conditional branches. When a branch instruction is encountered, the contents of the shift register are subsequently used to access, directly or indirectly, a branch history table that stores a plurality of two bit counters. As will be more fully described below, the most significant bit of the accessed two bit counter defines the prediction for the encountered branch instruction. The prediction (logical one for taken or logical zero for not taken) is then shifted into the shift register. After resolution of the branch instruction, the appropriate counter in the table is incremented, if the branch instruction is resolved as taken. Likewise, for each not taken branch, the appropriate counter is decremented. The counter is saturating such that the counter is not decremented past zero nor incremented past three. Furthermore, the contents of the shift register is corrected in the event of misdirection.
As noted above, the most significant counter bit determines the prediction. Repeatedly taken branches will be predicted to be taken, and repeatedly not taken branches will be predicted to be not taken. By using the two bit counter, a prediction scheme can tolerate a branch going an unusual direction one time and keep predicting the usual branch direction.
A variation of the global branch prediction involves the use of the branch instruction address or program counter (pc) of the branch instruction. This technique is often referred to as global branch prediction with index selection. In this scheme, the branch history table is indexed with a concatenation of the shift register contents and the pc. Another variation of the global branch prediction technique involves XORing the pc with the contents of the shift register, the results of which are used to access the branch history table for a particular counter. This technique is often referred to as global prediction with index sharing.
Most hardware prediction schemes use one of the global prediction techniques discussed above for predicting branch behavior. While this is an effective and efficient method, and it requires a relatively small amount of integrated circuit real estate, there are classes of algorithms where application program behavior can be accurately predicated a priori. Examples would include digital signal processing behavior including many commonly used algorithms such as fast fourier transforms, discrete cosine transforms, finite impulse response filters, infinite impulse response filters, adaptive filters, and many more. These types of software processing are frequently used as building blocks in constructing many commercial software based application programs.
In these types of algorithms, it is often noted that a portion of code will loop X number of times each time it is executed or called with a jump back to an earlier location the first X-1 number of times. At the X.sup.th time, the code reaches the statement that it will continue and go on to other sections. In cases like this, the prediction unit may mispredict a branch or similar decision the first Y number of times due to the behavior of previously executed branch instructions. The mispredictions cause a subsequent delay while the microprocessor prepares the non-predicted address and jumps there.