1. Field of the Invention
The present invention relates to pipelined computer processors having branch target buffers for improving performance of branch instruction execution. The invention is useful for single pipeline architectures and is also useful for superscalar pipelined architectures that have two or more pipelines for processing instructions.
2. Description of Related Art
Computer designers are continually attempting to make computers run faster for higher performance. Computer processors process a series of instructions that are supplied to them from a source such as memory. One way to build faster computers is to design a computer that processes instructions faster.
The original computers were designed to completely process one instruction before beginning the next instruction in the sequence. Major architectural advances that have increased performance include the use of pipelined and superscalar architectures. These architectures introduce higher levels of design complexity and cost in the computer processors; however, this additional cost is more than offset by the increased performance of pipelined and superscalar computers.
Performance can also be increased by use of caches in computer architectures. Caches are utilized to store and supply often used information such as data and instructions. Within one clock cycle, a cache can supply the needed information without the memory access that could consume several cycles. One example of a cache that increases performance during a branch instruction is termed a "branch target buffer".
In addition to speed, another concern of processor designers is the processor's compatibility with previously designed processors. If any new computer is to be commercially successful, it must have a base of application programs which it can run when it is introduced in order to be of interest to users. The most economical way to provide such programs is to design the new computer processor to operate with the programs designed for an earlier computer or family of computers. This type of design for compatibility is exemplified by the microprocessors manufactured by INTEL Corporation, including the 8086, 8088, 80286, i386™, and i486™, hereinafter referred to as the INTEL microprocessors.
As mentioned briefly above, the speed of computers is increased by pipelining instructions. A pipelined computer divides instruction processing into a series of steps, or stages, each of which is executable in a single clock cycle. In a non-pipelined computer, each instruction is processed until it is complete and only then does processing begin on the next instruction. In a pipelined computer, several sequential instructions are processed simultaneously in different stages of the pipeline. Processing in the different processing stages may proceed simultaneously in one clock period in separate portions of the computer. The computers based on INTEL microprocessors, such as the i486™ microprocessors, pipeline instructions so that each stage of the operation is handled in one clock period. The stages into which instruction processing for an INTEL microprocessor is divided include a prefetch stage for fetching the instruction from wherever it is stored, a first and a second decode stage for decoding the instruction, an execution stage for executing the instruction, and a writeback stage for writing the results of the execution to the registers and memory for later use. Each of the steps is designed to require one clock period. Thus, during a first clock period, the prefetch portion of the computer fetches an instruction from storage and aligns it so that it is ready for decoding. During a second clock period, the prefetch portion of the computer fetches the next instruction from storage and aligns it, while the first stage decoder portion of the computer decodes the first instruction fetched. During the third clock period, the first instruction fetched is further decoded in the second stage decoder, the second instruction fetched is decoded in the first stage decoder, and another instruction is fetched and aligned in the prefetch stage.
Pipelining continues through each of the stages including the execution stage and the writeback stage, and thus the overall speed of computer processing is significantly increased over a non-pipelined computer.
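The clock-by-clock overlap described above can be illustrated with a minimal sketch. It assumes an idealized pipeline in which one instruction enters per clock period and no stage ever stalls; the stage labels, the function names, and the assumption that a non-pipelined computer spends one clock period on each of the five steps are illustrative and are not taken from any particular processor.

```python
# Stage labels for the five stages named in the text (labels are illustrative).
STAGES = ["prefetch", "decode1", "decode2", "execute", "writeback"]

def stage_at(instr_index, clock):
    """Return the stage holding instruction `instr_index` (0-based) during
    clock period `clock` (0-based), or None if it is not in the pipeline.
    ASSUMPTION: one instruction enters per clock and nothing stalls."""
    position = clock - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None

def total_clocks(num_instructions, pipelined=True):
    """Clock periods needed to complete all instructions in the ideal model.
    ASSUMPTION: a non-pipelined machine spends one clock on each step."""
    depth = len(STAGES)
    if num_instructions == 0:
        return 0
    if pipelined:
        # The first instruction takes `depth` clocks; each later one adds one.
        return depth + num_instructions - 1
    return depth * num_instructions

# Third clock period (clock index 2): three instructions are in flight.
for i in range(3):
    print(f"instruction {i}: {stage_at(i, 2)}")
print(total_clocks(10), "clocks pipelined;", total_clocks(10, pipelined=False), "non-pipelined")
```

Under these assumptions, ten instructions complete in 14 clock periods when pipelined and in 50 when each instruction must finish before the next begins.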
In a superscalar architecture, two or more instructions may be processed simultaneously in one stage. A superscalar computer has two or more processing paths that are capable of simultaneously executing instructions in parallel. In a scalar computer, the same type of instructions would be run serially. It should be apparent that if two or more instructions are run simultaneously, then the computer can process instructions faster.
If a branch instruction, such as a jump, return, or conditional branch, is in the series of instructions, a pipelined computer will suffer a substantial performance penalty on any taken branch unless there is some form of branch prediction. The penalty arises on a taken branch because the instructions that follow the branch in the pipeline must be thrown away, or "flushed." For example, if the microarchitecture has three stages preceding an execution stage, then the penalty will be at least three clock cycles when a branch is taken and not predicted, assuming the branch is resolved in the execution stage. This penalty is paid when the incorrect instructions are flushed from the pipeline and the correct instruction at the actual target address is inserted into the pipeline.
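The three-cycle figure can be checked with a small simulation. The following is a sketch only: it models the stages up to and including the execution stage, assumes that every branch in the program is taken and unpredicted, and assumes that the branch is resolved in the execution stage. The function name and the program encoding are hypothetical.

```python
def simulate(program, stages_before_execute=3):
    """Run `program` through a simple pipeline with no branch prediction.

    program[addr] is a target address if the instruction at `addr` is a taken
    branch, or None for an ordinary instruction (hypothetical encoding).
    Returns (clock of last execution, instructions flushed, execution order).
    """
    pipe = [None] * (stages_before_execute + 1)  # last slot is the execution stage
    pc, clocks, last, flushed, executed = 0, 0, 0, 0, []
    while pc < len(program) or any(a is not None for a in pipe):
        # Advance every instruction one stage and fetch the next sequential one.
        fetched = pc if pc < len(program) else None
        pipe = [fetched] + pipe[:-1]
        if fetched is not None:
            pc += 1
        clocks += 1
        addr = pipe[-1]
        if addr is not None:
            executed.append(addr)
            last = clocks
            target = program[addr]
            if target is not None:
                # Taken branch resolved here: discard the younger instructions
                # and restart fetching at the actual target address.
                flushed += sum(a is not None for a in pipe[:-1])
                pipe[:-1] = [None] * stages_before_execute
                pc = target
    return last, flushed, executed

# Instruction 1 is a taken branch to address 5; instructions 2-4 are flushed.
with_branch = simulate([None, 5, None, None, None, None, None])
straight_line = simulate([None] * 4)  # the same four executed instructions, no branch
print(with_branch, straight_line[0])
```

In this model the branching program executes four instructions in 10 clocks, against 7 clocks for four straight-line instructions, which is the three-clock penalty described above.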
One way to increase the performance of executing a branch instruction is to predict the outcome of the branch instruction, and insert the predicted instruction into the pipeline immediately following the branch instruction. If such a branch prediction mechanism is implemented in a microprocessor, then the penalty is incurred only if the branch is mispredicted. It has been found that a large number of branches actually do follow the predictions, as exemplified by the prevalence of repetitive loops. For example, it may be found that 80% of the branch predictions are correct.
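The benefit of prediction can be estimated with simple arithmetic. The sketch below assumes that a correct prediction costs nothing and that every misprediction costs the same fixed number of clock cycles; the function name is hypothetical, and the three-cycle penalty is the illustrative figure used for a pipeline with three stages preceding the execution stage.

```python
def average_branch_penalty(accuracy, mispredict_penalty):
    """Average clock cycles lost per predicted branch.

    ASSUMPTION: correct predictions are free and every misprediction costs
    exactly `mispredict_penalty` cycles.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * mispredict_penalty

# 80% correct predictions with a three-cycle flush penalty.
print(round(average_branch_penalty(0.80, 3), 2))
```

Under these assumptions the average cost is about 0.6 cycles per branch, compared with the full three cycles paid on every taken branch when no prediction is made.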
Several types of branch prediction mechanisms have been developed. One type of branch prediction mechanism uses a branch target buffer that stores a plurality of entries including an index to a branch instruction. In addition to the index, each entry may include an instruction address, an instruction opcode, history information, and possibly other data. In a microprocessor utilizing a branch target buffer, the branch prediction mechanism monitors each instruction as it enters into the pipeline. Specifically, each instruction address is monitored, and when the address matches an entry in the branch target buffer, then it is determined that that instruction is a branch instruction that has been taken before. After the entry has been located, the history information is tested to determine whether or not the branch will be predicted to be taken. Typically, the history is determined by a state machine which monitors each branch in the branch target buffer, and allocates bits depending upon whether or not a branch has been taken in the preceding cycles. If the branch is predicted to be taken, then the predicted instructions are inserted into the pipeline. Typically, the branch target entry will have opcodes associated with it for the target instruction, and these instructions are inserted directly into the pipeline. Also associated with the branch target buffer entry is an address that points to the predicted target instruction of the branch. This address is used to fetch additional instructions.
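The lookup and history test described above might be sketched as follows. This is an illustration under stated assumptions and not the structure of any particular branch target buffer: the history is modeled as a two-bit saturating counter, a new entry starts in the strongest "taken" state, the buffer is an unbounded dictionary where real hardware has a fixed number of entries, and stored opcodes are omitted. All class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BTBEntry:
    target_address: int  # address of the predicted target instruction
    history: int = 3     # ASSUMPTION: 2-bit counter 0..3; 2 or more predicts taken

class BranchTargetBuffer:
    def __init__(self):
        # Branch instruction address -> entry (ASSUMPTION: unlimited capacity).
        self.entries = {}

    def predict(self, instr_address):
        """Return the predicted target address, or None when the address does
        not match an entry or the history predicts that the branch is not taken."""
        entry = self.entries.get(instr_address)
        if entry is not None and entry.history >= 2:
            return entry.target_address
        return None

    def update(self, instr_address, taken, target_address=None):
        """Record the resolved outcome of the branch at `instr_address`."""
        entry = self.entries.get(instr_address)
        if entry is None:
            if taken:  # only branches that have been taken get an entry
                self.entries[instr_address] = BTBEntry(target_address)
            return
        if taken:
            entry.history = min(3, entry.history + 1)
            entry.target_address = target_address
        else:
            entry.history = max(0, entry.history - 1)

btb = BranchTargetBuffer()
btb.update(0x100, taken=True, target_address=0x200)  # first taken execution
print(btb.predict(0x100) == 0x200)  # address now matches an entry
print(btb.predict(0x104))           # no entry for this address
```

With a two-bit counter, a single not-taken outcome does not reverse a strongly taken prediction, which suits loop branches that are taken many times and fall through once.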
Processing the branch instruction and each following instruction then proceeds down the pipeline for several clock cycles until the branch instruction has completed the execution stage, after which the "takenness" of the branch is known. If the branch is taken, the actual branch target address of the branch will be known. If the branch has been correctly predicted, then execution will continue in accordance with the prediction. However, if the branch has been mispredicted, then the pipeline is flushed and the correct instruction is inserted into the pipeline. In a superscalar computer, which has two or more pipelines through which instructions flow side-by-side, the performance penalty on a misprediction is even greater because at least twice the number of instructions may need to be flushed.
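The check made once the branch leaves the execution stage, and the growth of the flush cost with the number of pipelines, can be sketched as follows. The function names are hypothetical, a predicted target of None stands for "predicted not taken", and the flush estimate assumes that every stage preceding the execution stage in every pipeline holds one instruction.

```python
def prediction_correct(predicted_target, taken, actual_target):
    """Compare a prediction with the resolved outcome of the branch.

    predicted_target is None when the branch was predicted not taken
    (hypothetical convention for this sketch).
    """
    if taken:
        return predicted_target == actual_target
    return predicted_target is None

def instructions_flushed(stages_before_execute, num_pipelines):
    """Estimate of the instructions discarded on a misprediction.
    ASSUMPTION: each earlier stage of each pipeline holds one instruction."""
    return stages_before_execute * num_pipelines

print(prediction_correct(0x200, taken=True, actual_target=0x200))  # correct prediction
print(prediction_correct(0x200, taken=True, actual_target=0x300))  # wrong target
print(instructions_flushed(3, 1), instructions_flushed(3, 2))      # one vs. two pipelines
```

A prediction counts as wrong both when the direction is wrong and when the branch is taken to a different target than the one stored. With two pipelines, the estimate doubles from three flushed instructions to six.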
In a superscalar architecture, the designer must decide in which pipelines branch instructions are to be permitted. In other words, the designer must decide whether to allow branch instructions in only one pipeline or in two or more pipelines. If, for example, a branch instruction is permitted to be in only the first pipeline, then the capabilities of the superscalar architecture are not being fully utilized.
It would be an advantage to provide an apparatus, including a branch target buffer, that provides increased performance in a superscalar microprocessor and that allows a branch instruction to be executed in either of the pipelines. It would be an advantage if the silicon space requirements could be reduced, costs could be reduced, and cache coherency problems could be avoided. It would also be an advantage if the branch prediction mechanism were compatible with multi-clock instructions, which require at least two clocks to decode, and also require additional code that is in the prefetch stage. An example of such a multi-clock instruction is a prefixed instruction in the INTEL microprocessors.