1. Field of the Invention
The present invention relates to predicting addresses of future instructions in a computer instruction stream, and more particularly to a system that predicts the address of an instruction following a branch instruction by concurrently performing a fast branch prediction operation and a slower, more accurate branch prediction operation.
2. Related Art
Early computers generally processed instructions one at a time, with each instruction being processed in four sequential stages: instruction fetch, instruction decode, execute and result write-back. Within such early computers, different logic blocks performed each processing stage, and each logic block waited until all the preceding logic blocks completed before performing its operation.
To improve efficiency, processor designers now overlap operation of the processing stages. This enables a processor to operate on several instructions simultaneously. During a given time period, the fetch, decode, execute and write-back logic stages process different sequential instructions in a computer's instruction stream at the same time. At the end of each clock period, the result of each processing stage proceeds to the next processing stage.
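The overlap described above can be sketched with a small model. This is an illustrative sketch only: it assumes an ideal four-stage pipeline in which every stage takes exactly one clock period and nothing stalls, and the function names are hypothetical rather than taken from any particular processor.

```python
# Illustrative sketch: an ideal four-stage pipeline, one clock period per
# stage, no stalls (assumptions; real pipelines stall on hazards and branches).
STAGES = ["fetch", "decode", "execute", "write-back"]


def pipeline_schedule(num_instructions):
    """Return {cycle: {stage: instruction_index}} for the ideal pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(STAGES):
            # Instruction `instr` enters fetch in cycle `instr` and moves to
            # the next stage at the end of each clock period.
            schedule.setdefault(instr + depth, {})[stage] = instr
    return schedule


def sequential_cycles(num_instructions):
    """Cycles needed when each instruction finishes before the next starts."""
    return num_instructions * len(STAGES)


def pipelined_cycles(num_instructions):
    """Cycles needed when stages overlap: fill once, then one result per cycle."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for cycle, busy in sorted(pipeline_schedule(5).items()):
        print(cycle, busy)
    print(sequential_cycles(5), pipelined_cycles(5))
```

Under these assumptions, five instructions need 20 cycles when processed one at a time and 8 cycles when the stages overlap, and in cycle 3 all four stages hold different instructions.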
Processors that use this technique of overlapping processor stages are known as "pipelined" processors. Some processors further divide each stage into sub-stages for additional performance improvement. Such processors are referred to as "deeply pipelined" processors.
In order for a pipelined processor to operate efficiently, an instruction fetch unit at the head of the pipeline must continually provide the pipeline with a stream of processor instructions. However, branch instructions within an instruction stream prevent the instruction fetch unit from fetching subsequent instructions until the branch condition is fully resolved. In pipelined processors, the branch condition will not be fully resolved until the branch instruction reaches the instruction execution stage near the end of the processor pipeline. Hence, the instruction fetch unit will stall when an unresolved branch condition prevents it from knowing which instruction to fetch next.
To alleviate this problem, some pipelined processors use branch prediction mechanisms to predict the outcome of branch instructions. This typically involves predicting the target of a branch instruction as well as predicting whether the branch is taken or not. These predictions are used to determine a predicted path for the instruction stream in order to fetch subsequent instructions. When a branch prediction mechanism predicts the outcome of a branch instruction, and the processor executes subsequent instructions along the predicted path, the processor is said to have "speculatively executed" along the predicted instruction path. During speculative execution, the processor is performing useful work if the branch instruction was predicted correctly. However, if the branch prediction mechanism mispredicted the result of the branch instruction, the processor is speculatively executing instructions down the wrong path and is not performing useful work. When the processor eventually detects the mispredicted branch, the processor must flush all the speculatively executed instructions and restart execution from the correct address.
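The two parts of a prediction (direction and target) and the flush on a misprediction can be sketched as follows. This is a minimal sketch under stated assumptions: the `Prediction` type, the function names, and the fixed 4-byte instruction size are hypothetical and are not part of the system described here.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    taken: bool   # predicted direction of the branch
    target: int   # predicted branch target address


def predicted_next_address(branch_address, prediction, instruction_bytes=4):
    """Address to fetch next along the predicted path.

    instruction_bytes=4 is an assumption (fixed-width instructions).
    """
    if prediction.taken:
        return prediction.target
    return branch_address + instruction_bytes


def resolve_branch(branch_address, prediction, actual_taken, actual_target,
                   speculative, instruction_bytes=4):
    """Compare the prediction with the resolved branch.

    Returns (restart_address, kept_instructions). restart_address is None
    when the prediction was correct, so the speculative work is kept.
    """
    if actual_taken:
        correct = actual_target
    else:
        correct = branch_address + instruction_bytes
    if predicted_next_address(branch_address, prediction,
                              instruction_bytes) == correct:
        return None, list(speculative)
    # Misprediction: flush all speculative work and restart at the
    # correct address.
    return correct, []
```

For example, a branch at address 0x1000 that is predicted taken to 0x2000 but resolves as not taken yields a restart address of 0x1004 and an empty list of kept instructions.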
As processor cycle times continue to decrease, the branch prediction critical path must be modified so that it can operate with the decreased cycle time. This can be accomplished by either (1) simplifying the branch prediction architecture and/or reducing the size of branch prediction tables and related structures so that a branch can be predicted within a reduced cycle time, or (2) extending the branch prediction operation over more cycles.
FIG. 1 illustrates fetch pipeline execution timing for a system with single-cycle branch prediction. In FIG. 1, the operations associated with a given fetch group are represented by rows. For example, the first row represents pipeline stages associated with fetch group one. (A fetch group is a block of consecutive instructions that is retrieved from a computer system's memory and stored in the computer system's instruction cache.) The operations associated with fetch group one in the first row include address generation 102, instruction cache 0 (I-cache-0) latency 104, I-cache-1 latency 106 and I-cache-2 latency 108. The operations associated with fetch group two in the second row include address generation 110, I-cache-0 latency 112, I-cache-1 latency 114 and I-cache-2 latency 118. The operations associated with fetch group three in the third row include address generation 120, I-cache-0 latency 122, I-cache-1 latency 124 and I-cache-2 latency 126.
During an address generation stage, the computer system generates a predicted address for the next instruction. This predicted address may be a predicted branch target address, or it may be another address (as will be described below). Once this predicted address is generated, it is used to retrieve an instruction from the I-cache during the next three successive pipeline stages.
In the example illustrated in FIG. 1, the address generation stages 102, 110 and 120 take a single clock cycle. This works well for computer systems with long cycle times. However, as cycle times get progressively shorter, the address generation stage must be greatly simplified and/or the size of lookup tables within the address generation stage must be reduced in order to perform the address generation within one clock cycle. Consequently, the resulting prediction will tend to be less accurate, and computer system performance may suffer.
FIG. 2 illustrates pipeline execution timing for a system that extends the branch prediction operation over two clock cycles. This allows for more accurate branch prediction than is provided by the single-cycle scheme. The operations associated with fetch group one in the first row include branch target lookup 202, address generation 204, I-cache-0 latency 206, I-cache-1 latency 208 and I-cache-2 latency 210. The operations associated with fetch group two in the second row include branch target lookup 212, address generation 214, I-cache-0 latency 216, I-cache-1 latency 218 and I-cache-2 latency 220. The operations associated with fetch group three in the third row include branch target lookup 222, address generation 224, I-cache-0 latency 226, I-cache-1 latency 228 and I-cache-2 latency 230.
Note that providing two cycles for branch prediction may introduce a pipeline bubble, as is illustrated in FIG. 2. In this example, fetch group two is located at the next sequential address after fetch group one. Hence, there is only a one-cycle delay between fetch group one and fetch group two. However, a branch is taken in fetch group two. Consequently, generating the predicted address for fetch group three requires two pipeline stages: branch target lookup 212 and address generation 214. This means fetch group three cannot proceed until the result of address generation 214 for fetch group two becomes available after two cycles. Hence, there is a two-cycle delay between fetch group two and fetch group three.
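The timing difference between the schemes of FIG. 1 and FIG. 2 can be sketched with a simple model. This is a sketch under stated assumptions, not a description of the figures themselves: it assumes that a sequential successor can start one cycle after its predecessor, that a successor of a taken branch must wait for the full prediction latency, and that each fetch group is summarized by a single hypothetical flag indicating whether it contains a taken branch.

```python
def fetch_start_cycles(taken_flags, prediction_cycles):
    """Return the cycle in which each fetch group starts.

    taken_flags[i] is True when fetch group i contains a taken branch.
    prediction_cycles is the number of cycles needed to predict a branch
    target (1 for the single-cycle scheme, 2 for the two-cycle scheme).
    """
    if not taken_flags:
        return []
    starts = [0]
    for taken in taken_flags[:-1]:
        # Assumption: a sequential successor starts one cycle later; a
        # successor of a taken branch waits for the whole prediction.
        delay = prediction_cycles if taken else 1
        starts.append(starts[-1] + delay)
    return starts


def bubble_cycles(taken_flags, prediction_cycles):
    """Count idle cycles introduced relative to one fetch group per cycle."""
    starts = fetch_start_cycles(taken_flags, prediction_cycles)
    if not starts:
        return 0
    return starts[-1] - (len(starts) - 1)


if __name__ == "__main__":
    groups = [False, True, False]  # a branch is taken in fetch group two
    print(fetch_start_cycles(groups, 1), bubble_cycles(groups, 1))
    print(fetch_start_cycles(groups, 2), bubble_cycles(groups, 2))
```

With a branch taken in fetch group two, the single-cycle model starts the three groups in cycles 0, 1 and 2 with no bubble, while the two-cycle model starts them in cycles 0, 1 and 3, which is the one-cycle bubble between fetch group two and fetch group three.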
What is needed is a system for branch target prediction that suffers from neither the poor prediction accuracy of a single-cycle branch prediction scheme nor the pipeline bubbles of a multiple-cycle branch prediction scheme.