1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a decoupled fetch-execute engine with static branch prediction support.
2. Background Description
Early microprocessors generally processed instructions one at a time. Each instruction was processed using four sequential stages: instruction fetch; instruction decode; instruction execute; and result writeback. Within such microprocessors, different dedicated logic blocks performed each different processing stage. Each logic block waited until all the preceding logic blocks completed operations before beginning its operation.
Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. "Superscalar" and "Very Long Instruction Word" (VLIW) microprocessors have recently been introduced to implement parallel processing. They have multiple execution units (e.g., multiple integer arithmetic logic units (ALUs)) for executing instructions and, thus, have multiple "pipelines". As such, multiple machine instructions may be executed simultaneously in a superscalar or VLIW microprocessor, providing obvious benefits in the overall performance of the device and its system application.
For the purposes of this discussion, latency is defined as the delay between the fetch stage of an instruction and the execution stage of the instruction. Consider an instruction which references data stored in a specified register. Such an instruction requires at least four machine cycles to complete. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the instruction is executed and, in the fourth cycle, data is written back to the appropriate location.
To improve efficiency and reduce instruction latency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock pulse the result of each processing stage is passed to the subsequent processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as "pipelined" microprocessors. In principle, a pipelined microprocessor can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Thus, it is evident that the effects of the latency time are reduced in pipelined microprocessors by initiating the processing of a second instruction before the actual execution of the first instruction is completed.
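The throughput gain from overlapping the stages can be sketched with a simple cycle-count model (illustrative only; the four stage names come from the discussion above, while the assumption of one new instruction entering the pipeline per cycle with no stalls is a simplification):

```python
# Illustrative cycle-count model for a 4-stage pipeline:
# fetch, decode, execute, writeback.

STAGES = 4  # fetch, decode, execute, writeback

def sequential_cycles(n):
    """Non-pipelined machine: each instruction occupies all four stages
    before the next instruction may begin."""
    return n * STAGES

def pipelined_cycles(n):
    """Pipelined machine with no stalls: the first instruction takes
    STAGES cycles, and each later instruction completes one cycle
    after its predecessor."""
    return STAGES + (n - 1) if n > 0 else 0

print(sequential_cycles(100))  # 400
print(pipelined_cycles(100))   # 103
```

For a long, stall-free instruction sequence the pipelined machine approaches one completed instruction per cycle, as the text states.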
Unfortunately, various scenarios may exist where a stall is induced in a pipelined microprocessor. One such scenario is the branch instruction. In general, the instruction flow in a microprocessor requires that instructions are fetched and decoded from sequential locations in memory. A branch instruction is an instruction that causes a disruption in this flow, e.g., a taken branch causes decoding to be discontinued along the sequential path, and resumed at a new location in memory. The new location in memory may be referred to as a target address of the branch. Such an interruption in pipelined instruction flow results in a substantial degradation in pipeline performance.
As architecture and compiler designers continue to strive for greater degrees of parallelism, the effect of pipeline stall penalties on parallelism becomes very significant. For high levels of parallelism, the average number of cycles spent executing an instruction (CPI) must be much less than 1. Such a small CPI is only possible by minimizing the CPI penalties from stalls, thereby reducing their impact upon pipeline throughput. The problem of reducing stall penalties is aggravated by the potentially greater frequency of stalls due to higher instruction issue rates. It becomes necessary to find more capable methods for decreasing these penalties. Two common methods for reducing stall penalties include decoupled architectures and branch prediction.
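The sensitivity of CPI to stall penalties can be made concrete with a simple model (the issue width, branch frequency, misprediction rate, and penalty values below are hypothetical, chosen only to illustrate the point):

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    """Average cycles per instruction including branch-stall penalties:
    base CPI plus the expected penalty cycles incurred per instruction."""
    return base_cpi + branch_freq * mispredict_rate * penalty

# A hypothetical 4-issue machine (base CPI = 0.25) in which 20% of
# instructions are branches, 10% of branches are mispredicted, and each
# misprediction costs 5 stall cycles:
print(round(effective_cpi(0.25, 0.20, 0.10, 5), 2))  # 0.35
```

Even this modest stall rate raises CPI by 40% over the ideal, showing why stall penalties dominate once CPI must be well below 1.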
Decoupled architectures use buffering and control mechanisms to dissociate memory accesses from the rest of the pipeline. When a cache miss occurs, the decoupled architecture allows the rest of the pipeline to continue moving forward, only stalling those instructions dependent upon that cache access. Decoupling of cache accesses from the pipeline can be used with either instruction or data caches. Decoupling of the instruction cache from the execute pipeline, hereafter referred to as decoupled fetch-execute, is beneficial for both superscalar and EPIC/VLIW (Very Long Instruction Word) architectures. The EPIC architecture is further described by L. Gwennap, in "Intel, HP make EPIC disclosure", Microprocessor Report, 11(14):1-9, October 1997.
Decoupled fetch-execute architectures use instruction buffers and branch prediction to enable instruction fetching to be independent from the rest of the pipeline. The instruction buffers are organized as a queue that receives instructions as they are fetched from the instruction cache. As instructions enter the queue, a branch prediction mechanism checks for the existence of a branch instruction. When a branch is found, the branch prediction mechanism predicts the likely branch target (address) and direction. If necessary, the branch prediction mechanism redirects the instruction fetch to the predicted address.
Most general-purpose processors today use dynamic branch prediction mechanisms, which select at execution time the direction a branch is expected to take. Dynamic branch prediction mechanisms can include tables of prediction counters, history tables, and branch target buffers. Many of these schemes add considerable hardware, and may affect the processor frequency. Dynamic branch prediction schemes are described by: Calder et al., in "A System Level Perspective on Branch Architecture Performance", Proceedings of the 16th Annual International Symposium on Computer Architecture, pp. 199-206, May 1989; and Chang et al., in "Comparing software and hardware schemes for reducing the cost of branches", Proceedings of the 16th Annual International Symposium on Computer Architecture, pp. 224-233, May 1989.
Static branch prediction provides an alternate prediction method, as it corresponds to selecting at compile time the direction a branch is expected to take. Static branch prediction does not perform as well as dynamic branch prediction for most general-purpose applications, but does well in some application markets, so architectures for these markets may be able to forego the cost of dynamic branch prediction. Such markets include media processing and binary translation in software, which performs run-time compilation using dynamic profile statistics, enabling accurate static branch prediction. Media processing is further described by Fritts et al., in "Understanding multimedia application characteristics for designing programmable media processors", SPIE Photonics West, Media Processors '99, San Jose, Calif., January 1999. Binary translation in software is described by Altman et al., in "Execution-based Scheduling for VLIW Architectures", submitted to Euro-Par '99, Toulouse, France, September 1999.
In static branch prediction, conditional branches that are predicted as not taken, i.e. those expected to fall through to the sequential path, are easily supported since instruction fetch logic automatically continues sequentially. Unconditional branches and conditional branches that are predicted as taken, i.e. those expected to continue execution at a non-sequential target instruction, require support for redirecting the instruction fetch unit to begin prefetching the expected branch target prior to execution of the branch. It is desired that this prefetching begin immediately after the fetch of the branch instruction to enable execution of the expected branch target right after the branch executes.
One method for accomplishing this uses a prediction bit in the branch operation. After fetching the branch operation, the expected branch target address is sent to the instruction fetch unit if the prediction bit indicates taken. A problem with this method is that determination of the prediction direction and target address requires access to the contents of the branch operation. The expected branch target can only be fetched once the branch operation is returned by the instruction cache, the direction set by the prediction bit is determined, and the expected branch target address has been computed. FIG. 1 is a diagram illustrating the timing in fetching the predicted branch target. Option 1 corresponds to the desired timing for fetching the predicted branch target, which is right after beginning fetch of the branch operation. Option 2 corresponds to the actual fetch timing, which is delayed due to the need for the contents of the branch operation. As shown, in an instruction cache with f stages, the earliest the contents of the branch operation become available is f cycles after the fetch was initiated. However, the desired time to begin fetching the expected branch target is only 1 cycle after the branch begins being fetched. Consequently, the use of a prediction bit for performing static branch prediction will usually not allow ideal timing for fetching the predicted branch target, but will insert at least f-1 delay cycles between the branch and predicted target.
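Under the assumptions of FIG. 1, the delay introduced by the prediction-bit scheme reduces to a simple function of the instruction-cache depth f (a sketch of the timing argument above, not tied to any specific implementation):

```python
def extra_target_fetch_delay(f):
    """Extra cycles inserted between a branch and its predicted target
    under the prediction-bit scheme.  The desired timing (option 1)
    starts the target fetch 1 cycle after the branch fetch begins; the
    prediction-bit scheme (option 2) cannot start until the f-stage
    cache returns the branch contents, f cycles after fetch initiation."""
    desired = 1   # option 1: cycle after branch fetch begins
    actual = f    # option 2: earliest cycle branch contents are available
    return actual - desired  # i.e., f - 1 delay cycles

print(extra_target_fetch_delay(3))  # 2 extra delay cycles for f = 3
```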
An alternative technique is to issue a fetch hint operation (referred to hereinafter as a "prepare-to-branch (PBR)" operation) for the expected branch target. The PBR operation typically has one main field that indicates the address or displacement of the predicted branch target. Additional fields may include a register destination for the address of the expected branch target, or a predicate condition which indicates whether to execute the PBR operation. Such a predicate is particularly useful for implementing more intelligent static branch prediction methods, such as branch correlation. Performing static branch correlation without using predication can require substantial code duplication. A discussion of branch correlation is provided by: Smith et al., in "Improving the accuracy of static branch prediction using branch correlation", Proceedings of the 6th Annual International Conference on Architectural Support for Programming Languages and Operating Systems, October 1994; and Gloy et al., in "Performance issues in correlated branch prediction schemes", Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Mich., pp. 3-14, November 1995.
A critical aspect of the prepare-to-branch operation is timing. The PBR operation should be scheduled to begin fetching the expected branch target immediately after initiating fetching the corresponding branch operation, as indicated by option 1 in FIG. 1. The PBR operation cannot redirect fetching earlier as that will prevent the branch operation from being fetched. Also, the PBR operation should not redirect fetching later to avoid extra delay between the branch and the predicted target. Achieving this timing requires two mechanisms. First, a mechanism is necessary for associating the PBR operation with the branch it is predicting. This association is hereafter referred to as "pairing", and the corresponding branch is called the "paired branch". Second, a mechanism is necessary for recognizing that fetching of the paired branch has started and that fetching of the expected branch target may begin.
There are two principal approaches for implementing the prepare-to-branch operation. The first approach is commonly used in in-order lock-step pipelines. A lock-step pipeline is a pipeline in which all stages of the pipeline advance together. If any one pipeline stage cannot advance (for whatever reason), then none of the pipeline stages advance (i.e., all stages either "step" together, or "lock" together). The first approach is to schedule the PBR operation a fixed number of instructions before the branch. The fixed position of the branch with respect to the PBR operation serves both as the means for uniquely defining the paired branch as well as indicating when fetching of the expected branch target begins. The dependent nature of all pipeline stages in a lock-step pipeline ensures correct fetch timing in the fixed-position method. However, the fixed-position timing model is only effective on lock-step pipelines and cannot be used in decoupled fetch-execute architectures, which eliminate the dependency between the execution pipeline and instruction fetch pipeline. This approach is described in further detail by Patterson et al., in "RISC I: A Reduced Instruction Set VLSI Computer", Proceedings of the 8th Annual Symposium on Computer Architecture, April 1981.
The second approach for implementing the prepare-to-branch operation uses a register in the PBR operation for pairing with the branch operation. The branch operation uses the same register to provide its target address. The register name provides a means for pairing without necessitating a fixed position for the PBR operation, and allows greater scheduling freedom. Implementing timing for this technique requires first determining if the branch operation is available before starting the fetching of the predicted branch target. Availability of the branch operation can be determined by searching the newly fetched instructions, the instruction buffers, and the pipeline, for a branch operation using the same register as the PBR operation. Once the paired branch is found, fetching of the expected branch target may begin. Like the prediction bit scheme, this scheme also requires access to the contents of the branch operation before enabling the fetching of the expected branch target, so it too forces a minimum delay of f-1 cycles between the fetching of the branch and its predicted target. This scheme is described in further detail in the following: Kathail et al., "HPL PlayDoh Architecture Specification: Version 1.0", HPL-93-80, February 1994.
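The pairing search in this register-based scheme can be sketched as follows (the instruction representation and field names are hypothetical illustrations, not taken from the PlayDoh specification):

```python
# Sketch of pairing a PBR operation with its branch by register name.
# Instructions are modeled as dicts with hypothetical fields.

def find_paired_branch(pbr_reg, newly_fetched, buffers, pipeline):
    """Search the newly fetched instructions, the instruction buffers,
    and the pipeline for a branch whose target-address register matches
    the register named in the PBR operation.  Returns the paired branch
    if its contents are available, or None if it has not been fetched
    yet (in which case the target fetch must wait)."""
    for insn in list(newly_fetched) + list(buffers) + list(pipeline):
        if insn.get("op") == "branch" and insn.get("target_reg") == pbr_reg:
            return insn
    return None

# Usage: only once the paired branch is found may fetching of the
# expected branch target begin.
window = [{"op": "add"}, {"op": "branch", "target_reg": "b3"}]
print(find_paired_branch("b3", window, [], []) is not None)  # True
```

Note that the search needs the branch *contents* (its register field), which is exactly why this scheme shares the f-1 cycle minimum delay of the prediction-bit scheme.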
An alternative approach for pairing is to indicate the number of instructions after the PBR operation that the paired branch occurs. However, this approach is expected to require greater complexity for decoupled fetch-execute pipelines, particularly for implementations in explicitly parallel architectures with compressed instruction formats, where the size of an instruction is unknown prior to decoding. For an article describing this approach, see Goodman et al., "A simulation study of architectural data queues and prepare-to-branch instruction", Proceedings of the IEEE International Conference on Computer Design '84, Port Chester, N.Y., 1984, pp. 544-549.
Despite the existence of numerous static branch prediction schemes, the majority of such schemes have been designed for lock-step pipelines and, thus, do not adapt well to decoupled fetch-execute architectures.
Accordingly, it would be desirable and highly advantageous to have a method and apparatus for supporting static branch prediction on a decoupled fetch-execute pipeline.
The present invention is directed to a decoupled fetch-execute engine with static branch prediction support. The present invention allows the predicted branch target of a branch instruction to be fetched immediately after fetching the branch instruction. Contrary to existing static branch prediction methods, the contents of the branch operation are not required prior to fetching the predicted branch target.
According to a first aspect of the present invention, a method for prefetching targets of branch instructions in a computer processing system having instruction fetch decoupled from an execution pipeline includes the step of generating a prepare-to-branch (PBR) operation. The PBR operation includes address bits corresponding to a branch paired thereto and address bits corresponding to an expected target of the branch. The execution of the PBR operation is scheduled prior to execution of the paired branch to enforce a desired latency therebetween. Upon execution of the PBR operation, it is determined whether the paired branch is available using the address bits of the PBR operation corresponding to the paired branch. When the paired branch is available, the expected branch target is fetched using the address bits of the PBR operation corresponding to the expected branch target.
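The first aspect may be sketched as the following control flow (entirely illustrative; the field names, the fetch-unit interface, and the availability check are assumptions for exposition, not the patented implementation):

```python
# Sketch: a PBR operation carrying two address fields -- the address of
# its paired branch and the address of the expected branch target.
# The branch is located by its *address*, so the branch contents are
# never needed before redirecting the fetch unit.

class FetchUnit:
    """Minimal stand-in for an instruction fetch unit (hypothetical)."""
    def __init__(self, fetched_addresses):
        self.fetched = set(fetched_addresses)  # addresses already fetched
        self.next_fetch = None
    def has_fetched(self, addr):
        return addr in self.fetched
    def redirect(self, addr):
        self.next_fetch = addr

def execute_pbr(pbr, fetch_unit):
    """On PBR execution: use the PBR's branch-address bits to check
    whether fetching of the paired branch has started; if so, redirect
    fetch to the expected-target address bits."""
    if fetch_unit.has_fetched(pbr["branch_addr"]):
        fetch_unit.redirect(pbr["target_addr"])
        return True
    return False

fu = FetchUnit({0x100})
pbr = {"branch_addr": 0x100, "target_addr": 0x200}
print(execute_pbr(pbr, fu), hex(fu.next_fetch))  # True 0x200
```

The design point, per the text above, is that only the branch's address (already carried in the PBR operation) is consulted, so the predicted target can be fetched without waiting for the instruction cache to return the branch itself.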
According to a second aspect of the present invention, the method further includes the steps of decoding the paired branch. When a misprediction is detected, all operations following the paired branch are invalidated and the correct target of the paired branch is fetched.
According to a third aspect of the present invention, the step of fetching the correct target includes the step of fetching an instruction immediately following the paired branch, when the paired branch is mispredicted taken.
According to a fourth aspect of the present invention, the step of fetching the correct target includes the step of fetching an instruction corresponding to the target address specified in the paired branch, when the paired branch is mispredicted not taken.
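The recovery steps of the second through fourth aspects can be sketched together (illustrative only; the instruction fields and fixed branch size are assumptions):

```python
def recover_from_misprediction(branch, predicted_taken):
    """On detecting a misprediction when the paired branch is decoded:
    invalidate all operations following the branch, then fetch the
    correct target -- the fall-through instruction if the branch was
    mispredicted taken, or the branch's specified target address if it
    was mispredicted not taken."""
    if predicted_taken:
        # Mispredicted taken: correct path is the instruction
        # immediately following the paired branch.
        correct = branch["addr"] + branch["size"]
    else:
        # Mispredicted not taken: correct path is the target address
        # specified in the paired branch.
        correct = branch["target"]
    return {"invalidate_after": branch["addr"], "fetch": correct}

branch = {"addr": 0x100, "size": 4, "target": 0x200}
print(hex(recover_from_misprediction(branch, True)["fetch"]))   # 0x104
print(hex(recover_from_misprediction(branch, False)["fetch"]))  # 0x200
```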
These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.