Pipeline processors decompose the execution of instructions into multiple successive stages, such as fetch, decode, and execute. Each stage of execution is designed to perform its work within the processor's basic machine cycle. Hardware is dedicated to performing the work defined by each stage. As the number of stages is increased, while keeping the work done by the instruction constant, the processor is said to be more heavily pipelined. Each instruction progresses from stage to stage, ideally with another instruction progressing in lockstep only one stage behind. Thus, there can be as many instructions in execution as there are pipeline stages.
The major attribute of a pipelined processor is that a throughput of one instruction per cycle can be obtained, though when viewed in isolation, each instruction requires as many cycles to perform as there are pipeline stages. Pipelining is viewed as an architectural technique for improving performance over what can be achieved via process or circuit design improvements.
The increased throughput promised by the pipeline technique is easily achieved for sequential control flow. Unfortunately, programs experience changes in control flow as frequently as one out of every three executed instructions. Taken branch instructions are a principal cause of changes in control flow. Taken branches include both conditional branches that are ultimately decided as taken and unconditional branches. Taken branches are not recognized as such until the later stages of the pipeline. If the change in control flow were not anticipated, there would be instructions already in the earlier pipeline stages, which due to the change in control flow, would not be the correct instructions to execute. These undesired instructions must be cleared from each stage. In keeping with the pipeline metaphor, the instructions are said to be flushed from the pipeline.
The instructions to be first executed where control flow resumes following a taken branch are termed the branch target instructions (target instructions). The first of the target instructions is at the branch target address (target address). If the target instructions are not introduced into the pipeline until after the taken branch is recognized as such and the target address is calculated, there will be stages in the pipeline that are not doing any useful work. Since this absence of work propagates from stage to stage, the term pipeline bubble is used to describe this condition. The throughput of the processor suffers whenever such bubbles occur.
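The throughput cost of such bubbles can be illustrated with a simple cycle count. The following is only a sketch under stated assumptions: the function name `total_cycles`, the fixed per-branch bubble length, and the example figures (a five-stage pipeline and three bubble cycles per taken branch) are hypothetical and are not taken from any particular machine.

```python
def total_cycles(n_instructions, n_stages, taken_branches, bubble_cycles):
    """Cycles to run a program on an idealized pipeline with no branch prediction.

    Assumptions (illustrative only): one instruction enters the pipeline per
    cycle, the pipeline starts empty, and every taken branch costs a fixed
    number of bubble cycles before its target instructions enter the pipeline.
    """
    fill = n_stages - 1  # cycles spent filling the pipeline before the first result
    return fill + n_instructions + taken_branches * bubble_cycles


# 300 instructions on a 5-stage pipeline, purely sequential control flow.
sequential = total_cycles(300, 5, taken_branches=0, bubble_cycles=3)

# Same program with a change in control flow on one of every three instructions.
branchy = total_cycles(300, 5, taken_branches=100, bubble_cycles=3)

print(sequential, branchy)
```

Under these assumed figures the cycle count roughly doubles, which is the loss that branch prediction is intended to recover.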
Branch Prediction Caches (BPCs), also known as Branch Target Buffers (BTBs), are designed to reduce the occurrence of pipeline bubbles by anticipating taken branches. BPCs store information about branches that have been previously encountered. An Associative Memory is provided in which an associatively addressed tag array holds the address (or closely related address) of recent branch instructions. The data fields associated with each tag entry may include information on the target address, the history of the branch (taken/not taken), and branch target instruction bytes. The history information may take the form of N-bits of state (N is typically 2), which allows an N-bit counter to be set up for each branch tracked by the BPC.
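The N-bit history state can be sketched as a saturating up/down counter. This is a minimal illustration, not the mechanism of any specific BPC: the class name `SaturatingCounter`, the initial state, and the rule that the upper half of the counter range predicts taken are assumptions made for the example.

```python
class SaturatingCounter:
    """N-bit saturating up/down counter holding one branch's history state."""

    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        # Assumption: a new entry starts in the weakest "taken" state.
        self.value = (self.max + 1) // 2

    def predict_taken(self):
        # Assumption: the upper half of the range predicts taken.
        return self.value > self.max // 2

    def update(self, taken):
        # Count up on taken and down on not taken, saturating at both ends.
        if taken:
            self.value = min(self.max, self.value + 1)
        else:
            self.value = max(0, self.value - 1)
```

With two bits, a branch that has been taken repeatedly must be observed not taken twice in a row before the prediction changes, so a single anomalous outcome, such as a loop exit, does not flip the prediction.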
The fetch addresses used by the processor are coupled to the branch address tags. If a hit occurs, the instruction at the fetch address causing the hit is presumed to be a previously encountered branch. The history information is accessed and a prediction on the direction of the branch is made based on a predetermined algorithm. If the branch is predicted not taken, then the pipeline continues as usual for sequential control flow. If the branch is predicted taken, fetching is performed from the target address instead of the next sequential fetch address. If target instruction bytes were cached, then these bytes are retrieved directly from the BPC. Because a BPC is used, many changes in control flow are anticipated, such that the target instructions of taken branches contiguously follow such branches in the pipeline. When anticipated correctly, changes in control flow due to taken branches do not cause pipeline bubbles and the associated reduction in processor throughput. Such bubbles occur only when branches are mispredicted. Conventionally, instructions fetched from the predicted direction (either taken or not-taken) of a branch are not allowed to modify the state of the machine until the branch direction is resolved. Operations normally may proceed only up to the point at which their results would be written in a way that modifies the programmer-visible state of the machine. If the branch is actually mispredicted, then the processor can flush the pipeline and begin anew in the correct direction, without any trace of having predicted the branch incorrectly. Further instruction issue must be suspended until the branch direction is resolved. A pipeline interlock is thus provided to handle this instruction dependency. Waiting for resolution of the actual branch direction is thus another source of pipeline bubbles.
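The fetch-time lookup described at the start of the preceding paragraph can be sketched as follows. The function name `next_fetch_address`, the dictionary keys, and the fixed fetch width are hypothetical, and a Python dictionary stands in for the associatively addressed tag array.

```python
def next_fetch_address(bpc, fetch_addr, fetch_width=4):
    """Return (next_address, cached_target_bytes) for one fetch.

    `bpc` maps a branch address tag to a dict with the hypothetical keys
    'predict_taken', 'target', and optionally 'target_bytes'.
    """
    entry = bpc.get(fetch_addr)
    if entry is None or not entry["predict_taken"]:
        # Miss, or hit predicted not taken: continue with sequential fetch.
        return fetch_addr + fetch_width, None
    # Hit predicted taken: redirect fetch to the cached target address and
    # supply any cached target instruction bytes directly.
    return entry["target"], entry.get("target_bytes")


bpc = {
    0x100: {"predict_taken": True, "target": 0x400, "target_bytes": b"\x90\x90"},
    0x200: {"predict_taken": False, "target": 0x800},
}
print(next_fetch_address(bpc, 0x100))  # redirected to the cached target
print(next_fetch_address(bpc, 0x200))  # predicted not taken, sequential
print(next_fetch_address(bpc, 0x300))  # miss, sequential
```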
It is possible to perform speculative execution (also known as conditional, or out-of-order execution) past predicted branches, if additional state is provided for backing up the machine state upon mispredicted branches. Speculative execution beyond an unresolved branch can be done whether the branch is predicted taken or not-taken. An unresolved branch is a branch whose true taken or not-taken status has yet to be decided. Such branches are also known as outstanding branches.
Pipelining was extensively examined in "The Architecture of Pipelined Computers," by Peter M. Kogge (McGraw-Hill, 1981). A more recent treatment was provided by chapter 6 of "Computer Architecture, A Quantitative Approach," by J. L. Hennessy and D. A. Patterson (Morgan Kaufmann, 1990). Branch prediction and the use of a BTB are taught in section 6.7 of the Hennessy text. The Hennessy text chapter references provided pointers to several notable pipelined machines and to several contemporary papers on reducing branch delays. D. R. Ditzel and H. R. McLellan, "Branch folding in the CRISP microprocessor: Reducing the branch delay to zero," Proceedings of the 14th Symposium on Computer Architecture, June 1987, Pittsburgh, pg. 2-7, provided a short historical overview of hardware branch prediction. J. K. F. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, Vol. 17, January 1984, pg. 6-22, provided a thorough introduction to branch prediction. Three more recent works include 1) "Branch Strategy Taxonomy and Performance Models," by Harvey G. Cragon (IEEE Computer Society Press, 1992), 2) "Branch Target Buffer Design and Optimization," by C. H. Perleberg and A. J. Smith, IEEE Transactions on Computers, Vol. 42, April 1993, pg. 396-412, and 3) "Survey of Branch Prediction Strategies," by C. O. Stjernfeldt, E. W. Czeck, and D. R. Kaeli (Northeastern University technical report CE-TR-93-05, Jul. 28, 1993).
Several recent commercial machines have employed branch prediction. The AMD Am29050 (TM) Microprocessor had a 256-entry Branch Target Cache (BTC) that cached target addresses and target instruction bytes. The operation of the Am29050 BTC was described in the Am29050 Microprocessor User's Manual, 1991. A similar BTC was used in the GE RPM40, according to Perleberg and Smith. Perleberg and Smith also reported that the Mitsubishi M32 had a BTB that cached prediction information, branch addresses, and target instruction bytes. The IBM Enterprise System/9000 (TM) 520-based models had a 4096-entry Branch History Table (BHT) that cached branch addresses and target addresses. The operation of the 520-based machines was described in the July 1992 issue of the IBM Journal of Research and Development. The Intel Pentium (TM) Microprocessor had a 256-entry BTB that cached branch addresses, target addresses, and 2 bits of history information. The operation of the Pentium BTB was described in the Mar. 29, 1993 issue of Microprocessor Report (MicroDesign Resources, 1993).
The principles of out-of-order execution are also well known in the art. As background, out-of-order execution in the IBM System/360 Model 91 was discussed in section 6.6.2 of Kogge. The January 1967 issue of the IBM Journal of Research and Development was devoted to the Model 91. More recently, the aforementioned IBM Enterprise System/9000 520-based models performed speculative execution.
U.S. Pat. No. 5,226,126, ('126) PROCESSOR HAVING PLURALITY OF FUNCTIONAL UNITS FOR ORDERLY RETIRING OUTSTANDING OPERATIONS BASED UPON ITS ASSOCIATED TAGS, to McFarland et al., issued Jul. 6, 1993, which is assigned to the assignee of the present invention, described speculative execution in the system in which the instant invention is used, and is hereby incorporated by reference.
U.S. Pat. No. 5,093,778, ('778) INTEGRATED SINGLE STRUCTURE BRANCH PREDICTION CACHE, to Favor et al., issued Mar. 3, 1992, which is assigned to the assignee of the present invention, teaches the implementation of the various components comprising a branch prediction cache as one integrated structure, and is hereby incorporated by reference. An integrated structure provides for reduced interconnect delays and lower die costs, due to smaller size. The '778 BPC was designed for use in a processor that uses out-of-order (speculative) execution. The '778 BPC caches branch addresses, history information, target addresses, and target instruction bytes.
U.S. Pat. No. 5,226,130 ('130) METHOD AND APPARATUS FOR STORE-INTO-INSTRUCTION-STREAM DETECTION AND MAINTAINING BRANCH PREDICTION CACHE CONSISTENCY, to Favor et al., issued Jul. 6, 1993, which is assigned to the assignee of the present invention, teaches the use of a BPC for detecting stores into the instruction stream and stores to instructions held within the BPC, and is hereby incorporated by reference.
U.S. Pat. No. 5,230,068 ('068) CACHE MEMORY SYSTEM FOR DYNAMICALLY ALTERING SINGLE CACHE MEMORY LINE AS EITHER BRANCH TARGET ENTRY OR PREFETCH INSTRUCTION QUEUE BASED UPON INSTRUCTION SEQUENCE, to Van Dyke et al., issued Jul. 20, 1993, which is assigned to the assignee of the present invention, teaches the use of lines in the BPC for either branch target entries or as instruction queues, and is hereby incorporated by reference.
BPCs have previously maintained a single entry in the tag array for each branch address. The data fields associated with each branch address tag held a single target address. This target address can change for a variety of reasons. Such changes are not discoverable until late in the pipeline. If the target address is different from that held in the BPC, it is said to be a mispredicted target address. If the target address is mispredicted, the target instruction bytes associated with the address will also be incorrect. This is true whether or not the target bytes were cached. A mispredicted target address will result in a pipeline bubble just as a mispredicted direction would.
Return (RET or RTN) instructions pose a problem for the previously described BPC-based branch prediction approaches. RTN instructions are unconditional transfers that terminate subroutines by transferring control flow back to the instruction immediately following the CALL instruction that invoked the subroutine. The address of the instruction after the CALL, called the return address, is commonly stored on a stack maintained in the main memory of the processor. Generally, subroutines are called from many different program locations. Because a subroutine can have multiple callers, there can be multiple target addresses associated with a RTN. Because the target address can be constantly changing, RTNs can be constantly mispredicted. The BPC will update the target address upon every misprediction, possibly thrashing between a fixed set of two or more addresses in a "ping-pong" like manner.
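The "ping-pong" behavior can be illustrated by counting target mispredictions for one RTN whose BPC entry holds a single target address. This is a sketch only: the function name `count_return_mispredictions` and the example return addresses are hypothetical, and the first encounter of the RTN is counted as a misprediction because no target is yet cached.

```python
def count_return_mispredictions(return_targets):
    """Count target mispredictions for one RTN tracked by a single-target entry.

    `return_targets` is the sequence of actual return addresses, one per
    execution of the RTN. The cached target is overwritten on every miss.
    """
    cached_target = None
    mispredictions = 0
    for actual in return_targets:
        if cached_target != actual:
            mispredictions += 1
            cached_target = actual  # the BPC updates the target on a misprediction
    return mispredictions


# One caller: only the first encounter misses.
same_caller = [0x104] * 8

# Two callers alternating: the entry thrashes between the two return addresses.
alternating = [0x104, 0x208] * 4

print(count_return_mispredictions(same_caller))
print(count_return_mispredictions(alternating))
```

With alternating callers, every execution of the RTN is mispredicted even though the branch direction is always correct.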
The reduction of branch delays associated with return instructions was addressed in U.S. Pat. No. 4,399,507 ('507), INSTRUCTION ADDRESS STACK IN THE DATA MEMORY OF AN INSTRUCTION-PIPELINED PROCESSOR, to Cosgrove et al., issued Aug. 16, 1983. This invention teaches the on-chip caching (in the processor) of the top of a return address stack, the stack being kept in off-chip storage. When a fetched instruction is recognized as being a return instruction, the on-chip return address storage permits directly fetching the target of the return. It is not necessary to first fetch the return address from off-chip storage.
A first significant aspect of the '507 approach is that it makes no provision for branches other than return instructions. A second significant aspect of this invention is that only the return address for a single RET instruction is cached on-chip. Following a RET, the on-chip return address cache is updated using otherwise unused pipeline cycles. In more general or aggressive implementations, such unused cycles may not be available. A third significant aspect of this approach is that the RET instruction must proceed to the stage at which decoding is performed before the target instruction bytes can be fetched. A fourth significant aspect of this approach is that no provisions are made for caching target instruction bytes.
The problem of multiple target addresses for a given branch address was addressed previously in U.S. Pat. No. 4,725,947 ('947), DATA PROCESSOR WITH A BRANCH TARGET INSTRUCTION STORAGE, to Shonai et al., issued Feb. 16, 1988. This invention teaches the use of a 128K-entry two-way set-associative target instruction cache whose tags include register specifier fields from the branch instruction along with the branch address. The register specifier fields are those that would be used by the branch instruction to generate the target address. Every taken branch is cached and the entry marked valid. If there is a tag hit, the branch is predicted taken and the cached target instruction bytes are provided directly to the instruction buffer, avoiding the need to fetch the target bytes. Hits on branches that are subsequently not taken cause the tag to be invalidated, such that subsequent hits are not possible. Upon every hit, whether the branch is taken or not, all fields except the valid bit are rewritten as part of the LRU-Replacement scheme.
A first significant aspect of the '947 approach is that the register specifier fields of the branch instruction are not available initially. As a result, the branch instruction must proceed to the stage at which partial decoding is performed before the CAM can be accessed. A second significant aspect of this approach is that it makes no provision for distinguishing between multiple target addresses for a RET instruction, which has no register specifier fields for generating the target address. The target address associated with a RET instruction must be retrieved from the stack. A third significant aspect of '947 is that, other than the RAM array itself, it does not represent an integrated solution.
The difficulty of correctly predicting branch target addresses associated with the subroutine call/return paradigm was dealt with in "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns," by D. R. Kaeli and P. G. Emma, in Proceedings of the 18th Annual International Symposium on Computer Architecture, 1991, pgs. 34-42. Kaeli and Emma proposed and simulated a Branch History Table (BHT) used in conjunction with separate Call and Return "Stacks."
The stacks were rather unconventional. In addition to implementing conventional push-down behavior, it was implied that the Call and Return Stacks were also fully associative memories. In the event of a hit, an entry at any depth could be read. In the event of multiple tag matches, it was further implied that priority logic was used to qualify only the topmost matching entry. Furthermore, corresponding entries in the two stacks were bidirectionally coupled with each other. The purpose of the coupling was to permit a hit in the Return Stack to be used to read an entry from the Call Stack, and vice versa.
The BHT was largely conventional, having fields for branch and target addresses and "predictions." The only modification to the BHT was the addition of a new bit field that could designate each entry as special. If an entry was designated as special, the target address field held a key used to access the Call Stack. If an entry was not special, the target address field supplied the target address directly, as was done conventionally.
In the absence of Call and Return instructions, the BHT functioned conventionally. In addition to establishing a conventional entry in the BHT, Call instructions caused the target address of the Call (the start of the subroutine) to be pushed onto the Call Stack and the return address (the next sequential address after the Call) to be pushed onto the Return Stack.
Executing Return instructions caused a special entry to be established in the BHT, when one did not previously exist. Specifically, the target address for the return (the previously mentioned return address) was presented to the Return Stack to check for a hit. In the event of a hit, the target address in the corresponding entry in the Call Stack supplied the target address for the BHT entry. The BHT entry was marked as special.
Once a special entry had been established, a hit in the BHT would occur next time the same Return address was prefetched. Handling of the hit was modified due to the entry being marked special. As mentioned previously, the target address field was presented to the Call Stack to check for a hit. In the event of a hit, the target address in the corresponding entry in the Return Stack supplied the target address used for the prediction.
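The interaction of the special BHT entries with the coupled stacks can be sketched as follows. This is a simplified illustration of the behavior as described above, not a reproduction of the Kaeli and Emma design: the class and method names are hypothetical, the two coupled stacks are modeled as a single list of pairs, conventional (non-special) BHT entries are omitted, and the removal of a matched pair on a return is an assumption about the push-down behavior.

```python
class CallReturnPredictor:
    """Special BHT entries backed by coupled Call and Return Stacks (sketch)."""

    def __init__(self):
        # Each element couples a Call Stack entry (subroutine entry address)
        # with a Return Stack entry (return address). The top is the last item.
        self.pairs = []
        # Special BHT entries: return instruction address -> Call Stack key.
        self.bht = {}

    def on_call(self, call_target, return_address):
        # A Call pushes its target and its return address onto the two stacks.
        self.pairs.append((call_target, return_address))

    def predict_return(self, return_instr_addr):
        # BHT hit on a special entry: present the key to the Call Stack and
        # read the coupled Return Stack entry of the topmost match.
        key = self.bht.get(return_instr_addr)
        if key is None:
            return None
        for call_target, return_address in reversed(self.pairs):
            if call_target == key:
                return return_address
        return None

    def on_return_executed(self, return_instr_addr, actual_target):
        # Present the actual return address to the Return Stack. On a hit, the
        # coupled Call Stack entry becomes the key of a special BHT entry.
        for i in range(len(self.pairs) - 1, -1, -1):
            call_target, return_address = self.pairs[i]
            if return_address == actual_target:
                self.bht[return_instr_addr] = call_target
                del self.pairs[i:]  # assumption: pop the matched pair and above
                return


p = CallReturnPredictor()
p.on_call(0x1000, 0x104)            # first caller
first = p.predict_return(0x1010)    # no special entry yet
p.on_return_executed(0x1010, 0x104)
p.on_call(0x1000, 0x208)            # second caller
second = p.predict_return(0x1010)   # predicted from the Return Stack
print(first, hex(second))
```

Because the BHT entry holds a key rather than a fixed target, the prediction follows whichever caller is currently on top of the stacks.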
In summary, in the Kaeli and Emma approach, subroutine returns were specially designated in the BHT. Only one entry was established in the BHT for each subroutine return. The target address for subroutine returns came not from the BHT, but from the linked Call/Return Stacks. A first significant aspect of the Kaeli and Emma approach is that only one entry is maintained in the branch prediction cache for a return instruction, no matter how many callers the subroutine may have. A second significant aspect is that no provisions are made for the caching of target instruction bytes. Thus, Kaeli and Emma do not teach how to provide target instruction bytes for returns associated with subroutines having multiple callers. A third significant aspect is that an associative dual-stack structure with associated complex interconnect and control is required.
The use of a return address stack in conjunction with a branch prediction cache was also taught in U.S. Pat. No. 5,136,696 ('696), HIGH-PERFORMANCE PIPELINED CENTRAL PROCESSOR FOR PREDICTING THE OCCURRENCE OF EXECUTING SINGLE-CYCLE INSTRUCTIONS AND MULTICYCLE INSTRUCTIONS, to Beckwith et al., issued Aug. 4, 1992. '696 was focussed specifically on the execution of multicycle instructions using microinstructions in an instruction-cache-based interpreter.
In '696, the branch prediction cache was largely conventional, having fields for branch and target addresses and "predictions." The only modification to the branch prediction cache was the addition of a new 2-bit prediction-type field that could designate each entry as either a normal, branch, interpreter call, or interpreter return prediction. The target address field was only used for branch and interpreter call prediction types.
In the absence of multicycle instructions, the branch prediction cache functioned conventionally. In the event of a multicycle instruction, an interpreter call entry was established in the branch prediction cache. Subsequently, if a hit occurred on an entry marked interpreter call prediction, the program counter was loaded from the target address field of the branch prediction cache. Additionally, the return address was pushed onto the return address stack. If a hit occurred on an entry marked interpreter return prediction, the program counter was loaded from the top of the return address stack and not the target address field of the branch prediction cache.
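The handling of the four prediction types can be sketched as a next-program-counter selection. This is an illustration under stated assumptions rather than the '696 implementation: the names `PredType` and `next_pc` are hypothetical, the return address is assumed to be the next sequential address, a "normal" entry is assumed to cause no redirection, and the return address stack is assumed to be popped when its top is used.

```python
from enum import Enum


class PredType(Enum):
    NORMAL = 0
    BRANCH = 1
    INTERP_CALL = 2
    INTERP_RETURN = 3


def next_pc(cache, return_stack, pc, instr_len=4):
    """Select the next program counter from a branch prediction cache lookup.

    `cache` maps an address to a (PredType, target_address) tuple.
    `return_stack` is a list used as the return address stack.
    """
    sequential = pc + instr_len
    entry = cache.get(pc)
    if entry is None:
        return sequential
    kind, target = entry
    if kind is PredType.BRANCH:
        return target
    if kind is PredType.INTERP_CALL:
        # Load the PC from the target field and push the return address.
        return_stack.append(sequential)
        return target
    if kind is PredType.INTERP_RETURN and return_stack:
        # Load the PC from the top of the return address stack, not from
        # the target address field.
        return return_stack.pop()
    return sequential


cache = {
    0x100: (PredType.INTERP_CALL, 0x8000),
    0x8008: (PredType.INTERP_RETURN, None),
    0x200: (PredType.BRANCH, 0x300),
}
stack = []
trace = [0x100]
for _ in range(4):
    trace.append(next_pc(cache, stack, trace[-1]))
print([hex(a) for a in trace])
```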
A first significant aspect of the '696 approach is that only one entry is maintained in the branch prediction cache for a return instruction, no matter how many callers the subroutine may have. A second significant aspect is that no provisions are made for the caching of target instruction bytes. Thus, '696 does not teach how to provide target instruction bytes for returns associated with subroutines having multiple callers.