1. Field of the Invention
The invention generally relates to the field of data processing, and the processing of branch instructions in computing machines. More particularly, the invention is directed to improved methods and apparatus for processing conditional branch instructions, including conditional branches that can vary in both outcome and test operand location.
2. Description of the Related Art
In high performance processors it is a common practice to decompose an instruction into several steps with each step being performed by one of a plurality of different step-processing units. Each step-processing unit typically has the capability of accepting a specific step, for successive instructions, every cycle.
For example, consider a pipeline whose stages are: i) instruction decode (DEC), ii) address generation (AGEN) for operand addresses (or in the case of branch instructions, for branch target addresses), iii) cache access (CACHE) to fetch operands (or in the case of branch instructions, to fetch branch target instructions), and iv) execute (EXEC) to perform the functional operation on the input operands as specified by the instruction. In the absence of branch instructions, the processor can decode a new instruction every cycle. Thus, in the exemplary pipeline four instructions can be in some phase of operation simultaneously.
It is common practice to overlap the successive steps in executing an instruction on a cycle by cycle basis with each following instruction having a one cycle offset. Ideally, this allows one instruction to be handled each cycle even though any given instruction takes several cycles to complete.
This ideal overlap is not always possible for several reasons. A major reason is the frequent occurrence of branch instructions.
When a branch is decoded, further decoding stops, since the branch target address must be computed (AGEN) and the target instruction must be fetched (CACHE) before it can enter the pipeline. Hence, even where the branch is unconditional, at least two cycles are "wasted" before processing can continue, resulting in a degradation of processor performance.
The unconditional branch referred to in the example is the least severe on processor performance with respect to taken branches. The unconditional branch transfers control from the branch instruction to the target instruction (TARG). That is, at the time that the branch instruction is decoded, it is known that the transfer of control (to TARG) will take place. A more costly (in terms of performance) branch instruction is the conditional branch. This instruction specifies that control is to be transferred to TARG only if some condition (as determined by the outcome of a previous instruction) is met.
A conditional branch instruction would cause a (nominal) penalty of one additional cycle in the exemplary pipeline, since a conditional branch must complete execution (EXEC) to determine that control is (or is not) to be transferred to TARG. If it is determined that control is not to be transferred to TARG, then the instruction that is to be decoded following the branch is the next sequential (by address) instruction to the branch. Thus, even when a conditional branch instruction is not taken, there is still a nominal delay of three cycles (in this example) associated with the branch.
Clearly, if it can be determined at decode time that a conditional branch instruction will not be taken, then there would be no penalty associated with the instruction, i.e., the next sequential instruction can be decoded immediately following the decode of the branch instruction. However, if it is determined at decode time that the branch instruction will be taken, then there is still a two-cycle penalty associated with the branch, i.e., the target address must be generated, and the target instruction must be fetched (but the extra cycle in which the branch is executed (EXEC) is saved in this case).
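By way of illustration only, the nominal delays discussed above for the exemplary four-stage pipeline may be tabulated with the following minimal Python sketch. The function name `branch_delay`, its parameters, and the one-cycle-per-stage constants are assumptions made for this illustration; they merely restate the cycle counts given in the text.

```python
# Nominal delay cycles in the exemplary DEC/AGEN/CACHE/EXEC pipeline.
# One cycle per stage is assumed, as in the text's example.
AGEN_CYCLES = 1   # generate the branch target address
CACHE_CYCLES = 1  # fetch the target instruction
EXEC_CYCLES = 1   # resolve the branch condition


def branch_delay(conditional, taken, known_at_decode=False):
    """Return the nominal delay (in cycles) before the next instruction decodes.

    known_at_decode means the branch outcome is correctly known at decode time.
    """
    if not conditional:
        # Unconditional branch: the outcome is known at decode, but the target
        # must still be generated and fetched.
        return AGEN_CYCLES + CACHE_CYCLES
    if known_at_decode:
        # No wait for EXEC; only a taken branch still pays for target
        # address generation and target fetch.
        return AGEN_CYCLES + CACHE_CYCLES if taken else 0
    # Outcome unknown until EXEC completes, whether or not the branch is taken.
    return AGEN_CYCLES + CACHE_CYCLES + EXEC_CYCLES
```

Under these assumptions, knowing the outcome at decode time saves three cycles for a not-taken conditional branch and one cycle for a taken conditional branch.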
A number of patents are directed to branch prediction mechanisms, each having certain advantages and disadvantages. For example, U.S. Pat. No. 4,370,711 to Smith discloses a branch predictor for predicting in advance the result of a conditional branch instruction in a computer system. The principle upon which the system is based is that a conditional branch instruction is likely to be decided in the same way as the instruction's most recent execution.
Other patents, such as U.S. Pat. No. 4,477,872 to Losq et al. and U.S. Pat. No. 4,430,706 to Sand, describe a decode-time prediction mechanism called a "Decode History Table" (DHT). A decode-time prediction mechanism, such as a DHT, will save (if it predicts correctly) three cycles for branches that are not taken, and one cycle for branches that are taken, in the exemplary pipeline.
The DHT is a table of entries where an entry is accessed based on a transformation (hash or truncation) on the bits that compose the address of a branch instruction. The entry itself comprises a single bit: the bit is set if the corresponding branch instruction was taken the last time that it was executed, otherwise the bit is not set.
When a conditional branch instruction is decoded, the DHT is accessed with the address of the branch instruction. If the DHT entry is set, then it is guessed that the branch will be taken; the target address is generated, and the target instruction is fetched, and decoded on the third cycle following the decode of the branch instruction (thereby saving one cycle of delay). If the DHT entry is not set, then it is guessed that the branch will not be taken; the next-sequential instruction is decoded on the cycle following the decode of the branch instruction (thereby saving three cycles of delay). If it is found that the DHT predicted erroneously (i.e., the prediction did not agree with the branch outcome as computed in EXEC), then the corresponding entry is corrected.
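The DHT behavior described above can be sketched in software as follows. This is a minimal illustration, not the circuit of the cited patents: the class name `DecodeHistoryTable`, the default table size, and the use of low-order address bits as the truncation are all assumptions.

```python
class DecodeHistoryTable:
    """One-bit-per-entry history table indexed by a truncated branch address."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        # One bit per entry; False means "not taken last time" (or no history).
        self.bits = [False] * num_entries

    def _index(self, branch_address):
        # Assumed transformation: truncation to the low-order address bits.
        return branch_address % self.num_entries

    def predict(self, branch_address):
        """Guess at decode time: True means 'taken'."""
        return self.bits[self._index(branch_address)]

    def update(self, branch_address, taken):
        """Record the outcome computed in EXEC, correcting a wrong entry."""
        self.bits[self._index(branch_address)] = taken
```

Because several branch addresses can truncate to the same index, two different branches may share (and overwrite) one entry in this sketch.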
A decode-time prediction mechanism offers an opportunity to avoid all penalty for not-taken branches, and the execution-time penalty (typically one cycle) for taken branches. Variations on decode-time mechanisms can only reduce branch penalty further via more accurate prediction. However, even in the limit (i.e., 100% accuracy), a decode-time mechanism cannot eliminate all branch penalty. Specifically, whenever there is a taken branch, there is a penalty equal to the time to generate the target address and fetch the target instruction. This is because a decode-time mechanism like the DHT provides a way of predicting the action, but not the target, of a conditional branch instruction. Therefore, the only way to reduce branch penalty even further is to anticipate taken branches and to fetch target instructions prior to the time that the branch instructions are actually encountered (decoded). So-called "prefetch-time prediction mechanisms" attempt to do this.
To achieve this further reduction in branch penalties an autonomous instruction-prefetching "engine" must exist. In the absence of a prefetch-time prediction mechanism per se, a simple prefetch engine may comprise: i) an incrementer used to "step" through sequential instruction addresses, ii) an instruction buffer for holding sequential instructions to be "consumed" by the decoder, iii) a means for using the sequential addresses produced by the incrementer to fetch sequential blocks of instructions from the cache and place them in the instruction buffer, and iv) a means for the processor to supply a new starting address (branch target address) to the incrementer in the event of a taken branch instruction. By "autonomous", it is meant that the engine is free-running (independent of the decoder) so that (in the absence of taken branches) the instruction buffer always contains next-sequential instructions to be consumed by the decoder. (Hence, there is no penalty for correctly guessed conditional branches that are not taken.)
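The four elements of the simple prefetch engine enumerated above can be modeled roughly as follows. This is a sketch under stated assumptions: the cache is represented by a plain Python mapping from address to instruction block, and the names `PrefetchEngine`, `step`, `redirect`, and `consume` are hypothetical.

```python
from collections import deque


class PrefetchEngine:
    """Free-running sequential prefetcher (illustrative model only)."""

    def __init__(self, cache, start_address, block_size=1, buffer_capacity=4):
        self.cache = cache                  # assumed: mapping address -> block
        self.next_address = start_address   # state of the incrementer (i)
        self.block_size = block_size
        self.capacity = buffer_capacity
        self.buffer = deque()               # instruction buffer (ii)

    def step(self):
        """One prefetch cycle: fetch the next sequential block if there is room (iii)."""
        if len(self.buffer) < self.capacity:
            self.buffer.append((self.next_address, self.cache[self.next_address]))
            self.next_address += self.block_size

    def redirect(self, target_address):
        """Taken branch: discard sequential prefetches and restart at the target (iv)."""
        self.buffer.clear()
        self.next_address = target_address

    def consume(self):
        """The decoder takes the oldest buffered block, or None if the buffer is empty."""
        return self.buffer.popleft() if self.buffer else None
```

In this model an empty buffer immediately after `redirect` corresponds to the taken-branch penalty: the decoder has nothing to consume until the target block has been fetched.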
A prefetch-time prediction mechanism is a mechanism that is incorporated into the prefetch engine (as opposed to a decode-time mechanism, which operates in conjunction with the decoder). A prefetch-time mechanism ensures that the instruction buffer contains the branch target instruction at the time that the branch instruction is decoded; if it is successful in this endeavor, then the branch target instruction can be decoded immediately following the decode of the branch instruction. Thus, a prefetch-time mechanism eliminates all branch penalty (even for taken branches) when it predicts correctly.
Most (if not all) prefetch-time prediction mechanisms are variations on the "Branch History Table" (BHT) as first described in U.S. Pat. No. 3,559,183 to Sussenguth, assigned to the same assignee as the present invention.
The strategy taught in the Sussenguth patent is based on the observation that most branches, considered individually, are consistently either taken or not taken and, if taken, will have a consistent target address. In this strategy a table of taken branches is constructed. Each entry in the table consists of the address of the taken branch followed by the target address of the branch. This table is a hardware construct and so it has a predetermined size, typically from 1024 to 4096 entries. Entries are made only for taken branches as they are encountered. When the table is full, making a new entry requires displacing an older entry. This can be accomplished on a Least Recently Used (LRU) basis as in caches.
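A fixed-size table of taken branches with LRU displacement, as described above, might be modeled in software as shown below. This is only an illustrative sketch: an actual BHT is a hardware structure, and the class and method names here are assumptions.

```python
from collections import OrderedDict


class BranchHistoryTable:
    """Table of taken branches: branch address -> target address, with LRU displacement."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from least to most recently used

    def lookup(self, branch_address):
        """Return the remembered target address, or None if there is no entry."""
        target = self.entries.get(branch_address)
        if target is not None:
            self.entries.move_to_end(branch_address)  # mark as most recently used
        return target

    def record_taken(self, branch_address, target_address):
        """Make (or refresh) an entry for a branch that was found to be taken."""
        self.entries[branch_address] = target_address
        self.entries.move_to_end(branch_address)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # displace the least recently used entry
```

For example, with a capacity of two, recording a third taken branch displaces whichever of the first two entries was used least recently.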
A BHT is the prefetch time analog of the Decode History Table. That is, BHT entries are accessed based on a transformation (hash or truncation) on the bits that compose the address of the block of instructions that is being prefetched. The entry itself is much more complex than a DHT entry, since the BHT is "blindly" operating at prefetch time, i.e., it is merely fetching blocks of instructions without the benefit of being able to examine the content of these blocks.
A BHT entry must be able to identify that the associated block of instructions contains a taken branch (based on the processor having previously encountered a taken branch within the block). Further, it must be able to identify where (within the block) the taken branch instruction resides, since the particular branch instruction may (or may not) be relevant to current instruction fetching depending on where the block is entered (i.e., depending on current branch activity). Finally, a BHT entry must specify the branch target address, so that prefetching can be immediately redirected down the target path should the particular branch be relevant to the current prefetch activity. Known BHTs have these abilities.
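The three pieces of information that a BHT entry must carry (that the block contains a taken branch, where in the block the branch resides, and the branch target address) can be illustrated with the following sketch. The block size, the `BlockEntry` type, and the function `next_prefetch_address` are assumptions chosen for illustration, as is the simplification of one entry per block.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16  # bytes per prefetch block (assumed)


@dataclass
class BlockEntry:
    branch_offset: int   # where within the block the taken branch resides
    target_address: int  # where prefetching should be redirected


def next_prefetch_address(table, fetch_address):
    """Decide where to prefetch next, given the address at which a block is entered.

    table maps a block's base address to its BlockEntry; the presence of an
    entry indicates that the block contains a previously taken branch.
    """
    block_base = fetch_address - fetch_address % BLOCK_SIZE
    entry_offset = fetch_address - block_base
    entry = table.get(block_base)
    # The branch is relevant only if the block is entered at or before it.
    if entry is not None and entry.branch_offset >= entry_offset:
        return entry.target_address
    return block_base + BLOCK_SIZE  # otherwise continue sequentially
```

Entering the block beyond the recorded branch (for instance, as the target of some other branch) leaves the entry irrelevant, and prefetching continues with the next sequential block.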
According to the prior art, when the processor encounters a branch instruction that is found to be taken, it creates a BHT entry based on the address of the branch (the entry itself will contain the branch target address). If the particular section of code (containing the branch) is ever reencountered, then the BHT entry is able to cause prefetching to be redirected at the time that the branch instruction is prefetched. When the BHT redirects prefetching, it also enqueues information regarding this action (e.g., the address at which it "believes" there is a taken branch, and the target address of the branch) at the processor. As the processor subsequently executes the code that has been prefetched, it has three opportunities to determine that the BHT was (or was not) correct. If it is the case that the BHT correctly anticipated the branch, then there is no penalty associated with the branch, otherwise, there may be a severe penalty associated with having "guessed" wrong. The three times at which a BHT error can be discovered are as follows.
The first opportunity is at decode time (DEC) where a "branch wrong guess" (BWG) can manifest itself in one of two ways. First, if the decoder encounters an unconditionally taken branch, and the BHT has given no indication of this branch, then it is known that the BHT is wrong. The appropriate action at this point is to execute the branch in the canonical way, and to create a new BHT entry to indicate the presence of the branch. Second, if the BHT has indicated a taken branch at a given address, and the instruction that is decoded at this address is not a branch instruction, then it is known that the BHT is in error. The appropriate action at this point is to delete the offending entry from the BHT, and to abort the redirection in instruction prefetching that was effected by the presence of the entry. (Note that in this latter case, the BHT may have caused cycles of penalty to be incurred via redirection of instruction prefetch when there was no branch instruction in the code.)
The second opportunity to detect a BWG is at address-generation time (AGEN). A BWG manifests itself if the target address that is generated is not the same as the target address that was predicted (and enqueued at the processor) by the BHT. The appropriate action at this point is to correct the target address in the BHT entry, to abort the instruction prefetching that was directed down the erroneous target path, and to redirect instruction prefetching down the correct target path.
The third and final opportunity to detect a BWG is at execute time (EXEC). The only branches that can possibly cause a BWG at this point are conditional branches, since the resolution of the branch condition is performed during EXEC. A BWG occurs if EXEC determines that the branch is taken when the BHT gave no indication of such, or if EXEC determines that the branch is not taken when the BHT indicated that the branch would be taken. In either case, the appropriate action is to update the BHT to indicate the new action of the branch, and to redirect instruction prefetch in accordance with the new action.
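The three detection points just described can be summarized as a single decision procedure. The sketch below is illustrative only; the function `detect_bwg` and its argument names are hypothetical, and the corrective actions (updating the BHT and redirecting prefetch) are omitted.

```python
def detect_bwg(predicted_target, is_branch, unconditional, actual_target, actual_taken):
    """Return the stage ("DEC", "AGEN" or "EXEC") at which a BWG is found, or None.

    predicted_target is the target address enqueued by the BHT, or None if the
    BHT gave no indication of a taken branch at this address.
    """
    bht_says_taken = predicted_target is not None

    # DEC: the BHT indicated a branch where the decoded instruction is not one.
    if bht_says_taken and not is_branch:
        return "DEC"
    if not is_branch:
        return None
    # DEC: an unconditionally taken branch that the BHT did not indicate.
    if unconditional and not bht_says_taken:
        return "DEC"

    # AGEN: the generated target differs from the predicted target.
    if bht_says_taken and predicted_target != actual_target:
        return "AGEN"

    # EXEC: only a conditional branch can still disagree with the BHT here.
    if not unconditional and bht_says_taken != actual_taken:
        return "EXEC"
    return None
```

The checks are ordered as the pipeline stages are, so that each wrong guess is attributed to the earliest stage at which it could be discovered.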
The primary causes for BWG at these three different points are as follows:
(1) BWG arises at DEC for three reasons. First, in code that has never been previously encountered, there is no history available for the code. Thus, unconditionally taken branches are not known to the BHT. There is no way to remove this category of BWG, i.e., if there is no history then there is no way to anticipate the branch. Second, since the BHT is a finite hashed table controlled by some replacement algorithm, valid history can be overwritten by more recently made entries. Third, since the BHT is a finite hashed table, and since (possibly) multiple addresses map into the same BHT entry, "false hits" arise when there is no branch in the current code, but there is a branch instruction at some other address that happens to map into an entry that is shared by the current code. The second and third types of errors can be reduced by making the BHT larger, and by making the hashing function more precise--these solutions are straightforward.
(2) BWG arises at AGEN because there is some subset of branch instructions that do not always branch to the same target address.
(3) Finally, BWG arises at EXEC because the BHT is a history-driven mechanism (i.e., it predicts that a branch instruction will always do what it did the last time), and conditional branches do not always behave in this way. The very fact that a branch is conditional indicates that there are some (possible) conditions that will cause the branch to be taken, and some (possible) conditions that will cause the branch to be not-taken, even if one of the sets of conditions is unlikely.
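The EXEC-time wrong guesses of item (3) can be illustrated by counting how often a purely history-driven guess ("the branch will do what it did the last time") disagrees with a sequence of actual outcomes. The function name and the example sequences are assumptions made for this illustration.

```python
def count_wrong_guesses(outcomes, initial_guess=False):
    """Count wrong guesses made by a 'same as last time' predictor.

    outcomes is a sequence of booleans (True = taken) for one conditional
    branch; initial_guess is the guess used when there is no history.
    """
    guess = initial_guess
    wrong = 0
    for taken in outcomes:
        if guess != taken:
            wrong += 1
        guess = taken  # history: remember only the most recent outcome
    return wrong
```

A branch that is always taken costs a single wrong guess on its first encounter, whereas a branch that strictly alternates between taken and not-taken is guessed wrong on every encounter in this model.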
U.S. Pat. No. 4,763,245 to Emma et al., assigned to the same assignee as the present invention, teaches a mechanism that reduces this last form of error. U.S. Pat. No. 4,763,245 is hereby incorporated by reference.
The branch prediction mechanism taught in the incorporated patent employs a BHT that is updated using an operand sensitive branch table referred to as a "Data Dependent Branch Table" (DDBT). The principle can be illustrated via the following example. Consider a segment of code that is run several times in succession (not necessarily contiguous in time). A history-based mechanism such as the DHT or the BHT will predict that each specific branch instruction within the code will do exactly the same thing (be taken or be not-taken) each time that the code is run. Although this type of guess works extremely well for many of the specific branches within the code, there are some specific branches for which this forms a very bad prediction. If it is not true that a given property of a branch (e.g., whether it is taken) is invariant, then a history-based prediction mechanism cannot be founded on this given property; thus, to build a history-based predictor, it is necessary to identify some property of the branch that is invariant, and to found the predictor on that property.
The DDBT taught in the incorporated patent is predicated on the typical logical operation of a conditional branch instruction. First, there is some instruction that precedes the branch instruction--this preceding instruction performs an operation (say, a test) on an operand (say, a byte in memory), and it sets a condition-code in the processor based on the outcome of the operation. Next, the branch instruction examines the state of the condition-code in the processor, and it branches (or falls through) based on this state. Thus, if a particular branch instruction is first observed to be taken, and is subsequently observed to be not-taken, then there are only two possible reasons for this change: either the operand that is tested by the preceding instruction has been changed by a store instruction since the last time that the branch was encountered, or the preceding instruction is performing the test on a new operand location (e.g., the address of the operand has changed). The "Data Dependent Branch Table" (DDBT) is predicated on the first of these causes, i.e., it is based on an invariance in operand location.
As described in the patent incorporated by reference, the DDBT is a table that keeps track of those operand locations that are known to affect conditional branch instructions. For each known operand location, the DDBT indicates the address of the branch that is affected, as well as the way in which the branch is affected (e.g., the test that is performed on the operand, and the branch condition) and the most recent history of the branch outcome. Whenever a store operation is performed by the processor, the DDBT is searched to determine whether the location to which the store is directed is one of these operand locations. If this search produces a "hit" (i.e., if such an entry is found), then the new value that is being stored is subjected to the test and condition (from the DDBT) to determine whether the new operand value will change the branch outcome. If the determination is that the branch outcome will be changed as a result of the store, then the corresponding BHT entry is changed to reflect this. When the corresponding branch instruction is subsequently encountered, the BHT will make the correct prediction if the test operand location is in fact invariant, otherwise it cannot be determined whether the prediction will be correct.
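The store-driven update described above can be sketched as follows. This is a simplified software model and not the apparatus of the incorporated patent: the BHT is reduced to a mapping from branch address to target address (holding entries only for branches guessed taken), the test and branch condition are folded into one callable, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DDBTEntry:
    branch_address: int                # the branch affected by this operand
    target_address: int                # needed to rebuild a BHT entry
    would_take: Callable[[int], bool]  # test plus branch condition on an operand value
    last_taken: bool                   # most recent history of the branch outcome


def on_store(ddbt, bht, operand_address, new_value):
    """Process one store; return True if the BHT prediction was changed.

    ddbt maps operand address -> DDBTEntry; bht maps branch address -> target.
    """
    entry = ddbt.get(operand_address)  # the DDBT is searched by operand address
    if entry is None:
        return False                   # the store does not affect a known branch
    taken = entry.would_take(new_value)
    if taken == entry.last_taken:
        return False                   # the branch outcome is unchanged
    entry.last_taken = taken
    if taken:
        bht[entry.branch_address] = entry.target_address
    else:
        bht.pop(entry.branch_address, None)
    return True
```

In this model the update is only as good as the assumption that the branch keeps testing the same operand location; a store to any other location leaves the prediction untouched.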
The incorporated patent teaches creating a DDBT entry at the time of a BWG on the part of the BHT. That is, the entry is created at the earliest possible time at which it is known that the branch outcome is not invariant. Since no special attention is paid to the branch prior to the BWG, it is not actually known that the operand location that influences the branch is invariant, i.e., the act of creating the DDBT entry is merely a "guess" that the branch can be predicted in this way. Although this guess is correct for many of the branches that are not invariant in outcome, there are some number of branches that vary both in outcome and in test operand location. Test instructions for these branches test operands from new locations each time they are executed, and thus, focusing on one particular location produces branch guesses that are not related to the operands that will actually determine the subsequent outcomes of the branch. Essentially, subsequent branch guesses are random if the branch has this property and the DDBT is attempting to aid the prediction.
Thus, there is some subset of the branches that are irrelevant to the DDBT; nonetheless, the DDBT as described in the incorporated patent will inevitably try to predict them. Further, a large fraction of these branches are fairly (though not perfectly) predictable via the BHT (or DHT). Therefore, once it is known that the DDBT is not an appropriate mechanism for predicting a particular branch, it is desirable to ignore "corrections" issued by the DDBT, and it is desirable to inhibit that particular branch from creating new DDBT entries on the BWG event. Even if the loss in accuracy is tolerable vis-a-vis irrelevant predictions, one effect that these branches have is to create large numbers of (irrelevant) DDBT entries, thereby overwriting many useful entries and impacting the effectiveness of the DDBT with regard to other branches.
A further problem regarding these entries (even if the history table can be made to "ignore" updates that are effected by the entries) is that the BHT (or DHT) must still take the time to ignore them. That is, when an irrelevant update is issued by the DDBT, it causes, for example, the BHT to be searched which in turn generates superfluous traffic in the BHT impacting the timeliness with which the BHT is able to (accurately) direct prefetching. Therefore, it is also desirable to remove offending entries from the DDBT so as to limit the amount of superfluous traffic through a history table.
The very property that makes a particular DDBT entry undesirable is also the property that makes the particular entry difficult to remove. Specifically, since the DDBT must respond to store traffic (i.e., the DDBT is searched to see whether the operand that is being stored affects a given branch), entries are located in the DDBT based on operand address. A DDBT entry is found to be useless when it is discovered that the test instruction (preceding the branch) performs tests on different operand locations. Thus, by the time this discovery is made, the new operand location associated with the branch has no relation to the old operand location (i.e., there is no record associated with the branch that designates the old operand location), yet the offending entry is stored in the DDBT based on the old operand address. That is, at this point in time it is known that there is an offending entry in the DDBT, but it is unknown where the offending entry is located. This makes it difficult to remove the entry.
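The difficulty can be seen in a simplified software model in which the DDBT is a mapping keyed by operand address. Given only the address of the branch, no key is available for a direct lookup, so the model below must examine every entry. The function name and the reduction of each entry to a bare branch address are assumptions made for this illustration.

```python
def remove_entries_for_branch(ddbt, branch_address):
    """Remove all entries belonging to one branch; return their operand addresses.

    ddbt maps operand address -> branch address. Since the table is organized
    by operand address, the old operand location is not known in advance and
    every entry has to be inspected.
    """
    stale = [operand for operand, branch in ddbt.items() if branch == branch_address]
    for operand in stale:
        del ddbt[operand]
    return stale
```

An exhaustive scan of this kind is easy to write in software, but it illustrates the problem stated above: the information at hand (the branch address) is not the information by which the table is searched (the operand address).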