The presence of branch instructions in the instruction stream has long been an obstacle to achieving high performance in a processor or central processing unit (CPU) of an information handling system. Branch instructions select between a first sequential path of instructions and a second branch path of instructions as the subsequent instructions to be processed by the processor. Branch instructions alter the instruction stream's inherent, straight-line flow of control, and typically cause the CPU to stall while those instructions along the branch path are fetched from a cache or from memory.
Branch instructions introduce additional processing delay in the CPU in three ways. First, some delay occurs between the time a branch instruction is loaded into an instruction queue (IQ) after being fetched from an instruction cache (IC) and the time when it is decoded and found to be a branch. Second, if an instruction is found to be a conditional branch, additional delay is often required for the condition on which the branch is based to be computed. Once that condition has been computed, the branch is said to be resolved. Finally, after a branch has been resolved, and if it is taken, the IC must be accessed and the instructions in the IQ purged and replaced with those instructions which logically follow the branch instruction.
Existing solutions
There are a number of existing solutions for reducing the additional processing delay caused by branch instructions, each of which provides a varying degree of effectiveness. The simplest solution is to do nothing about the additional processing delay and incur the maximum penalty, i.e., the maximum amount of time to process a branch instruction. This approach is depicted in the cycle time line of FIG. 1. In this approach, a branch instruction is fetched from the IC and stored in the IQ during one or more processing clock cycles of the CPU. Once a branch is in the IQ, it takes some time for the CPU to identify the instruction as a branch and execute it. It may take a cycle for the branch to execute. Executing a branch instruction involves three steps. First, the instruction, branch or otherwise, is decoded by the CPU to determine whether it is a branch instruction. Second, if the instruction is a branch instruction, then the outcome of the branch instruction is resolved or predicted. Determining the outcome of the branch instruction means determining whether the branch instruction is "taken." A branch instruction is said to be "taken" when the second branch path of instructions is selected, thereby altering, or jumping from, the first sequential or straight-line path of instructions. If a branch instruction is "not taken," then the instruction stream continues along the first sequential path of instructions. If a branch instruction cannot be resolved, i.e., the outcome cannot be determined, due to an unfulfilled condition, then the outcome of the branch instruction must be predicted. Finally, the target address of the branch instruction, i.e., the address of the instruction to which the branch instruction branches or jumps (the branch instruction's logical successor), is calculated. The target address is the address of the first instruction in the second branch path of instructions.
After branch instruction execution, the branch instruction's logical successor may be fetched and placed in the IQ in the subsequent cycle. If it is determined that the branch has been mispredicted, there are often additional penalty cycles before instructions along the correct path can be fetched and placed in the IQ. Because of its large branch instruction penalty, this scheme is not used in any of today's high-performance processors.
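By way of illustration only, the three execution steps described above for the approach of FIG. 1 may be sketched as follows. The sketch is not taken from any actual processor design. The `Instruction` fields, the fixed four-byte instruction width, the relative-displacement form of the target address, and the `predict_taken` default are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

INSTRUCTION_SIZE = 4  # assumed fixed instruction width in bytes (not from the text)


@dataclass
class Instruction:
    address: int
    opcode: str                       # hypothetical encoding: "branch" or anything else
    displacement: int = 0             # hypothetical relative offset used by branches
    condition: Optional[bool] = None  # None: the condition has not been computed yet


def execute_branch_sequentially(instr: Instruction, predict_taken: bool = True) -> int:
    """Return the address of the logical successor, performing the steps in series."""
    sequential_addr = instr.address + INSTRUCTION_SIZE

    # Step 1: decode the instruction to determine whether it is a branch.
    if instr.opcode != "branch":
        return sequential_addr

    # Step 2: resolve the branch if its condition is known; otherwise predict it.
    if instr.condition is not None:
        taken = instr.condition
    else:
        taken = predict_taken

    # Step 3: calculate the target address (assumed here to be address + displacement).
    target_addr = instr.address + instr.displacement

    return target_addr if taken else sequential_addr
```

Because each step waits on the one before it, the logical successor's address is not known until all three steps have completed, which is the source of the penalty described above.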
To avoid waiting a cycle for the IQ to be filled after a mispredicted branch, both of a branch's possible paths may be fetched simultaneously as depicted in FIG. 2. Once a branch is detected in the IQ, the branch's predicted outcome is determined and instructions along the expected branch path are placed in the IQ while those instructions along the alternate path are placed in an alternate path buffer. If the branch is predicted correctly, then the behavior of this scheme is similar to that of the scheme depicted in FIG. 1. However, if the branch is mispredicted, then fetching both branch paths in parallel allows the IQ to be filled soon after the misprediction is detected by taking the instructions from the alternate path buffer. The main problem with this approach is that it requires an instruction cache port for each branch path, which also limits the number of branches which may be executed in a cycle. Furthermore, fetching alternate branch paths increases design complexity because several alternate path buffers are required to store the large number of alternate paths resulting from multiple outstanding predicted conditional branches. Additional complexity results from the need to load the IQ from any one of these alternate path buffers.
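By way of illustration only, the dual-path scheme of FIG. 2 may be sketched for a single outstanding branch as follows. The instruction cache is modeled as a dictionary from word address to instruction, and the function names, the `count` parameter, and the word-addressed layout are assumptions made for the example rather than features of any actual design.

```python
from typing import Dict, List, Tuple


def fetch_path(icache: Dict[int, str], start: int, count: int) -> List[str]:
    """Read `count` consecutive instructions; stands in for one cache port."""
    return [icache[addr] for addr in range(start, start + count)]


def fetch_both_paths(
    icache: Dict[int, str],
    sequential_addr: int,
    target_addr: int,
    predicted_taken: bool,
    count: int = 2,
) -> Tuple[List[str], List[str]]:
    """Fetch both paths and return (instruction queue, alternate path buffer)."""
    sequential = fetch_path(icache, sequential_addr, count)  # first cache port
    target = fetch_path(icache, target_addr, count)          # second cache port
    if predicted_taken:
        return target, sequential
    return sequential, target


def recover_from_misprediction(iq: List[str], alternate_buffer: List[str]) -> None:
    """Purge the instruction queue and refill it from the alternate path buffer."""
    iq[:] = alternate_buffer
```

The two calls to `fetch_path` correspond to the two instruction cache ports that the scheme requires. Supporting several outstanding predicted branches would require one alternate path buffer per branch, which the sketch does not attempt to model.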
Another approach makes use of a Branch Target Address Cache (BTAC). The cycle time line of this approach is depicted in FIG. 3. The BTAC, which is accessed in parallel with the instruction cache, provides a guess of the address of the instruction to be fetched during the subsequent cycle. Only entries for branches which are known to be taken or predicted to be taken during their next execution are stored in the BTAC. A set of tags in the BTAC determines whether or not a given fetch address is contained in the BTAC. If a BTAC access is a miss, it is assumed that the current set of instructions being fetched from the instruction cache contains either no branches at all, or one or more not-taken branch instructions. If the BTAC access hits, the resulting address is used as the instruction fetch address during the subsequent cycle. This technique avoids a delay between the time a taken branch instruction is loaded into the IQ and the time at which the instructions in the branch path are available. If the subsequent fetch address is mispredicted, it will take one or more additional cycles for the correct instructions to be fetched from the instruction cache. BTAC-based designs suffer from increased design complexity, an increased demand for space for the BTAC array, performance penalties when the BTAC mispredicts outcomes, and difficulty with the BTAC update policy.
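By way of illustration only, the BTAC lookup described above may be sketched as follows. An unbounded dictionary stands in for the tagged BTAC array, and the 16-byte fetch width and the simple update policy (insert an entry when a branch is taken, remove it when not taken) are assumptions made for the example. A hardware BTAC is of finite size, and, as noted above, the choice of update policy is itself a difficulty.

```python
from typing import Dict, Optional


class BranchTargetAddressCache:
    """Toy BTAC: maps a fetch address (the tag) to a predicted next fetch address."""

    def __init__(self) -> None:
        self._entries: Dict[int, int] = {}

    def lookup(self, fetch_addr: int) -> Optional[int]:
        """Return the predicted next fetch address on a hit, or None on a miss."""
        return self._entries.get(fetch_addr)

    def update(self, fetch_addr: int, taken: bool, target_addr: int = 0) -> None:
        """Assumed policy: keep entries only for branches expected to be taken."""
        if taken:
            self._entries[fetch_addr] = target_addr
        else:
            self._entries.pop(fetch_addr, None)


def next_fetch_address(
    btac: BranchTargetAddressCache, fetch_addr: int, fetch_width: int = 16
) -> int:
    """Choose the fetch address for the subsequent cycle."""
    predicted = btac.lookup(fetch_addr)
    if predicted is None:
        # Miss: assume no branches, or only not-taken branches, in this fetch group.
        return fetch_addr + fetch_width
    # Hit: use the stored address as the next fetch address.
    return predicted
```

In hardware the lookup proceeds in parallel with the instruction cache access, so the predicted address is available without first decoding the fetched instructions. The sketch models only the address selection, not that timing.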
In each of the three approaches discussed above, the three steps in the execution of a branch instruction are performed sequentially, or in series. That is, first, the instruction is decoded to determine whether the instruction is a branch instruction, and after this is determined, the branch instruction is predicted if unresolved. After the branch instruction is resolved or predicted, the target address of the branch instruction can be calculated so that the branch instruction's logical successor can be fetched.
In the related patent application, of common assignee herewith, having Ser. No. 08/754,377, an apparatus and method are disclosed which perform the decoding, predicting, resolving, and target address calculating in parallel, or during the same time, so that a plurality of addresses for the logical successors of branch instructions are provided in parallel to the instruction cache. These addresses, also known as potential fetch addresses, are calculated in the early part of a cycle, and the instruction cache is accessed in the latter part of the cycle. Before a directory (a real address directory) associated with the instruction cache can be accessed, the potential fetch addresses must be generated and one of them must be selected since the directory is single ported, i.e., has only a single read port. The potential fetch addresses must also be translated by a single-ported translation shadow array, which converts the potential fetch addresses from effective addresses to real addresses, before one of the fetch addresses is selected. Thus, in each of the three approaches discussed above, accessing of the instruction cache with the fetch address to fetch a branch instruction's logical successor is performed sequentially after execution of the branch instruction. The time it takes to select one of the potential fetch addresses is longer than the time it takes to generate the potential fetch addresses. The critical path through a set associative instruction cache is generally through a directory to a compare circuit that selects which bank will be read out of the cache. This critical path is generally referred to as the "late select path." What is needed is an apparatus and method for minimizing the late select path, i.e., decreasing the time in which a bank is selected out of a set associative instruction cache, for the instruction fetch unit shown and described in the related patent application, Ser. No. 08/754,377, which is incorporated herein by reference.
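By way of illustration only, the late select path through a set associative cache may be sketched as follows. The sketch assumes that every bank is read at the set index while the directory tags for that set are compared against the tag of the real address, and that the compare result then selects one bank's output. The cache geometry (`num_sets`, `line_size`), the list-based directory, and the function name are assumptions made for the example and are not taken from the related application.

```python
from typing import List, Optional


def late_select_read(
    directory: List[List[Optional[int]]],   # directory[set][way] holds a stored tag
    data_banks: List[List[Optional[str]]],  # data_banks[way][set] holds a cache line
    real_addr: int,
    num_sets: int = 4,
    line_size: int = 16,
) -> Optional[str]:
    """Return the selected cache line on a hit, or None on a miss."""
    set_index = (real_addr // line_size) % num_sets
    tag = real_addr // (line_size * num_sets)

    # Every bank is read at the same set index (in hardware, in parallel).
    candidates = [bank[set_index] for bank in data_banks]

    # Directory compare: the matching way selects which bank's output is used.
    for way, stored_tag in enumerate(directory[set_index]):
        if stored_tag == tag:
            return candidates[way]
    return None
```

The directory lookup and tag compare in the final loop correspond to the late select path: the bank outputs cannot be used until the compare completes, so any delay in presenting the selected real address to the directory lengthens the cycle.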