Processor architectures have been improved over time to reduce the amount of time required to process program instructions and to speed up the overall execution of programs. One common processor architecture improvement is the incorporation of one or more cache memories on the processor chip itself. A cache memory is a small high speed memory which stores a copy of some of the information, i.e., program instructions and/or data, also stored in the main memory. Unlike the slow main memory, the cache operates at a high speed which can be equal to the processing speed of the processor. Although cache memories only store a smaller amount of information than the main memory, they tend to provide a dramatic speed up in memory access. This is because cache memories tend to exploit the spatial and temporal locality of reference properties of memory access. The spatial locality of reference property is the likelihood of accessing memory locations adjacent to other recently accessed memory locations. Instructions tend to be executed in short sequences, wherein the individually executed instructions in each sequence are stored in the same order in which they are executed. To exploit the spatial locality of reference property, the cache memory is organized so as to store large subsequences of data words, e.g., 16 byte long subsequences referred to as data lines or blocks. When a block containing an instruction is first fetched and loaded into the cache, the likelihood increases that future data accesses can also be satisfied by the recently fetched block. The temporal locality of reference property is the tendency of repeatedly executing certain instruction sequences by virtue of flow control instructions such as loops, subroutines and branch instructions. 
To exploit the temporal locality of reference property, the cache memory tends to retain each fetched block and preferably only relinquishes (erases) a fetched block if another processor or device desires to write into the data words of the block or if the cache memory runs out of space.
Another technique for increasing processing speed is referred to as "pipelining." In general, the processing of an instruction may require the sequential steps of fetching the instruction, decoding the instruction, fetching the operands of the instruction, executing the instruction and writing back the results of the execution. In a pipelined processor, the processing steps of several instructions are overlapped so as to minimize the delay in executing the instructions in sequence. As an illustration, consider a five stage pipeline with five sequential processing stages for performing the above noted five functions as applied to a sequence of five instructions. Assume that each stage of the pipeline requires one cycle to perform its respective function. Then each of the first, second, third, fourth and fifth instructions are inputted to the pipeline (in particular, the fetching stage of the pipeline) one instruction per cycle. After the fifth instruction is inputted, the first instruction will be in the write back stage, the second instruction will be in the execution stage, the third instruction will be in the operand fetch stage, the fourth instruction will be in the decoding stage and the fifth instruction will be in the fetching stage.
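The five stage illustration above can be sketched in a few lines of Python. This is only a minimal model under the stated assumption that every stage takes exactly one cycle and nothing stalls; the stage names and the function names `stage_occupancy` and `total_cycles` are hypothetical labels chosen for this sketch.

```python
# Stage names follow the five processing steps listed in the text.
STAGES = ["fetch", "decode", "operand fetch", "execute", "write back"]

def stage_occupancy(cycle, num_instructions):
    """Return {stage name: instruction number} during `cycle` (1-based).

    Instruction i (1-based) enters the fetch stage in cycle i and advances
    one stage per cycle, so in cycle c it occupies stage index c - i.
    """
    occupancy = {}
    for instr in range(1, num_instructions + 1):
        stage_index = cycle - instr
        if 0 <= stage_index < len(STAGES):
            occupancy[STAGES[stage_index]] = instr
    return occupancy

def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to finish all instructions in an ideal, stall-free pipeline."""
    return num_stages + num_instructions - 1

# Snapshot in the cycle in which the fifth instruction is inputted.
snapshot = stage_occupancy(cycle=5, num_instructions=5)
for stage in STAGES:
    print(f"{stage}: instruction {snapshot[stage]}")
```

In this idealized model the five instructions complete in 9 cycles, compared with 25 cycles if each instruction had to pass through all five stages before the next could begin.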
To further increase processing performance, multiple pipeline stages, most notably, execution stages, may be provided which can simultaneously operate on different instructions. Such processors are referred to as superscalar processors. Superscalar processors may incorporate an additional technique in which a sequence of instructions may be executed, and results for such instructions may be stored, in a somewhat arbitrary and different order than the strictly sequential order in which the instruction sequence is stored. This is referred to as out-of-order issue and out-of-order completion, respectively.
The ability of a superscalar processor to execute two or more instructions simultaneously depends upon the particular instructions being executed. Likewise, the flexibility in issuing or completing instructions out-of-order can depend on the particular instructions to be issued or completed. There are three types of such instruction dependencies referred to as resource conflicts, procedural dependencies and data dependencies. Resource conflicts occur when two instructions executing in parallel contend to access the same resource, e.g., the system bus. Data dependencies occur when the completion of a first instruction changes the value stored in a register or memory that is later accessed by a later completed second instruction.
Data dependencies can be classified into three types referred to as "true data dependencies," "anti-dependencies" and "output data dependencies". See MIKE JOHNSON, SUPERSCALAR MICROPROCESSOR DESIGN, p. 9-24 (1991). An instruction which uses a value computed by a previous instruction has a "true" (or data) dependency on the previous instruction. An example of an output dependency is, in out-of-order completion, where first and second sequential instructions both assign the same register or memory location to different values and a third instruction that follows the first and second instructions uses the value stored in the register or memory location as an operand. The earlier (first) instruction cannot complete after the later (second) instruction or else the third instruction will have the wrong value. An example of an anti-dependency also occurs in out-of-order execution wherein a later instruction, executed out of order and before a previous instruction, may produce a value that destroys a value used by the previous instruction. As illustrations of true dependency, output dependency and anti-dependency, consider the following sequence of instructions:
(1) R3:=R3 op R5
(2) R4:=R3+1
(3) R3:=R5+1
(4) R7:=R3 op R4
Instruction (2) has a true dependency on instruction (1) since the value stored in R3, to be used as an operand in instruction (2), is determined by instruction (1). Instruction (3) has an anti-dependency on instruction (2) since instruction (3) modifies the contents of register R3. If instruction (3) is executed out of order and before instruction (2) then instruction (2) will use the wrong value stored in register R3 (in particular, the value as modified by instruction (3)). Instructions (1) and (3) have an output dependency. Instruction (1) cannot complete out-of-order and after instruction (3) because the resulting value, as determined by instruction (3), must be the last value stored in register R3, not the resulting value as determined by instruction (1), so that instruction (4) will execute on the correct operand value stored in register R3. The anti-dependencies and output dependencies, which are false dependencies in that no value actually flows between the instructions involved, can be removed using a register renaming technique and a reorder buffer.
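The three kinds of data dependency in the four instruction sequence above can be checked mechanically. The following is a minimal sketch, not a description of any processor's hardware: it compares only register names between one earlier and one later instruction, and it does not account for writes by intervening instructions. The `Instr` tuple and the `classify` function are hypothetical names introduced for illustration.

```python
from collections import namedtuple

# Each instruction is reduced to its destination register and source registers.
Instr = namedtuple("Instr", ["dest", "sources"])

PROGRAM = {
    1: Instr("R3", {"R3", "R5"}),   # (1) R3 := R3 op R5
    2: Instr("R4", {"R3"}),         # (2) R4 := R3 + 1
    3: Instr("R3", {"R5"}),         # (3) R3 := R5 + 1
    4: Instr("R7", {"R3", "R4"}),   # (4) R7 := R3 op R4
}

def classify(earlier, later):
    """Return the set of dependency kinds that `later` has on `earlier`.

    Pairwise register-name comparison only (an assumption of this sketch).
    """
    kinds = set()
    if earlier.dest in later.sources:
        kinds.add("true")     # later reads what earlier writes
    if later.dest in earlier.sources:
        kinds.add("anti")     # later overwrites what earlier reads
    if later.dest == earlier.dest:
        kinds.add("output")   # both write the same register
    return kinds

print(classify(PROGRAM[1], PROGRAM[2]))
print(classify(PROGRAM[2], PROGRAM[3]))
print(classify(PROGRAM[1], PROGRAM[3]))
```

Note that for instructions (1) and (3) the sketch reports an anti-dependency in addition to the output dependency discussed above, because instruction (1) also reads R3 before instruction (3) overwrites it.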
A procedural dependency occurs where execution of a first instruction depends on the outcome of execution of a previous instruction, such as a branch instruction. See MIKE JOHNSON, SUPERSCALAR MICROPROCESSOR DESIGN, p. 57-77 (1991). It is difficult to know with certainty whether or not a particular branch will be taken. For sake of brevity, it is presumed that a branch is an instruction that either causes execution to continue at some pre-specified non-sequential address or allows execution to continue in sequence at the very next sequentially following instruction. In the former case, the branch is said "to have been taken," wherein in the latter case, the branch is said "to have not been taken." Branch instructions can be more complicated including indexed branch instructions, wherein the address to which the execution continues when the branch is taken dynamically varies according to a value stored in memory or in a register. Therefore, it is difficult to know with certainty which sequence of instructions should be executed after a branch instruction.
Branch instructions pose a problem for pipelined processors because they disrupt the sequential flow of instructions. In particular, for pipelining to function optimally, instructions must be inputted to each pipeline stage one instruction per cycle. However, the outcome of a branch instruction, in particular, whether or not the branch will be taken and to what address execution will branch, cannot always be known until after executing the branch instruction. Absent any special provisions, instructions cannot be inputted to the processing pipeline after a branch instruction until after the branch instruction executes. Furthermore, consider that once the branch executes and the required instruction sequence is identified, the required instruction sequence might not be in an instruction cache and must be retrieved from main memory. This incurs a large delay in processing instructions.
To alleviate this problem, a number of branch prediction techniques can be used to predict whether or not a branch will be taken, which techniques can have an accuracy as high as 80%. See U.S. Pat. Nos. 5,163,140, 5,327,547, 5,327,536, 5,353,421, 5,442,756, 5,367,703, 5,230,068 and MIKE JOHNSON, SUPERSCALAR MICROPROCESSOR DESIGN, p. 71-75 (1991). Using a branch prediction technique, a prediction is made as to whether or not a branch will be taken. The sequence of instructions which would be executed if the prediction is correct is fetched and executed. However, any results of such instructions are treated as merely "speculative" until the branch instruction is in fact executed. When the branch instruction is executed, a determination is made as to whether or not the prediction was correct. If the outcome of the branch instruction was correctly predicted, the above-noted "speculative results" may be accepted. However, if the branch was incorrectly predicted, mis-prediction recovery steps are executed including discarding the speculative results and fetching the correct sequence of instructions for execution.
As an example of a branch prediction mechanism, consider the technique disclosed in U.S. Pat. No. 5,442,756. The processor architecture includes two six stage pipelines which each have a prefetch stage, a first decode stage, a second decode stage, an execution stage, a write back stage and a post write back stage. A branch target buffer is provided which operates in parallel with the first decode stage. The branch target buffer has multiple entries. Each entry stores a tag indicative of the address of a branch instruction to which it pertains. For reasons discussed below, the tag is in fact a portion of the address of the instruction which precedes the branch instruction and not the address of the branch instruction itself. Each entry also contains an address field which stores a "target address" or prediction of the address to which execution will branch upon executing the instruction and prediction history information regarding the history or the "takeness" of the branch. Initially, the branch target buffer is empty. When a branch instruction is executed and is taken, information regarding the branch instruction is stored in the branch target buffer. Illustratively, the branch target buffer is organized in a 4-way set associative fashion. The address of the instruction which precedes the branch instruction is therefore divided into an index or "set portion" and a "tag portion." For instance, suppose each address is 32 bits long and 1k sets having four entries each are provided in the branch target buffer. The most significant ten bits of the address may be the set portion and the least significant twenty-two bits of the address may be the tag portion. The set portion is used to retrieve one of four branch target buffer entries corresponding to the set portion of the branch instruction address. The tag portion is then stored in the tag field of the retrieved entry.
The address to which execution branches is stored in the target address field of the retrieved entry. A two bit counter of `11` is stored in the prediction history information to indicate that the branch is "strongly" taken.
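The address split and entry allocation described above can be sketched as follows. This is a minimal illustration using the example figures from the text (32 bit addresses, 1k sets, four entries per set, the most significant ten bits as the set portion). The function names, the dictionary layout of an entry and, in particular, the replacement policy are assumptions of this sketch; the text does not say which of the four entries is chosen.

```python
ADDRESS_BITS = 32
SET_BITS = 10                        # 1k sets need ten bits
TAG_BITS = ADDRESS_BITS - SET_BITS   # remaining twenty-two bits
WAYS = 4                             # 4-way set associative

def split_address(addr):
    """Split a 32-bit address into (set portion, tag portion)."""
    set_index = (addr >> TAG_BITS) & ((1 << SET_BITS) - 1)
    tag = addr & ((1 << TAG_BITS) - 1)
    return set_index, tag

def new_btb():
    """Create an empty branch target buffer: 1k sets of four empty entries."""
    return [[None] * WAYS for _ in range(1 << SET_BITS)]

def allocate(btb, preceding_addr, target_addr):
    """Record a taken branch, keyed by the address of the preceding instruction."""
    set_index, tag = split_address(preceding_addr)
    ways = btb[set_index]
    # Assumed replacement policy: first empty entry, otherwise entry 0.
    way = next((i for i, entry in enumerate(ways) if entry is None), 0)
    # A new entry starts with the two bit counter at 11 ("strongly taken").
    ways[way] = {"tag": tag, "target": target_addr, "counter": 0b11}
    return set_index, way

btb = new_btb()
print(allocate(btb, 0x00400010, 0x00001234))
```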
Each time an instruction in either pipeline reaches the first decoding stage, its address is searched in the branch target buffer to determine if a branch prediction has been made therefor. The search is performed by accessing those branch target buffer entries corresponding to the same set as the address of the instruction decoded in the decoding stage and then by comparing the tag portion of the accessed entries to the tag portion of the address of the decoded instruction. If there is a match, the target address and prediction history information are retrieved and provided to a prefetching stage. If the prediction history bits are `11` ("strongly taken") or `10` ("weakly taken"), the target address is used to retrieve the next instruction for decoding. If the bits are `01` ("weakly not taken") or `00` ("strongly not taken") the target address is not used, and the instruction that sequentially follows the currently decoded instruction is fetched. Likewise, if no matching entry can be found in the branch target buffer, the instruction is presumed not to be a branch instruction or presumed to be a branch instruction for which the branch is not taken. In such a case, the instruction following the currently decoded instruction is fetched.
After the instruction for which the prediction was made is executed, the prediction is verified. If the branch is taken, the two bit counter is increased by one (or maintained at `11` if already at `11`). If the branch is not taken, the two bit counter is decreased by one (or maintained at `00` if already at `00`). Thus, the prediction history of each branch instruction is updated to reflect how frequently the branch was taken in recent executions of the instruction.
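The two bit prediction history counter and the resulting fetch decision can be sketched as follows. This is a minimal model of the behavior described in the text, not the circuitry of the cited patent; the constant and function names, and the dictionary form of a branch target buffer entry, are hypothetical.

```python
STRONGLY_NOT_TAKEN = 0b00
WEAKLY_NOT_TAKEN = 0b01
WEAKLY_TAKEN = 0b10
STRONGLY_TAKEN = 0b11

def predict_taken(counter):
    """Counter values 10 and 11 predict taken; 00 and 01 predict not taken."""
    return counter >= WEAKLY_TAKEN

def update_counter(counter, taken):
    """Saturating update applied once the branch outcome is known."""
    if taken:
        return min(counter + 1, STRONGLY_TAKEN)
    return max(counter - 1, STRONGLY_NOT_TAKEN)

def next_fetch_address(entry, sequential_addr):
    """Choose the next fetch address from a lookup result.

    `entry` is None when no matching entry was found, in which case the
    sequentially following instruction is fetched.
    """
    if entry is not None and predict_taken(entry["counter"]):
        return entry["target"]
    return sequential_addr

# A new entry starts at 11; two not-taken outcomes flip the prediction.
counter = STRONGLY_TAKEN
for outcome in (False, False):
    counter = update_counter(counter, outcome)
print(predict_taken(counter))
```

Because the counter must move through the "weakly" states, a single atypical outcome, such as the final iteration of a loop, does not by itself reverse a strong prediction.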
In the architecture described above, the address of the instruction which precedes the branch instruction, and not the address of the branch instruction itself, is used to store branch information in the branch target buffer. The reason for this pertains to the difficulties imposed by the types of instructions which must be executed. Processors can be classified as having either a complex instruction set computer (CISC) architecture or a reduced instruction set computer (RISC) architecture. RISC architecture processors have instructions which are all the same length. On the other hand, CISC architecture processors may have variable length instructions. For example, the x86 processor instruction set has instructions with lengths of 1-12 bytes (assuming that prefix codes are not counted).
In the above described architecture, the length of each variable length instruction is not known until the instruction is decoded in the decoder stage. In order to be able to input the instruction located at the target address (assuming that the branch instruction is predicted to be taken) into the pipeline stage on the cycle immediately after the branch instruction, the length of the branch instruction must be known. To that end, the prediction for the branch instruction is made when the instruction preceding the branch instruction is in the decoder stage--at which time the branch instruction is in the prefetch stage. Thereafter, on the next cycle, the branch instruction enters the decoder stage and its length is determined. This enables inputting the instruction that begins at the target address into the prefetch stage (using the target address determined in the previous cycle). The problem with this technique is that the branch instruction is not always preceded in its sequence by another instruction such as in the case where the branch instruction is the first instruction in the sequence.
A second more important problem with the above technique is that only a single instruction can be checked per cycle to determine if it is a branch instruction and if it is predicted to be taken. Again, this results because there is no advance information regarding the length of each instruction. Rather, the length of each instruction is not determined until the decoding stage. As a result, branch prediction is not performed in a parallel fashion but rather in a serial fashion thereby degrading the performance of a superscalar processor.
FIG. 1 depicts the architecture of the Pentium™ processor made by Intel™. Two processing pipelines are provided with five stages, namely, the prefetch, first decode, second decode, execution and write back stages. Branch prediction is performed in the decode stage. Only one branch instruction can be predicted per cycle--the branch target buffer can only determine if the very next branch instruction is taken or not taken. Furthermore, only one of the pipelines can execute conditional branch instructions. The possibility of a branch instruction crossing a cache block is checked and such branch instructions are reconstituted in the prefetch buffer. The penalty for mis-predicting a branch is one cycle.
FIG. 2 shows the architecture of Cyrix™'s M1™ processor. Like the Pentium™ processor, the M1™ performs branch prediction in the decode stage and therefore can only predict one branch per cycle. Likewise, checks for branch instructions, and reconstitution thereof, are performed in the prefetch buffer. Furthermore, conditional branch instructions can only be executed in one of the pipelines.
FIG. 3 shows the architecture of Nexgen™'s RISC86™ processor. Unlike the CISC processors, the RISC86™ is a RISC processor which uses a variable number of cycles to execute each instruction. This is illustrated in FIG. 4. Branch prediction is performed in the prefetch stage using a merged branch target buffer and instruction cache called a "branch prediction cache." The branch prediction cache has four fields including, a field for storing a branch instruction address, a target address, a branch history counter and a short sequence of instructions of 24 bytes that begins at the target address. A search address is received and is matched against each branch instruction stored in the branch instruction address field. If a matching branch instruction address is identified, the prediction counter associated with the matching branch instruction address is consulted to determine if the branch is taken. If the branch is predicted to be taken, the short instruction sequence of 24 bytes is retrieved and outputted. A shortcoming of this architecture is that only one instruction can be fetched and decoded per cycle. Thus, branch prediction can be performed on only the single instruction fetched per cycle. This architecture therefore does not support the superscalar execution paradigm according to which the processor can perform branch prediction on multiple instructions each cycle. Checks for branch instructions, and reconstitution thereof, are performed in the prefetch buffer. However, there is no penalty (in terms of lost cycles) for branch mis-prediction.
FIG. 5 shows the architecture of the Advanced Micro Devices™ AMD5K86™ processor. In the AMD5K86™, predecoder bits indicating instruction boundaries are added to each cache block as it is loaded into the prefetch buffer. In addition, the branch target buffer is merged with the instruction cache. Thus, prediction is performed in the fetch stage. Furthermore, the instruction cache itself can be used to determine the next cache block to fetch. However, no checks are provided for branch instructions which cross cache blocks. No penalty is incurred for branch mis-prediction. Furthermore, the AMD5K86™ provides full superscalar support.
FIG. 6 shows the architecture of Intel™'s PentiumPro™ processor. The PentiumPro™ provides full superscalar support. A separate branch prediction stage is provided before the prefetch stage which performs branch prediction. The branch target buffer can examine up to N data words in a cache block per cycle for a taken branch, where N is the number of data words in a cache block. However, there is a one cycle penalty for mis-predicting a branch instruction.
MIKE JOHNSON, SUPERSCALAR MICROPROCESSOR DESIGN, p. 71-75 (1991) discloses an architecture in which the instruction cache is merged with the branch target buffer in a RISC architecture processor. In particular, "fetch information" is associated with each instruction cache block. The "fetch information" includes, amongst other things, a successor index field and a branch block index field. The successor index field indicates the next cache block to be fetched and the first data word of the first instruction within this to-be-fetched cache block at which execution should begin. If a prediction has been made on a branch instruction within the cache block, the successor index field will correspond to a non-sequential cache block. The branch block index field indicates an end point of an instruction sequence within a cache block (if the instruction sequence ends on a data word within the cache block). In this architecture, the successor index field only contains a trailing portion of the address of the next block to be fetched. Each cache block stores a preceding tag portion of its own address. In other words, the successor index alone is not enough to identify the cache block containing the target address of the branch instruction. Rather, the cache blocks must somehow be sequentially ordered in the instruction cache so that the cache block containing the target address succeeds the cache block containing the branch thereto. In such a case, the tag address portion associated with the succeeding cache block containing the target address can simply be concatenated with the successor index in the preceding cache block containing the branch thereto in order to determine the address of the next instruction (in the succeeding cache block) to be executed.
In operation, each cache block is sequentially fetched from the instruction cache and the instructions therein are inputted to a successive stage of an execution pipeline. The successor index of a currently fetched cache block is obtained and concatenated with the tag portion of the next sequentially fetched cache block to produce the address of the next instruction to be executed. Each instruction is sequentially inputted to the next stage of the execution pipeline until the instruction stored at the location indicated by the branch index field is reached. At such a point, execution switches to the instruction, located in the next cache block, at the address formed by concatenating the tag address of the next cache block with the successor index of the current cache block.
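The concatenation of a tag portion with a successor index can be sketched as a shift and an OR. The field width used here is purely hypothetical, since the text does not state how many bits the successor index occupies; the sketch only shows that the tag supplies the leading address bits and the successor index supplies the trailing bits.

```python
# Hypothetical width of the successor index field (not given in the text).
SUCCESSOR_INDEX_BITS = 12

def next_instruction_address(next_block_tag, successor_index):
    """Form an address from the next block's tag and the current block's successor index.

    The tag is the leading portion of the address and the successor index
    is the trailing portion.
    """
    if not 0 <= successor_index < (1 << SUCCESSOR_INDEX_BITS):
        raise ValueError("successor index does not fit in its field")
    return (next_block_tag << SUCCESSOR_INDEX_BITS) | successor_index

print(hex(next_instruction_address(0xABCDE, 0x123)))
```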
This architecture enables performing branch prediction on all instructions in a cache block at once. That is, as soon as a cache block is fetched, the next branch instruction predicted to be taken can be immediately identified and the target address therefor can be immediately determined regardless of whether the branch instruction is the first, second, . . . , or last instruction in the cache block. Of course, this is a simple task in the RISC architecture processor in which the proposed scheme is implemented. The proposed technique is much more difficult in a CISC architecture processor where the instructions, in particular, the branch instructions have a variable length. In CISC architecture processors, there is no guarantee that the beginning or end of a block will be aligned with an instruction beginning or end. Rather, instructions may cross multiple blocks.
To better appreciate this problem, consider the scenarios of instruction sequence storage in cache blocks as illustrated in FIG. 7. Assume that each instruction sequence terminates at a branch instruction, which when executed is taken. (Other branch instructions which are not taken may also be contained within the instruction sequence.) Blocks 10 and 12, corresponding to cache block addresses n and n+1, illustrate the situation where the instruction sequence begins in block n but does not terminate in either block n or n+1. In other words, while the instruction sequence begins in block n, it continues beyond block n+1. No branches are predicted to be taken in blocks n or n+1. Blocks 14 and 16 illustrate the situation where the instruction sequence begins in block n and ends in block n+1, where the branch instruction is entirely contained in block n+1. Blocks 18 and 20 illustrate the situation where the instruction sequence begins in block n and ends on a branch instruction occupying one or more data words at the end of block n and one or more data words at the beginning of block n+1. In this situation, the branch instruction that terminates the instruction sequence is said to "cross multiple cache blocks." Finally, blocks 22 and 24 illustrate the situation where the instruction sequence begins in block n and ends at a branch instruction contained entirely within block n.
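Whether a variable length branch instruction crosses a cache block boundary can be decided from its starting address and length alone. The following is a minimal sketch that assumes byte addressing and the 16 byte block size used as an example earlier in this section; the function names are hypothetical.

```python
# Assumed block size, taken from the 16 byte example given earlier.
BLOCK_SIZE = 16

def block_of(addr):
    """Return the number of the cache block containing byte address `addr`."""
    return addr // BLOCK_SIZE

def crosses_blocks(start_addr, length):
    """True if an instruction of `length` bytes starting at `start_addr`
    has its first and last bytes in different cache blocks."""
    last_byte = start_addr + length - 1
    return block_of(start_addr) != block_of(last_byte)

# A 4 byte branch starting 2 bytes before the end of block 0 spills into block 1.
print(crosses_blocks(14, 4))
# The same branch aligned to the start of block 1 stays within that block.
print(crosses_blocks(16, 4))
```

With fixed length, aligned instructions this check never succeeds, which is why the crossing case does not arise in the RISC scheme discussed above.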
It is an object of the present invention to perform branch prediction for variable length instructions in the prefetch stage.
It is another object of the invention to accommodate searching for branch instructions which may cross multiple cache blocks and retrieving such branch instructions.
It is yet another object of the invention to efficiently pack instructions of multiple sequences end to end without any gaps in a processor which performs branch prediction in the prefetch stage.
It is an additional object of the invention to provide branch prediction in a processor without impeding the superscalar (parallel) processing capabilities of the processor and to increase the number of data words examined in forming a branch prediction.