To understand this invention, it is essential that the basic block characteristic of all computer programs be understood as background. Programs are linearly stored in a storage hierarchy, including in a processor's main memory, which causes all instruction locations therein to have a linear characteristic. However, when these instructions are executed by the processor, the processor is required by the branch instructions in the program to use nonlinear sequencing of these same instructions linearly obtained from the hierarchy. Thus, the execution sequence of all computer programs is determined by the branch instructions contained in each program. The execution of each program breaks up the program into basic blocks, each starting with a target instruction of a branch instruction and ending with a branch instruction which provides the target address beginning the next basic block in the instruction execution sequence of the program. Any basic block may contain any number of instructions, from one instruction (a branch instruction) to a very large number of instructions (such as thousands of instructions). The processor fetches instructions fastest if they are sequentially located in memory, and slowest if the instruction is the target of a branch, requiring the processor to intervene and perform a target address calculation and often go elsewhere in memory to find the instruction, possibly incurring page faults that greatly slow the processing. Often, short basic blocks are responsible for great degradation in the performance of a processor.
An abundance of conditional branches in programs impedes the instruction fetching mechanisms of modern processors. To support execution of a significantly larger number of instructions per cycle, future microprocessors will have to support speculative fetching and execution of a large number of instructions. The method described here can speculatively fetch instructions across multiple conditional branches (with different target addresses) in each cycle, based on dynamic multilevel branch predictions and code reorganization during compilation or instruction cache loading.
Over the last decade, microprocessor performance has increased at the rate of about 60% per year. To keep up this rate of performance improvement, future microprocessors will have to execute (and commit) a significantly larger number of instructions per cycle. Non-numerical workloads are laden with conditional branches, which makes superscalar processor implementation difficult. In one study, it has been shown that average C programs have about 18% of their instructions being conditional branches and C++ programs have about 11% of their instructions being conditional branches (on the RS/6000 platform). These limit the sizes of the basic blocks to only 6 to 10 instructions. This necessitates speculative execution beyond basic blocks.
Most processors employ sophisticated branch prediction mechanisms to predict the path to be taken by a conditional branch and its target address. However, these are used only to predict the outcome of the next conditional branch to be executed. Due to the small sizes of the basic blocks and the increased need for speculation, future microprocessors will have to predict the outcome of multiple branches in a single cycle with a high degree of accuracy and be able to fetch instructions from the target addresses of these branch instructions in a single cycle.
To give a better idea, many path-based correlated dynamic branch prediction algorithms can provide branch prediction with accuracy as high as 97% for many non-numerical workloads (such as SPECint). With such a high accuracy, the outcome of four successive conditional branches can be predicted with an accuracy of 88.5%. Similarly, three successive conditional branches can be predicted with 91.3% accuracy and two successive conditional branches can be predicted with 94.1% accuracy. This means that, with an average basic block size of 6 instructions, the expected number of speculatively executed instructions that end up being on the path actually taken is 28.3 with four levels of branch prediction (versus only 11.8 instructions with a single level of branch prediction).
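The figures above can be reproduced with a short calculation. The following is only a sketch under stated assumptions: each branch is assumed to be predicted independently with the same accuracy, the first basic block is assumed to be always on the correct path, and each later block is counted only when every prediction leading to it is correct. The function names are illustrative and do not appear in the text.

```python
from typing import List


def cumulative_accuracy(accuracy: float, levels: int) -> List[float]:
    """Probability that the first k predictions are all correct, for k = 1..levels.

    Assumes independent predictions with identical per-branch accuracy.
    """
    return [accuracy ** k for k in range(1, levels + 1)]


def expected_useful_instructions(accuracy: float, block_size: int, levels: int) -> float:
    """Expected number of fetched instructions that lie on the correct path.

    The current basic block always counts; the block reached after the k-th
    predicted branch counts with probability accuracy ** k.
    """
    return block_size * (1.0 + sum(cumulative_accuracy(accuracy, levels)))


if __name__ == "__main__":
    for k, p in enumerate(cumulative_accuracy(0.97, 4), start=1):
        print(f"{k} successive branches: {100 * p:.1f}%")
    print(f"four levels: {expected_useful_instructions(0.97, 6, 4):.1f} instructions")
    print(f"one level:   {expected_useful_instructions(0.97, 6, 1):.1f} instructions")
```

Under these assumptions the calculation agrees with the 94.1%, 91.3% and 88.5% accuracies and with the 28.3 and 11.8 instruction counts quoted above.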
As branches get further away from the current point of execution, the accuracy with which they can be predicted decreases. The ability to fetch a large number of instructions can greatly increase the number of instructions that can be executed in a given cycle within the constraints of data, control and structural hazards.
Each computer program is comprised of a set of instructions stored in a permanent memory of a computer system in storage-location sequence, which may also be represented by a virtual-address sequence to a processor fetching the program for executing it. The fetched sequence of instructions is resolved by the processor into an instruction execution sequence which usually differs significantly from the program's storage-location sequence due to branch instructions in the programs.
Thus, programs are generally stored in a permanent computer memory (such as a hard disk) in their virtual-address sequence, which is generally used by processors to transfer portions of a program (such as in page units) into the computer system's random access memory, and then a processor fetches lines of the program in the memory using the virtual-addresses of the program for execution by the processor. The fetched lines often are put into an instruction cache in the executing processor.
Therefore, the instruction execution sequence of a program in the processor does not occur in the program's compiled instruction sequence; instead, the instruction execution sequence is determined by the branch instructions executed in each program, resulting in the program's architected instruction execution sequence. Not pertinent to this invention is the so-called out-of-sequence instruction execution found in some complex processors, which nevertheless must follow the program's architected instruction execution sequence.
Each program's architected execution sequence of instructions is dependent on the data used by a particular execution of the program, and varying data may unpredictably vary a program's execution sequence of instructions. The data will often control whether a branch instruction is taken or not taken. A not-taken branch instruction generates a target address to the next instruction in the virtual-address sequence of the program. A taken branch instruction generates a target address to a non-sequential instruction at any virtual address.
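The dependence of the execution sequence on branch outcomes can be illustrated with a toy model. This is a minimal sketch only: the fixed 4-byte instruction length, the representation of a program as a mapping from address to branch target, and the function names are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional

INSTRUCTION_SIZE = 4  # assumed fixed instruction length in bytes


def next_address(address: int, branch_target: Optional[int], taken: bool) -> int:
    """Return the sequential successor unless the instruction is a taken branch."""
    if branch_target is not None and taken:
        return branch_target
    return address + INSTRUCTION_SIZE


def execution_sequence(program: Dict[int, Optional[int]], start: int,
                       outcomes: Iterable[bool]) -> List[int]:
    """Follow the program from `start`, consuming one outcome per branch.

    `program` maps each address to its branch target, or None for a
    non-branch instruction. Branches met after `outcomes` is exhausted
    are treated as not taken, so the walk always terminates.
    """
    outcome_iter = iter(outcomes)
    sequence = []
    address = start
    while address in program:
        sequence.append(address)
        target = program[address]
        taken = next(outcome_iter, False) if target is not None else False
        address = next_address(address, target, taken)
    return sequence


if __name__ == "__main__":
    # Five instructions stored sequentially; the one at address 4 branches to 16.
    program = {0: None, 4: 16, 8: None, 12: None, 16: None}
    print(execution_sequence(program, 0, [False]))  # storage order
    print(execution_sequence(program, 0, [True]))   # skips addresses 8 and 12
```

The same stored program yields two different execution sequences, depending only on the outcome supplied for the branch.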
The instruction fetch mechanisms in all processors operate fastest as long as they are accessing sequential instructions in the virtual address sequence, because then each next instruction address is generated by merely incrementing the current instruction address to generate the next sequential instruction address, which often is the next sequential instruction in the same line in the processor cache (cache hit) from which it can be immediately provided to the processor for execution.
However, a taken branch may involve the additional overhead of having to fetch the target instruction at a location which is not currently in the cache (a cache miss), which then initiates a fetch cycle for copying another line containing that target instruction from the memory into the cache. It is well known that taken branch instructions slow down the fetch rate of instructions needed for program execution in any processor, due to the taken branch instructions deviating from the virtual-address sequence of instructions in a program.
Consequently, program execution is slowed by taken branch instructions when their target instructions are not fetched until the target address is actually known, which involves delays for fetching additional lines into the processor's I-cache. Thus extra overhead is involved in the processor handling of taken branch instructions, since it introduces additional processing delays during which the processor waits for their target instructions to be received by its execution pipeline. In this manner, taken-branch-instruction handling slows the processing of programs and increases their execution time.
In the prior art, each taken branch instruction initiates a fetch cycle, during which a processor fetches one or more basic blocks. A basic block is comprised of one or more instructions having sequential addresses in memory (real or virtual) in which the last instruction is a branch-type of instruction. A branch-type of instruction is a conditional or unconditional branch or a return instruction or a call instruction. The target address of the branch type of instruction ending a basic block starts a next basic block.
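The basic block definition given above can be sketched as a simple partitioning of a sequentially addressed instruction list. The mnemonics in BRANCH_TYPES and the (address, mnemonic) representation are hypothetical and chosen only for illustration; a real instruction set would need its own decoding.

```python
from typing import List, Tuple

Instruction = Tuple[int, str]  # (address, mnemonic), an assumed representation

# Hypothetical mnemonics standing in for conditional branch, unconditional
# branch, call and return instructions.
BRANCH_TYPES = {"bc", "br", "call", "ret"}


def split_basic_blocks(instructions: List[Instruction]) -> List[List[Instruction]]:
    """Split sequential instructions into blocks that each end in a branch-type instruction."""
    blocks: List[List[Instruction]] = []
    current: List[Instruction] = []
    for instruction in instructions:
        current.append(instruction)
        if instruction[1] in BRANCH_TYPES:
            blocks.append(current)
            current = []
    if current:
        # Trailing instructions with no terminating branch are kept as a final partial block.
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    code = [(0, "add"), (4, "bc"), (8, "ld"), (12, "st"), (16, "br"), (20, "ret")]
    for block in split_basic_blocks(code):
        print([hex(address) for address, _ in block])
```

Note that a single branch-type instruction, such as the final return in the example, forms a basic block of one instruction.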
The prior Collapsing Buffer approach has several disadvantages in determining the instructions within a cache line that should be fetched. First, it requires a highly interleaved branch target buffer (BTB), with the degree of interleaving equal to the number of instructions in the I-cache line. Second, it needs to have an entry in the BTB for all the branches in the cache line from which instructions are being fetched. Third, to create a bit vector representing which instructions should be fetched from within the cache line, it has a series of address comparators (each comparing the previous instruction's address plus four with the target address of the last branch instruction). The number of comparators in the series is equal to the number of instructions within the cache line. This can significantly reduce the processor clock rate.
A Branch Address Cache (BAC) has been proposed, in which each entry stores a portion of the control flow graph. The BAC extends the BTB so that instead of a single branch target, the target and fall-through addresses of multiple branches are stored. Branch targets and fall-through paths are filled in during execution if there is no entry found in the BAC for a fetch address. However, there could be holes in an entry because some of the branches have not been executed. Moreover, the paper does not give a good description of how to eliminate the unused intervening instructions between a branch and its target. An approach similar to the collapsing buffer is needed for this purpose.
The approach presented here does not have any of these drawbacks. Moreover, it uses the hints generated by the compiler or the cache reload logic to achieve high bandwidth fetching across several branches. The compiler-based approach is expected to produce better results: since the compiler can see entire subprograms during code generation, all the path information can be encoded more accurately, and so there will not be any holes in the branch target information that is kept in the fetch history table.
A prior "trace cache" technique for controlling the processor fetching of instructions is proposed in an article by E. Rotenberg, S. Bennett and J. Smith entitled "Trace Cache: a Low Latency Approach to High Bandwidth Fetching", Apr. 11, 1996. In the Rotenberg et al article, a trace cache operates with a "core fetch unit" containing an instruction cache (I-cache). The core fetch unit also contains a branch target buffer (BTB), BTB logic, and a multiple branch predictor. The core fetch unit uses a sequence of fetch cycles to fetch instructions from main memory into its I-cache.
Each fetch cycle may include one or more basic blocks in a currently predicted program path from the branch predictor. The fetch cycle includes all instructions in a path matching a previous path traced and stored in a trace cache line associated with the same program address in the program. The current fetch cycle ends whenever it mismatches with the path stored in the addressed trace cache line. The information representing the current trace cache line is stored in a trace tag cache directory entry associated with the trace cache line.
The trace tag cache is made up of a trace buffer, a trace tag directory, and a trace line-fill buffer and logic. The trace tag directory contains control information.
The length of a trace is limited in two ways--by a number of instructions n and by a number of basic blocks m, of which n is limited by the peak dispatch rate of the processor, and m is limited by the average number of branch predictions per fetch cycle. The control information in each entry in the trace tag directory is: a valid bit, a tag field containing a starting address, a branch flags field having (m-1) bits in which each bit represents a branch within the trace to indicate the path followed after each branch instruction (taken or not taken), a branch mask field for indicating: (1) the number of branches in the associated trace, and (2) whether or not the trace ends in a branch, a trace fall-through address field containing the address of the next fetch if the branch is not taken and a trace target address field alternatively containing the next fetch address if the branch is taken.
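The control information listed above can be pictured as a small record. The following is a hedged sketch rather than the layout used by Rotenberg et al: the field names, the splitting of the branch mask into two fields, and the rule that a trace not ending in a branch continues at its fall-through address are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TraceDirectoryEntry:
    valid: bool                     # valid bit
    tag: int                        # starting address of the trace
    branch_flags: Tuple[bool, ...]  # (m-1) taken/not-taken flags for branches inside the trace
    num_branches: int               # branch mask, part 1: number of branches in the trace
    ends_in_branch: bool            # branch mask, part 2: whether the trace ends in a branch
    fall_through_address: int       # next fetch address if the last branch is not taken
    target_address: int             # next fetch address if the last branch is taken


def next_fetch_address(entry: TraceDirectoryEntry, last_branch_taken: bool) -> int:
    """Select the next fetch address after the trace has been supplied.

    Assumption: when the trace does not end in a branch, fetching simply
    continues at the fall-through address.
    """
    if entry.ends_in_branch and last_branch_taken:
        return entry.target_address
    return entry.fall_through_address


if __name__ == "__main__":
    entry = TraceDirectoryEntry(True, 0x100, (True, False), 3, True, 0x140, 0x200)
    print(hex(next_fetch_address(entry, last_branch_taken=True)))
    print(hex(next_fetch_address(entry, last_branch_taken=False)))
```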
All 16 entries in the Rotenberg et al BTB operate in parallel with the 16 instructions in a selected trace tag cache line to check each instruction therein for being a branch instruction. The branch prediction logic uses a Global Address correlated branch predictor (GAg) and a single pattern history table. BTB logic combines BTB hit information with the branch predictions to produce the next fetch address and to generate a valid instruction bit vector.
The Rotenberg et al trace cache, BTB, and instruction cache are all accessed in parallel while a multiple branch prediction is made by the predictor logic. A trace cache hit requires, for the current trace directory entry, that: (1) the actual fetch address match the tag field, and (2) the branch predictions match the branch flags field. On a trace cache miss, fetching proceeds normally in the conventional manner from the I-cache without using trace cache information. However, during this conventional fetching process, a trace cache entry is generated and put into the trace cache, and a corresponding trace cache directory entry is generated. As instructions are conventionally fetched into the I-cache, each basic block being transferred into an I-cache line is also transferred from the I-cache into a line-fill buffer until m basic blocks or n instructions (equal to a full cache line) are stored in the line-fill buffer. Then, the content of the line-fill buffer is transferred into the current line in the trace cache selected by the current fetch address. Simultaneously, a corresponding trace directory entry is generated, comprising its branch flags, branch mask, and either its fall-through address or its target address as required for the next fetch cycle.
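The hit condition and the fill on a miss described above can be modeled in a few lines. This is a simplified sketch, not the Rotenberg et al hardware: the direct-mapped indexing by address modulo the number of lines, the class and method names, and the representation of instructions as strings are all assumptions, and the line-fill buffer is reduced to a single fill call.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TraceEntry:
    tag: int                        # starting fetch address of the trace
    branch_flags: Tuple[bool, ...]  # path recorded in the trace (True = taken)
    instructions: Tuple[str, ...]   # placeholder for the stored instructions


class SimpleTraceCache:
    """Direct-mapped model holding at most one trace per line (an assumption)."""

    def __init__(self, num_lines: int = 64) -> None:
        self.num_lines = num_lines
        self.lines: List[Optional[TraceEntry]] = [None] * num_lines

    def _index(self, fetch_address: int) -> int:
        return fetch_address % self.num_lines

    def lookup(self, fetch_address: int, predictions: Sequence[bool]) -> Optional[TraceEntry]:
        """Return the trace on a hit, or None on a miss."""
        entry = self.lines[self._index(fetch_address)]
        if entry is None:                     # no valid entry in this line
            return None
        if entry.tag != fetch_address:        # condition (1): address matches tag
            return None
        needed = len(entry.branch_flags)
        if tuple(predictions[:needed]) != entry.branch_flags:  # condition (2)
            return None
        return entry

    def fill(self, entry: TraceEntry) -> None:
        """Install a trace built during conventional fetching, replacing the old line."""
        self.lines[self._index(entry.tag)] = entry


if __name__ == "__main__":
    cache = SimpleTraceCache(num_lines=4)
    cache.fill(TraceEntry(0x100, (True, False), ("i0", "i1", "i2")))
    print(cache.lookup(0x100, [True, False, True]) is not None)  # same path
    print(cache.lookup(0x100, [False, False]) is not None)       # different path
```

In this model, filling a second trace that starts at the same address overwrites the first, so only one path per starting address is retained at a time.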
A downside of a simple trace cache is that for each starting fetch address, only a single trace entry can be stored in the trace cache and a single corresponding entry is stored in the trace directory. Hence, different paths from the same fetch address in a program require different trace cache entries and different corresponding trace directory entries; this can result in a large number of trace entries for each fetch address in a program in which different paths are followed from the address during a program execution. This severely restricts the efficiency of the trace cache due to its inefficient use of the trace cache and trace directory entries, since programs often branch back to a prior-executed instruction and follow a different path therefrom in subsequent iterations through the program.
Thus, the Rotenberg et al system requires a trace cache with an I-cache. The subject invention does not use a trace cache and does not use the system organization found in Rotenberg et al.