Contemporary high-performance processors rely on superscalar, super-pipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs; that is, for executing more than one instruction at a time. In general, these processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch from memory more than one instruction per cycle, and are able to dispatch for execution more than one instruction per cycle subject to dependencies and availability of resources.
The pool of instructions from which the processor selects those that are dispatched at a given point in time is enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier if the resources required by the operation are free, thus reducing the overall execution time of a program. Out-of-order execution exploits the availability of the multiple functional units by using resources otherwise idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in the original sequential order.
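The effect of out-of-order issue described above can be illustrated with a small cycle-count model. This is a minimal sketch under stated assumptions, not a description of any particular processor: the names `Instr` and `simulate` are hypothetical, only true (read-after-write) dependencies are modeled, every instruction writes a distinct register, functional units are assumed to be always free, and the reordering of results back into program order is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str        # register written (assumed unique per instruction)
    srcs: tuple      # registers read
    latency: int     # cycles until the result is available (>= 1)


def simulate(program, width, out_of_order):
    """Return the cycle in which the last result becomes available."""
    # Results of not-yet-issued instructions are unavailable (infinity);
    # registers never written by the program are live-in and ready at cycle 0.
    ready_at = {instr.dest: float("inf") for instr in program}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        issued = 0
        for instr in list(pending):          # oldest instruction first
            if issued == width:
                break
            if all(ready_at.get(src, 0) <= cycle for src in instr.srcs):
                ready_at[instr.dest] = cycle + instr.latency
                finish = max(finish, ready_at[instr.dest])
                pending.remove(instr)
                issued += 1
            elif not out_of_order:
                break                        # in-order: a stall blocks younger instructions
        cycle += 1
    return finish


# Hypothetical four-instruction stream: the add waits on a slow load,
# while the mul and sub are independent of it.
program = [
    Instr("load", "r1", ("r0",), 3),
    Instr("add", "r2", ("r1",), 1),
    Instr("mul", "r3", ("r4", "r5"), 2),
    Instr("sub", "r6", ("r3",), 1),
]

print(simulate(program, width=1, out_of_order=False))  # 7
print(simulate(program, width=1, out_of_order=True))   # 5
```

In this example the out-of-order model issues the independent multiply while the add waits for the load, which is the use of otherwise idle resources that the paragraph above describes.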
The efficiency of out-of-order issuing strategies depends to a large degree on the number of instructions which are available and ready to be issued. Thus, the likelihood of exploiting the available execution resources of the different functional units is highest when a large number of instructions are ready. To enlarge the pool of ready instructions, state-of-the-art processors use several predictive and prefetching techniques, such as branch prediction, branch target prediction, caching and prefetching, value prediction, and store-load bypassing.
Predictive techniques involve hardware, software, or a combination of both. Hardware parts of such implementations are usually referred to as predictor units. These typically comprise at least one history table capturing events of the execution characteristics of a program, logic to determine a likely future event (i.e., make a prediction) based on the execution history stored in said history table, and a trigger which causes a prediction to be made. Triggers can include, but are not limited to, a tag match circuit to determine whether one or more bits in the program counter match one or more bits of at least one tag stored in a history table, or a decode circuit to determine if a prediction should be made based on instruction type, such as performing a branch prediction when a branch is encountered.
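The three parts of a predictor unit named above (history table, prediction logic, and trigger) can be sketched in software as follows. This is an illustrative model only. The class name `BranchPredictor`, the direct-mapped table organization, and the use of 2-bit saturating counters are assumptions chosen for the example and are not taken from the text.

```python
class BranchPredictor:
    """Toy predictor unit: a tagged history table of 2-bit counters."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        # History table: each entry is None or a (tag, counter) pair.
        self.table = [None] * (1 << index_bits)

    def _split(self, pc):
        # Low program-counter bits select the entry, the remaining bits form the tag.
        index = pc & ((1 << self.index_bits) - 1)
        tag = pc >> self.index_bits
        return index, tag

    def predict(self, pc, is_branch):
        """Return True/False for taken/not taken, or None if no prediction is triggered."""
        if not is_branch:                       # decode trigger: only branches are predicted
            return None
        index, tag = self._split(pc)
        entry = self.table[index]
        if entry is None or entry[0] != tag:    # tag match trigger failed
            return None
        return entry[1] >= 2                    # prediction logic: counter in upper half

    def update(self, pc, taken):
        """Record the actual branch outcome in the history table."""
        index, tag = self._split(pc)
        entry = self.table[index]
        if entry is None or entry[0] != tag:
            counter = 2 if taken else 1         # allocate in a weak state
        elif taken:
            counter = min(3, entry[1] + 1)      # saturate at strongly taken
        else:
            counter = max(0, entry[1] - 1)      # saturate at strongly not taken
        self.table[index] = (tag, counter)
```

A caller would invoke `predict` when an instruction is fetched and `update` once the branch resolves; a return value of `None` indicates that neither trigger fired and no prediction was made.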
Software implementations of predictive techniques are usually implemented in the compiler based on execution profiles collected during sample executions of a program. The compiler then uses this execution information to guide optimizations, and to adapt the program code by selecting different optimizations.
In a hybrid hardware/software scheme, the compiler annotates the program with information about the program behavior, e.g., branch prediction outcomes can be communicated with branch instructions such as “branch with high probability”, “branch with low probability”, or “branch with unknown/hard-to-predict probability”. During execution the processor can then use these annotations to guide instruction execution.
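The hybrid scheme can be sketched as two cooperating steps: a compiler-side step that turns profile counts into an annotation, and a processor-side step that consults the annotation. This is a minimal sketch under stated assumptions. The names `Hint`, `annotate`, and `predict_taken` are hypothetical, the 0.9 and 0.1 thresholds are arbitrary illustrative values, and falling back to a dynamic prediction for hard-to-predict branches is one possible policy rather than one stated in the text.

```python
from enum import Enum


class Hint(Enum):
    LIKELY = "branch with high probability"
    UNLIKELY = "branch with low probability"
    UNKNOWN = "branch with unknown/hard-to-predict probability"


def annotate(taken_count, total_count, high=0.9, low=0.1):
    """Compiler side: derive a hint from profile counts (thresholds are assumptions)."""
    if total_count == 0:
        return Hint.UNKNOWN                 # branch never observed during profiling
    ratio = taken_count / total_count
    if ratio >= high:
        return Hint.LIKELY
    if ratio <= low:
        return Hint.UNLIKELY
    return Hint.UNKNOWN


def predict_taken(hint, dynamic_prediction):
    """Processor side: follow a confident hint, otherwise use the hardware predictor."""
    if hint is Hint.LIKELY:
        return True
    if hint is Hint.UNLIKELY:
        return False
    return dynamic_prediction
```

For example, a branch observed taken 98 times out of 100 sample executions would be annotated `Hint.LIKELY`, and the processor-side step would then predict it taken regardless of the dynamic predictor's state.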
While these techniques have generally been very successful in increasing the instruction-level parallelism (ILP) which can be extracted from programs, processors still achieve only a fraction of their peak instruction throughput on typical programs. This is due to performance degrading events, such as branch mispredictions and cache misses. Performance degrading events are concentrated in a small number of static instructions which are not amenable to the current branch prediction and caching strategies.
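The concentration of performance degrading events in a few static instructions can be measured from an event trace. The sketch below is illustrative only: the function name `hot_instructions`, the trace format (one program-counter value per observed event), and the coverage parameter are assumptions, and the trace in the usage note is invented for the example rather than measured.

```python
from collections import Counter


def hot_instructions(event_pcs, coverage=0.9):
    """Return the fewest static instructions (by program counter) that
    together account for at least `coverage` of all recorded events."""
    counts = Counter(event_pcs)
    total = sum(counts.values())
    covered = 0
    hot = []
    for pc, n in counts.most_common():      # most frequent offenders first
        if covered >= coverage * total:
            break
        hot.append(pc)
        covered += n
    return hot
```

As a usage example, for a hypothetical trace in which the instruction at address 0x10 causes 95 cache misses, 0x20 causes 3, and 0x30 causes 2, calling `hot_instructions` with the default coverage returns only `[0x10]`: a single static instruction accounts for at least 90% of the events.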
In “Optimizations and Oracle Parallelism with Dynamic Translation”, Proc. of the 32nd International Symposium on Microarchitecture, November 1999, Ebcioglu, et al evaluate the performance potential of microprocessors when perfect prediction is available. Abraham, et al, “Predictability of Load/Store Instruction Latencies”, Proc. of the 26th International Symposium on Microarchitecture, December 1993, demonstrates that a fraction of static instructions are responsible for the majority of cache misses. Zilles and Sohi analyze the instructions which lead up to performance degrading events in “Understanding the Backward Slices of Performance Degrading Instructions”, Proc. of the International Symposium on Computer Architecture, 2000.
The problems of efficient program execution are known in the computer architecture arts and there is a sizable scientific and patent literature dealing with such issues. A sampling of relevant related art in reducing the impact of performance degrading operations is now set forth.
The article by Annavaram, et al “Data Prefetching by Dependence Graph Precomputation”, Proc. of the International Symposium on Computer Architecture, 2001, describes the prefetching of data into a cache by exploiting address precomputation. To identify accurately the addresses of data which should be prefetched, the authors describe the computation of the dependence graph of the address generation.
Srinivasan et al, “Locality vs. Criticality”, Proc. of the International Symposium on Computer Architecture, 2001, attempts to reduce the effective cache miss penalty by identifying performance-critical load instructions and maintaining them in a special cache to reduce latency which cannot be covered by instruction scheduling.
U.S. Pat. No. 5,864,341 entitled “Instruction dispatch unit and method for dynamically classifying and issuing instructions to execution units with non-uniform forwarding” issued on Jan. 26, 1999 to Hicks et al. describes an apparatus for dynamically classifying and issuing instructions to execution units with non-uniform forwarding. According to this invention, instructions are maintained in a single issue queue, but classified into “buckets” corresponding to different sets of functional units to which instructions can be issued. The apparatus described thus allows operations to be classified in terms of the different functional units to which they can be issued, and allows the issuing of instructions to be prioritized when multiple buckets can issue to the same functional units.
As the above cited sampling shows, none of the presently known techniques addresses the full range of problems associated with efficient execution of workloads, and none teaches the present invention. For instance, U.S. Pat. No. 5,864,341 classifies instructions, but in that scheme instructions cannot be classified in terms of their priority relative to each other to ensure that high-priority instructions (e.g., instructions leading up to a performance degrading event) are issued before lower-priority instructions. Most proposed implementations require significant amounts of specialized hardware to pre-execute the address computations, while already available execution resources may be idle. Such solutions also dissipate unnecessary amounts of power. Many branch prediction schemes address the issue of performance degrading events by increasing the amount of resources applied to the problem. This results in more complex hardware which requires additional area and more design and validation resources, while it is not clear that such schemes will be sufficiently successful in reducing the impact of performance degrading events. Accordingly, it follows that increasing the resources devoted to the probabilistic components of caches and branch predictors is insufficient to solve the issue of performance degrading events. Also, while pre-execution addresses these issues, current solutions based on separate pre-execution function units, or on the use of multithreading capabilities to speculatively assist a main execution thread, are overly expensive. What is needed is a method to reduce the impact of performance-degrading events by pre-executing their backward slices without incurring substantial hardware and execution time overhead.