The present invention relates to techniques for identifying portions of computer programs that are particularly frequently executed. The present invention is particularly useful in dynamic translators needing to identify candidate portions of code for caching and/or optimization.
Dynamic translators translate one sequence of instructions into another sequence of instructions which is executed. The second sequence of instructions are xe2x80x98nativexe2x80x99 instructionsxe2x80x94they can be executed directly by the machine on which the translator is running (this xe2x80x98machinexe2x80x99 may be hardware or this machine may be defined by software that is running on yet another machine with its own architecture). A dynamic translator can be designed to execute instructions for one machine architecture (i.e., one instruction set) on a machine of a different architecture (i.e., with a different instruction set). Alternatively, a dynamic translator can take instructions that are native to the machine on which the dynamic translator is running and operate on that instruction stream to produce an optimized instruction stream. Also, a dynamic translator can include both of these functions (translation from one architecture to another, and optimization).
Caching dynamic translators attempt to identify program hot spots (frequently executed portions of the program, such as certain loops) at runtime and use a code cache to store translations of those frequently executed portions. Subsequent execution of those portions can use the cached translations, thereby reducing the overhead of executing those portions of the program.
A dynamic translator may take instructions in one instruction set and produce instructions in a different instruction set. Or, a dynamic translator may perform optimization: producing instructions in the same instruction set as the original instruction stream; thus, dynamic optimization is a special native-to-native case of dynamic translation. Or, a dynamic translator may do bothxe2x80x94converting between instruction sets as well as performing optimization.
In general, the more sophisticated the execution profiling scheme, the more precise the hot spot identification can be, and hence (i) the smaller the translated code cache space required to hold the more compact set of identified hot spots of the working set of the running program, and (ii) the less time spent translating hot spots into native code (or into optimized native code). Unless special hardware support for profiling is provided, it is generally the case that a more complex profiling scheme will incur a greater overhead. Thus, dynamic translators typically have to strike a balance between minimizing overhead on the one hand and selecting hot spots very carefully on the other.
Depending on the profiling technique used, the granularity of the selected hot spots can vary. For example, a fine-grained technique may identify single blocks (a straight-line sequence of code without any intervening branches), whereas a more coarse approach to profiling may identify entire procedures. Since there are typically many more blocks that are executed compared to procedures, the latter requires much less profiling overhead (both memory space for the execution frequency counters and the time spent updating those counters) than the former. In systems that are doing program optimization, another factor to consider is the likelihood of useful optimization and/or the degree of optimization opportunity that is available in the selected hot spot. A block presents a much smaller optimization scope than a procedure (and thus fewer types of optimization techniques can be applied), although a block is easier to optimize because it lacks any control flow (branches and joins).
Traces offer yet a different set of tradeoffs. Traces (also known as paths) are single-entry multi-exit dynamic sequences of blocks. Although traces often have an optimization scope between that for blocks and that for procedures, traces may pass through several procedure bodies, and may even contain entire procedure bodies. Traces offer a fairly large optimization scope while still having simple control flow, which makes optimizing them much easier than a procedure. Simple control flow also allows a fast optimizer implementation. A dynamic trace can even go past several procedure calls and returns, including dynamically linked libraries (DLLs). This allows an optimizer to perform inlining, which is an optimization that removes redundant call and return branches, which can improve performance substantially.
Unfortunately, without hardware support, the overhead required to profile hot traces using existing methods (such as described by T. Ball and J. Larus in xe2x80x9cEfficient Path Profilingxe2x80x9d, Proceedings of the 29th Symposium on Micro Architecture (MICRO-29), December 1996) is often prohibitively high. Such methods require instrumenting the program binary (inserting instructions to support profiling), which makes the profiling non-transparent and can result in binary code bloat. Also, execution of the inserted instrumentation instructions slows down overall program execution and once the instrumentation has been inserted, it is difficult to remove at runtime. In addition, such method requires sufficiently complex analysis of the counter values to uncover the hot paths in the program that such method is difficult to use effectively on-the-fly while the program is executing. All of these make traditional schemes inefficient for use in a caching dynamic translator.
Hot traces can also be constructed indirectly, using branch or basic block profiling (as contrasted with trace profiling, where the profile directly provides trace information). In this scheme, a counter is associated with the taken target of every branch (there are other variations on this, but the overheads are similar). When the caching dynamic translator is interpreting the program code, it increments such a counter each time a taken branch is interpreted. When a counter exceeds a preset threshold, its corresponding block is flagged as hot. These hot blocks can be strung together to create a hot trace. Such a profiling technique has the following shortcomings:
1. A large counter table is required, since the number of distinct blocks executed by a program can be very large.
2. The overhead for trace selection is high. The reason can be intuitively explained: if a trace consists of N blocks, this scheme will have to wait until N counters all exceed their thresholds before they can be strung into a trace. It does not take advantage of the fact that after the first counter gets hot, the next N-1 counters are very likely to get hot in quick succession, making it unnecessary to bother incrementing them and doing the bookkeeping of the past blocks that have just executed.
According to the present invention, traces are identified as hot on a speculative basis, rather than based on full trace profile data. The series of blocks beginning at a hot start-of-trace condition and continuing until an end-of-trace condition is identified as a hot trace. Such a trace is identified as hot without the need to incur the overhead of actually measuring whether successive blocks have been executed a sufficient number of times to be considered hot.
The identification of what constitutes the trace is accomplished as the trace is executed. A translation of the trace is emitted as the trace is being executed, is available for optimization in a system that performs optimization, and is captured in the code cache.
A particularly useful start-of-trace condition is when the last interpreted branch was backward taken. A useful end-of-trace condition is when one of the following three conditions occurs: (1) the last interpreted branch was backward taken, (2) the number of interpreted branches exceeded a threshold value, or (3) the number of native instructions emitted for the trace exceeded another threshold value.
Thus, according to the present invention, rather than use higher overhead, sophisticated profiling techniques for identifying program hot spots at runtime, profiling need only be done at certain well-defined program addresses, such as the targets of backward taken branches. When such an address gets hot (i.e., its associated counter exceeds a threshold), the very next sequence of executed blocks (or trace) is speculatively chosen as a hot path.
This scheme speculatively selects as a hot trace the very next sequence of interpreted blocks following certain hot branch targetsxe2x80x94in particular, certain branch targets that are likely to be loop headers. Even though this scheme does not involve elaborate profiling, the quality of the traces selected by this technique can be excellent. One can understand why this scheme is effective as follows: sequences of hot blocks are very often correlated; entire paths tend to get hot in a running program, rather than a disconnected set of blocks.
The present invention provides a mechanism for trace selection with reduced profiling overhead.
Another advantage of the present invention is that a trace can be constructed even when it contains indirect branches (branches whose outcomes are known only when the branch is executed, and which cannot be determined by simply decoding the branch instruction). In contrast, it is awkward for trace growing schemes that rely on branch prediction information to deal with indirect branches, because there is no simple way to predict the outcome of such branches.
A further advantage of the invention is that the memory required for the storage of counters is smaller compared to traditional profiling schemes based on branch or basic block counting, because, with the present invention, it is not necessary to keeping track of counts for each block or for each branch.