The present invention relates to techniques for identifying portions of computer programs that are frequently executed. The present invention is particularly useful in dynamic translators needing to identify candidate portions of code for caching and/or optimization.
Dynamic emulation is the core execution mode in many software systems including simulators, dynamic translators, tracing tools and language interpreters. The capability of emulating rapidly and efficiently is critical for these software systems to be effective. Dynamic caching emulators (also called dynamic translators) translate one sequence of instructions into another sequence of instructions which is executed. The second sequence of instructions are xe2x80x98nativexe2x80x99 instructionsxe2x80x94they can be executed directly by the machine on which the translator is running (this xe2x80x98machinexe2x80x99 may be hardware or may be defined by software that is running on yet another machine with its own architecture). A dynamic translator can be designed to execute instructions for one machine architecture (i.e., one instruction set) on a machine of a different architecture (i.e., with a different instruction set). Alternatively, a dynamic translator can take instructions that are native to the machine on which the dynamic translator is running and operate on that instruction stream to produce an optimized instruction stream. Also, a dynamic translator can include both of these functions (translation from one architecture to another, and optimization).
A traditional emulator interprets one instruction at a time, which usually results in excessive overhead, making emulation practically infeasible for large programs. A common approach to reduce the excessive overhead of one-instruction-at-a-time emulators is to generate and cache translations for a consecutive sequence of instructions such as an entire basic block. A basic block is a sequence of instructions that starts with the target of a branch and extends up to the next branch.
Caching dynamic translators attempt to identify program hot spots (frequently executed portions of the program, such as certain loops) at runtime and use a code cache to store translations of those frequently executed portions. Subsequent execution of those portions can use the cached translations, thereby reducing the overhead of executing those portions of the program.
Accordingly, instead of emulating an individual instruction at some address x, an entire basic block is fetched starting from x, and a code sequence corresponding to the emulation of this entire block is generated and placed in a translation cache. See Bob Cmelik, David Keppel, xe2x80x9cShade: A fast instruction-set simulator for execution profiling,xe2x80x9d Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. An address map is maintained to map original code addresses to the corresponding translation block addresses in the translation cache. The basic emulation loop is modified such that prior to emulating an instruction at address x, an address look-up determines whether a translation exists for the address. If so, control is directed to the corresponding block in the cache. The execution of a block in the cache terminates with an appropriate update of the emulator""s program counter and a branch is executed to return control back to the emulator.
As noted above, a dynamic translator may take instructions in one instruction set and produce instructions in a different instruction set. Or, a dynamic translator may perform optimization: producing instructions in the same instruction set as the original instruction stream. Thus, dynamic optimization is a special native-to-native case of dynamic translation. Or, a dynamic translator may do bothxe2x80x94converting between instruction sets as well as performing optimization.
In general, the more sophisticated the hot spot detection scheme, the more precise the hot spot identification can be, and hence (i) the smaller the translated code cache space required to hold the more compact set of identified hot spots of the working set of the running program, and (ii) the less time spent translating hot spots into native code (or into optimized native code). The usual approach to hot spot detection uses an execution profiling scheme. Unless special hardware support for profiling is provided, it is generally the case that a more complex profiling scheme will incur a greater overhead. Thus, dynamic translators typically have to strike a balance between minimizing overhead on the one hand and selecting hot spots very carefully on the other.
Depending on the profiling technique used, the granularity of the selected hot spots can vary. For example, a fine-grained technique may identify single blocks (a straight-line sequence of code without any intervening branches), whereas a more coarse approach to profiling may identify entire procedures. A procedure is a self-contained piece of code that is accessed by a call/branch instruction and typically ends with an indirect branch called a return. Since there are typically many more blocks that are executed compared to procedures, the latter requires much less profiling overhead (both memory space for the execution frequency counters and the time spent updating those counters) than the former. In systems that are performing program optimization, another factor to consider is the likelihood of useful optimization and/or the degree of optimization opportunity that is available in the selected hot spot. A block presents a much smaller optimization scope than a procedure (and thus fewer types of optimization techniques can be applied), although a block is easier to optimize because it lacks any control flow (branches and joins).
Traces offer yet a different set of tradeoffs. Traces (also known as paths) are single-entry multi-exit dynamic sequences of blocks. Although traces often have an optimization scope between that for blocks and that for procedures, traces may pass through several procedure bodies, and may even contain entire procedure bodies. Traces offer a fairly large optimization scope while still having simple control flow, which makes optimizing them much easier than a procedure. Simple control flow also allows a fast optimizer implementation. A dynamic trace can even go past several procedure calls and returns, including dynamically linked libraries (DLLs). This ability allows an optimizer to perform inlining, which is an optimization that removes redundant call and return branches, which can improve performance substantially.
Unfortunately, without hardware support, the overhead required to profile hot traces using existing methods (such as described by T. Ball and J. Larus in xe2x80x9cEfficient Path Profilingxe2x80x9d, Proceedings of the 29th Symposium on Micro Architecture (MICRO-29), December 1996) is often prohibitively high. Such methods require instrumenting the program binary (invasively inserting instructions to support profiling), which makes the profiling non-transparent and can result in binary code bloat. Also, execution of the inserted instrumentation instructions slows down overall program execution and once the instrumentation has been inserted, it is difficult to remove at runtime. In addition, such a method requires sufficiently complex analysis of the counter values to uncover the hot paths in the program that such method is difficult to use effectively on-the-fly while the program is executing. All of these factors make traditional schemes inefficient for use in a caching dynamic translator.
Hot traces can also be constructed indirectly, using branch or basic block profiling (as contrasted with trace profiling, where the profile directly provides trace information). In this scheme, a counter is associated with the Taken target of every branch (there are other variations on this, but the overheads are similar). When the caching dynamic translator is interpreting the program code, it increments such a counter each time a Taken branch is interpreted. When a counter exceeds a preset threshold, its corresponding block is flagged as hot. These hot blocks can be strung together to create a hot trace. Such a profiling technique has the following shortcomings:
1. A large counter table is required, since the number of distinct blocks executed by a program can be very large.
2. The overhead for trace selection is high. The reason can be intuitively explained: if a trace consists of N blocks, this scheme will have to wait until N counters all exceed their thresholds before they can be strung into a trace.
Note also that to avoid unnecessary contextxe2x80x94switches upon exiting from the translation cache, the translations that reside inside the cache are directly interconnected whenever possible. Thus, if a basic block translation terminates with a branch that targets another translation in the cache, the branch is patched to directly jump to the other translation.
Briefly, the present invention comprises, in one embodiment, a method for growing a secondary trace out of a cache of translations for a program during the program""s execution in a dynamic translator, comprising the steps of: maintaining execution counts for translation heads that are executed from a code cache; when an execution count for one of said translation heads exceeds a threshold, designated as a hot translation head, beginning a mode of operation in which, as following code translations are executed from the code cache after the execution of the hot translation head, storing in a history buffer information identifying each of said following code translations in sequence; terminating the storing of information in the history buffer in relation to the hot translation head when a termination condition is met; and linking together the translation head and the sequence of following code translations identified in the history buffer to form a larger code translation.
In a further aspect, the present invention comprises the step of identifying code translations that are translation heads and disabling certain branches that connect the translation head to successor translations.
In yet a further aspect of the present invention, the certain branches that are disabled are direct connections to translation heads that are not second level code translations, where the direct connections are those that originate from either a backwards taken branch of a different code translation or from a second-level code translation.
In a yet further aspect of the present invention, the translation heads are identified as either a target of a backwards taken branch from another code translation or a target of a branch that exits from a second-level translation.
In a yet further aspect of the present invention, the information stored in the history buffer about a code translation is a pointer to the code translation and a pointer to the branch that exited from that code translation.
In yet another aspect, the present invention further comprises the step of, when in the mode of operation in which information is stored in the history buffer, prior to execution of each following code translation after the translation head, disabling the outgoing branches from the following code translation to successor code translations in the code cache, thereby assuring that the mechanism for storing information in the history buffer regains control prior to executing each subsequent code translation.
In yet another aspect of the present invention, the mode of operation in which information is stored in the history buffer includes the steps of (1) storing a pointer to the most recently executed following code translation and a pointer to its exiting branch, (2) re-enabling the disabled branch from the most recently executed following code translation to its successor code translation in the cache, (3) disabling the branch to the next-to-execute following code translation, and (4) jumping to the next-to-execute following code translation.
In yet a further aspect of the present invention, the termination condition is met when either (1) the next following code translation to be executed is identical to the first code translation identified in the history buffer, or (2) the history buffer is full, or (3) the next following code translation is a secondary code translation.
In a further embodiment of the present invention, a dynamic translator is provided for growing a secondary trace out of cache translations for a program during the program""s execution in a dynamic translator, comprising: first logic maintaining execution counts for translation heads that are executed from a code cache; second logic for, when an execution count for one of said translation heads exceeds a threshold, designated as a hot translation head, beginning a mode of operation in which, as following code translations are executed from the code cache after the execution of the hot translation head, storing in a history buffer information identifying each of said following code translations in sequence; third logic for terminating the storing of information in the history buffer in relation to the hot translation head when a termination condition is met; and fourth logic for linking together the translation head and the sequence of following code translations identified in the history buffer to form a larger code translation.
In yet a further embodiment of the present invention, a computer program product is provided, comprising: a computer usable medium having computer readable program code embodied therein for growing a secondary trace out of cache translations for a program during the program""s execution in a dynamic translator comprising first code for maintaining execution counts for translation heads that are executed from a code cache; second code for, when an execution count for one of said translation heads exceeds a threshold, designated as a hot translation head, beginning a mode of operation in which, as following code translations are executed from the code cache following the execution of the hot translation head, storing in a history buffer information identifying each of said following code translations in sequence; third code for terminating the storing of information in the history buffer in relation to the hot translation head when a termination condition is met; and fourth code for linking together the translation head and the sequence of following code translations identified in the history buffer to form a larger code translation.