1. Field of the Invention
This invention relates generally to the selection of code regions for caching in a caching dynamic translator.
2. Description of the Related Art
Dynamic translators are used for directly executing non-native program binaries on a processor. Dynamic translators operate by translating each word of the non-native code into a corresponding word or words of native code for execution by the processor. Dynamic translators are related to binary translators, just-in-time compilers and runtime translators. Translation is generally performed at the time the non-native code word is executed. However, there is a significant level of overhead in performing the translation from non-native to native code. Similar methods can be applied to the optimization of native binary code where optimized native code is generated and executed in place of non-optimized native code.
In order to improve performance, native code translations of frequently executed regions are typically kept in a translated code cache. Subsequent execution references to the non-native code words of these translated regions then execute in the corresponding region of the translated code cache, thus avoiding the overhead of emulation.
FIG. 1 illustrates an architecture for an embodiment of a caching dynamic translator 10. A memory image 22 of the non-native code is stored in memory 20. During execution, each word of the non-native code 22 is read out of memory 20 by an interpreter 30 which emulates the non-native binary code on a native processor for execution. Alternatively, the interpreter may read several words from memory, translate them into native code, and then output the translated code to a native processor for execution. In FIGS. 1 and 2, the arrow indicating "native binary" going "to native processor" represents a combination of the execution of the interpreter program with execution of translated code.
The translated native binary code is also typically stored into a translated code cache 50. When control section 32 of interpreter 30 detects a cache hit for an instruction in translated code cache 50, the translated version of the interpreted native binary is output from the cache for execution by the native processor.
A region selector 40 is often included which manages the content of the code cache 50 and determines which segments of translated code remain in the code cache 50. Subsequent references to the non-native code image 22 will execute the corresponding native code in code cache 50 provided that the corresponding native code has not been replaced.
The region selector 40 typically receives runtime profile data from the interpreter 30 which the region selector uses in selecting regions of translated code that are maintained in the translated code cache. Judicious region selection can improve the hit rate in the translated code cache, but at the cost of higher overhead. The tradeoff between hit rate and selection overhead is a critical part of dynamic translator design.
Existing implementations of dynamic translators use either runtime profile data, such as statistical PC sampling or branch profiling, or call invocation counting in order to identify frequently executed regions of the non-native code. The problem with such methods is that it is hard to trigger an action based on execution rate (i.e., how often a region is executed within a certain time interval); an action can only be triggered based on execution count (i.e., how many times a region has executed thus far). Another problem is that it is difficult to dynamically adjust the degree of profiling done on different program regions: heavy profiling of a very hot region can hurt performance due to the overhead associated with profiling, whereas the same profiling may be inconsequential on a cold region.
For example, the SELF system (described by U. Holzle in "Adaptive optimization for SELF: Reconciling High Performance with Exploratory Programming", PhD Thesis, Stanford University Dept. of Computer Science, August 1994) generates unoptimized native code for a procedure upon first invocation of the procedure, with the procedure prologue containing instrumentation to count the number of invocations. If a counter exceeds a threshold, the corresponding routine is flagged as hot (i.e. it has reached an activity threshold) and, in the case of the SELF system, the hot routine is dynamically re-optimized along with other routines in the call chain.
In the SELF system, an exponential decay technique for region selection is used, wherein the system is periodically interrupted and all the counters corresponding to the cached routines are halved. This attempts to convert the counters into measures of invocation rates rather than invocation counts.
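The effect of periodically halving the counters can be illustrated with a minimal sketch. The class name, method names, routine labels and threshold value below are hypothetical and chosen only for illustration; they are not taken from the SELF system.

```python
from collections import defaultdict


class DecayingCounters:
    """Invocation counters that are halved on every periodic interrupt."""

    def __init__(self, threshold):
        self.threshold = threshold      # hypothetical hotness threshold
        self.counts = defaultdict(int)  # routine name -> decayed count

    def record_invocation(self, routine):
        """Count one invocation; return True if the routine is now hot."""
        self.counts[routine] += 1
        return self.counts[routine] > self.threshold

    def decay(self):
        """Model the periodic interrupt: halve every counter."""
        for routine in self.counts:
            self.counts[routine] //= 2


def demo():
    counters = DecayingCounters(threshold=4)
    # "slow" runs twice per interval for ten intervals (20 invocations in
    # total), but decay keeps its counter below the threshold.
    slow_hot = False
    for _ in range(10):
        for _ in range(2):
            slow_hot = counters.record_invocation("slow") or slow_hot
        counters.decay()
    # "fast" runs six times within a single interval and crosses the threshold.
    fast_hot = any([counters.record_invocation("fast") for _ in range(6)])
    return slow_hot, fast_hot
```

In this sketch a routine with a high total count but a low rate is never flagged, whereas a routine with a high rate is flagged quickly, which is the distinction between invocation count and invocation rate drawn above.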
The runtime profile of a program is used in dynamic translators to focus analysis on those parts of the executing program where greater performance benefit is likely. A runtime profile is a collection of information indicating the control flow path of a program, i.e. which instructions executed and where branches in the execution took place. Program profiling typically counts the occurrences of an event during a program's execution. The measured event is typically a local portion of a program, such as a routine, line of code or branch. Profile information for a program can consist of simple execution counts or more elaborate metrics gathered from hardware counters within the computer executing the program.
One conventional approach to profiling is to instrument the program code by adding profiling probes to the code. Profiling probes are additional instructions which are used to log the execution of a basic block of code containing the probe.
Instrumentation-based methods for gathering profile data tend to be complex and time-consuming. Instrumentation of the code can result in a code size explosion due to the added instructions. The additional probe instructions also slow execution of the code, and a profiled, or instrumented, version of a program can run substantially slower than the original version. Thus, profiling can represent a significant level of overhead in the execution of a program.
Therefore, the need remains for a method of selecting regions for dynamic translation into a code cache which has limited overhead and increases the time spent executing from the code cache.
It is, therefore, an object of the invention to provide a method for selecting active code segments in an executing program having low overhead.
Another object of the invention is to enable dynamic optimization of the code while the code is executing.
An embodiment of a method for selecting active code segments in an executing program, according to the present invention, involves creating a branch history entry for a series of executed code segments, wherein each branch history entry includes a start address and branch history value of one of the segments, storing each branch history entry in a trace buffer, and incrementing a counter corresponding to the start address for each branch history entry in the trace buffer responsive to a selection processing signal. The method then calls for identifying as a hot trace each branch history entry having a start address value with a corresponding counter value which exceeds a threshold, translating the program code segment corresponding to each hot trace into a translated code segment, and storing the translated code segment into a translated code cache.
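By way of illustration only, the counting and threshold steps of this method might be sketched as follows. The function name, the representation of a branch history entry as a (start address, branch history value) tuple, and the clearing of the trace buffer after processing are assumptions made for the sketch; translation and storage into the translated code cache are omitted.

```python
from collections import Counter


def process_trace_buffer(trace_buffer, counters, threshold):
    """Run once per selection processing signal.

    trace_buffer: list of (start_address, branch_history) tuples (assumed layout)
    counters:     Counter keyed by start address, kept across calls
    Returns the distinct entries identified as hot traces.
    """
    # Increment the counter for the start address of every buffered entry.
    for start_address, _history in trace_buffer:
        counters[start_address] += 1

    # Identify as hot each entry whose start address counter exceeds the threshold.
    hot_traces = []
    for entry in trace_buffer:
        if counters[entry[0]] > threshold and entry not in hot_traces:
            hot_traces.append(entry)

    # Assumption: the buffer is emptied once it has been processed.
    trace_buffer.clear()
    return hot_traces
```

For example, with a threshold of 2, a buffer holding three entries for start address 0x100 and one entry for 0x200 would yield only the 0x100 trace as hot. Because counting happens only when the selection processing signal arrives, the executing program itself carries no per-branch counting probes in this sketch.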
An embodiment of a dynamic translator for executing a non-native program, according to the present invention, includes an interpreter configured to receive non-native code words from a non-native code image of the non-native program and interpret the non-native code words by executing native code words. The interpreter is also configured to generate branch history data including a start address and a branch history value for each of a series of traces during execution of the non-native program. The interpreter includes a control section configured to output the start address of a currently executing trace and receive a cache hit signal and a cache miss signal, wherein the control section suspends operation of the interpreter responsive to the cache hit signal and continues operation of the interpreter responsive to the cache miss signal, where the cache miss signal includes a target address and the interpreter continues operation at the target address. A trace buffer is provided which is configured to receive and store the branch history data for the series of traces. The dynamic translator also includes a trace selector configured to receive the branch history data for the series of traces stored in the trace buffer and further configured to receive the non-native code image. The trace selector is configured to count the occurrences of each start address in the branch history data and mark as hot each start address having a count which exceeds a threshold. The trace selector then disassembles and translates the non-native code words for each hot trace into a translated code segment.
A translated code cache is configured to receive and store the translated code segment for each hot trace, where the translated code cache receives the start address of the currently executing trace from the control section of the interpreter and, responsive thereto, generates the cache hit signal if a translated code segment corresponding to the start address resides in the cache and generates the cache miss signal if a translated code segment corresponding to the start address does not reside in the cache. The translated code cache returns an untranslated instruction address as the target address when execution of the translated code segment branches to the untranslated instruction address.
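The hit and miss behavior of the translated code cache can be pictured with a small sketch. Here each translated code segment is modeled, purely as an assumption, by a Python callable that returns the address at which it exits; the class and function names are likewise hypothetical.

```python
class TranslatedCodeCache:
    """Maps a trace start address to its translated code segment."""

    def __init__(self):
        self._segments = {}  # start address -> callable returning an exit address

    def store(self, start_address, segment):
        self._segments[start_address] = segment

    def lookup(self, start_address):
        """Return ("hit", segment) or ("miss", target_address)."""
        segment = self._segments.get(start_address)
        if segment is None:
            return ("miss", start_address)
        return ("hit", segment)


def run_from_cache(cache, start_address):
    """Follow cache hits until an untranslated address is reached.

    The returned address is the target address at which the interpreter
    would continue operation.
    """
    address = start_address
    while True:
        signal, payload = cache.lookup(address)
        if signal == "miss":
            return payload   # interpreter resumes at this target address
        address = payload()  # interpreter suspended; translated code runs
```

In this model, a segment stored at 0x100 that exits to 0x200, followed by a segment at 0x200 that exits to the untranslated address 0x340, causes `run_from_cache(cache, 0x100)` to return 0x340.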
An embodiment of a method for dynamically disassembling an executing program, according to the present invention, includes generating a series of execution traces corresponding to executed instructions of the executing program, where each execution trace includes a start address of the trace and a branch history value of the trace, wherein the branch history value includes a bit corresponding to each branch instruction in the trace and indicates whether the branch instruction branched to its target address or fell through to its subsequent address. The method then calls for identifying a trace for disassembly, sequentially walking through each instruction of a code image of the executing program beginning with the start address of the identified trace until a branch instruction is encountered, and checking the bit of the branch history value of the identified trace that corresponds to the branch instruction. The method then resumes walking through the code image of the executing program at the target address of the branch instruction if the corresponding bit of the branch history value indicates that the branch instruction branched to its target address, and resumes walking through the code image of the executing program at the subsequent address to the branch instruction if the corresponding bit of the branch history value indicates that the branch instruction fell through to the subsequent address.
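The walk through the code image can be sketched on a toy instruction set. Several details below are assumptions made only for the sketch: every instruction occupies one address, the code image is a dictionary of (opcode, target) pairs, the least significant bit of the branch history value corresponds to the first branch in the trace, a set bit means the branch was taken, and the trace ends at its last recorded branch (so at least one branch is required).

```python
def walk_trace(code_image, start_address, branch_history, num_branches):
    """Return the addresses visited by the trace, in execution order.

    code_image:     dict mapping address -> (opcode, target); opcode is
                    "branch" or anything else (hypothetical toy encoding)
    branch_history: integer whose bit i describes the i-th branch
                    (1 = branched to target, 0 = fell through)
    num_branches:   number of branches recorded in the trace (must be >= 1)
    """
    path = []
    address = start_address
    consumed = 0
    while True:
        opcode, target = code_image[address]
        path.append(address)
        if opcode != "branch":
            address += 1  # walk sequentially to the next instruction
            continue
        # Check the history bit that corresponds to this branch instruction.
        taken = (branch_history >> consumed) & 1
        consumed += 1
        if consumed == num_branches:
            return path   # assumed end of trace: last recorded branch
        # Resume at the target if taken, else at the subsequent address.
        address = target if taken else address + 1


# Toy code image used to exercise the walk.
TOY_IMAGE = {
    0: ("op", None),
    1: ("branch", 4),
    2: ("op", None),
    3: ("branch", 0),
    4: ("op", None),
    5: ("branch", 0),
}
```

With this image, a branch history value of 0b01 over two branches visits addresses 0, 1, 4 and 5, while 0b10 visits 0, 1, 2 and 3. The same start address therefore yields different instruction sequences depending on the recorded branch outcomes.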
The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description of several embodiments of the invention which proceeds with reference to the accompanying drawings.