1. Field of the Invention
The present invention generally relates to computing systems employing dynamic compilation, such as dynamic optimization and binary translation techniques, and more particularly to a method and apparatus for transferring control from the execution of a portion of a first representation of a program to a corresponding portion of a second representation of that program.
2. Description of the Related Art
Dynamic optimization is the transformation of a program's description from a first representation in a particular instruction set (usually the original representation as generated by the programmer and the programmer's tools) into a second representation of the program in the same instruction set. The second representation better exploits the characteristics of a given microprocessor implementing said instruction set, given the characteristics of the particular instance of the program. Dynamic optimization employs a number of techniques, including instruction scheduling, speculative execution, common subexpression elimination, code motion, loop optimization, code layout, and so forth. Ebcioglu, Altman, Gschwind, Sathaye, "Optimizations and Oracle Parallelism with Dynamic Translation", ACM/IEEE 32nd International Symposium on Microarchitecture, Haifa, Israel, November 1999, give an overview of such techniques and their use in a system.
Dynamic binary translation is the transformation of a program's description from a first representation in a first instruction set (usually the original representation as generated by the programmer and the programmer's tools) into a second representation of the program in a second instruction set. As with dynamic optimization, the second representation better exploits the characteristics of a given microprocessor implementing said instruction set, given the characteristics of the particular instance of the program. In addition, the second instruction set offers some desirable characteristics to be exploited by the second representation, such as the ability to express parallel operations in long instruction words so as to better exploit the instruction-level parallelism inherent in programs; simplicity of implementation so as to reduce design time, design cost, die size, or power consumption; compatibility with an architecture having a large installed base or with a newly introduced emerging architecture; or any number of other advantages a particular instruction set may have. Gschwind, Altman, Sathaye, Ledak, Appenzeller, "Dynamic and Transparent Binary Translation", IEEE Computer, pages 54-59, March 2000, give an example of a high-performance binary translation system translating the IBM PowerPC (TM) architecture to the BOA long instruction word architecture.
Dynamic optimization and dynamic binary translation are collectively referred to as dynamic compilation.
Traditionally, dynamic compilation systems have consisted of two distinct phases: a first, interpretation phase, in which instructions from a first representation of a program are interpreted once or multiple times, and a second, compilation phase, in which those instructions are compiled into a second representation which is emitted into a pool of code fragments. When the interpretation system discovers an address for which a translation already exists, it suspends interpretive execution and transfers control to the code fragment in the second representation which implements the desired functionality. Code then executes at full speed, possibly transferring from one fragment of compiled code to the next, until a new section of code is discovered which does not yet have a corresponding fragment in the second representation.
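The two-phase scheme described above can be sketched as follows. This is a minimal illustrative sketch, not the implementation of any particular system: the toy instruction format, all names, and the hot-threshold policy for deciding when to compile are assumptions introduced here.

```python
# Minimal sketch of a two-phase dynamic compilation dispatch loop.
# The toy "first representation" is a list of (opcode, operand) pairs.

def interpret_one(program, state, pc):
    """Interpret one instruction; return the next pc, or None on halt."""
    op, arg = program[pc]
    if op == "add":
        state["acc"] += arg
        return pc + 1
    if op == "jnz":                 # decrement counter, loop while nonzero
        state["n"] -= 1
        return arg if state["n"] > 0 else pc + 1
    return None                     # "halt"

def compile_fragment(program, pc):
    """'Translate' the instruction at pc into a host closure, a toy
    stand-in for emitting code in the second representation."""
    op, arg = program[pc]
    if op == "add":
        def frag(state):
            state["acc"] += arg
            return pc + 1
    elif op == "jnz":
        def frag(state):
            state["n"] -= 1
            return arg if state["n"] > 0 else pc + 1
    else:
        def frag(state):
            return None
    return frag

def run(program, state, hot_threshold=2):
    translation_cache, counts = {}, {}
    pc = 0
    while pc is not None:
        frag = translation_cache.get(pc)
        if frag is not None:
            pc = frag(state)        # a translation exists: run it at full speed
            continue
        counts[pc] = counts.get(pc, 0) + 1
        if counts[pc] >= hot_threshold:
            # Hot address: translate it and emit it into the fragment
            # pool; the next iteration dispatches to the new fragment.
            translation_cache[pc] = compile_fragment(program, pc)
        else:
            pc = interpret_one(program, state, pc)  # cold code: interpret
    return state
```

Running the toy program `[("add", 5), ("jnz", 0), ("halt", None)]` with an initial counter of 3 interprets the loop body once, compiles it on the second encounter, and finishes the remaining iterations in the compiled fragments.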
The first, interpretive phase serves a multitude of purposes, such as deciphering the semantics and structure of the program in the first representation, providing an initial means for executing the program while profiling information is gathered to characterize the behavior of the program, and avoiding the translation of infrequently executed code.
The importance of the last purpose of interpretation, namely to provide a filter which selects the instructions which will actually be compiled, should be evident from the fact that it takes tens to hundreds of cycles to interpret a single instruction from the first representation, but thousands of cycles (or more) to compile an instruction from the first representation into its counterpart in the second representation. The execution speed of a single instruction when executed natively is a few cycles (usually 1-5 cycles per instruction (CPI), depending on instruction complexity, workload characteristics, system configuration such as MP characteristics, and the performance of a particular core). After optimization, an instruction from the original representation may execute slightly faster than in the original program, but the improvement usually corresponds to a very low absolute number (usually fractions of a cycle). Thus, to improve overall program execution, every translated instruction must be executed sufficiently often to amortize its own translation cost, as well as other system overheads such as interpretation and system housekeeping. Unless high code reuse is present in an application, dynamic compilation cannot improve program execution performance, and performance may actually deteriorate significantly.
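The amortization argument above can be made concrete with a back-of-the-envelope calculation. The specific cycle counts below are illustrative assumptions drawn from the ranges just given, not measurements of any system:

```python
# Back-of-the-envelope amortization estimate; all numbers are
# illustrative assumptions within the ranges stated in the text.

interpret_cost = 100     # cycles to interpret one instruction once
compile_cost = 5000      # cycles to translate one instruction (thousands or more)
gain_per_exec = 0.5      # cycles saved per native execution (a fraction of a cycle)

# One compilation costs as much as many interpretations:
compilations_per_interpretation = compile_cost // interpret_cost   # 50

# Break-even: how often the translated instruction must execute before
# the accumulated per-execution gain repays the one-time compilation
# cost (other system overheads ignored):
break_even = compile_cost / gain_per_exec                          # 10000.0
```

Under these assumptions an instruction must execute on the order of ten thousand times before its translation pays for itself, which is why low-reuse code should be filtered out by the interpreter rather than compiled.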
Silberman and Ebcioglu, “An Architectural Framework for Supporting Heterogeneous Instruction-Set Architectures”, IEEE Computer, June 1993 introduced a revolutionary concept in which binary translation would be used to actually improve execution speed through the combined use of dynamic binary translation and dynamic optimization. This approach is based on the design of a high-performance VLIW architecture to which instructions from the original architecture are translated. To reduce the interpretation cost, the proposal uses two engines, a native code engine and a migrant engine. In this design, the migrant engine is responsible for compatible execution of legacy code whereas the native engine executes optimized code for a simpler, high performance architecture, e.g., a superscalar or VLIW design. This design uses a switch table which contains the address of code entry points into the native shadow code, giving a correspondence between native and migrant code addresses. When a jump instruction is attempted, the architecture performs a lookup of the migrant target address, to determine if a translation exists for the target address within the native shadow code. If an entry exists, control is transferred to executing the native (shadow) code representation of the program.
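The switch-table lookup performed on a jump in the design described above can be sketched as follows. In hardware the table would be a CAM providing single-cycle lookup; a dictionary stands in for it here, and all addresses and names are illustrative assumptions:

```python
# Sketch of the migrant-to-native switch-table lookup on a jump.
# A dictionary stands in for the hardware CAM; addresses are illustrative.

switch_table = {
    0x1000: 0x8000,   # migrant code entry point -> native shadow code entry
    0x1040: 0x8100,
}

def take_jump(migrant_target):
    """Resolve a migrant jump target against the switch table."""
    native_entry = switch_table.get(migrant_target)
    if native_entry is not None:
        # A translation exists: transfer control to the native shadow code.
        return ("native", native_entry)
    # No translation: continue executing on the migrant engine.
    return ("migrant", migrant_target)
```

For example, `take_jump(0x1000)` transfers control to native code at `0x8000`, while an untranslated target such as `0x2000` stays on the migrant engine.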
This approach is limited by the design constraints of the switch table. To achieve high performance, the switch table has to be implemented using a content-addressable memory (CAM) structure. However, the size (i.e., the number of entries) of a CAM structure is inherently limited, as large numbers of entries lead to slow circuitry which limits processor frequency. If a CAM structure is not used to provide single-cycle table access, the design incurs a significant CPI penalty, in that multiple cycles may be required to implement a branch that includes a switch table lookup.
As has been described, one major aspect of dynamic compilation is the tradeoff between the gains which can be made from dynamic optimization and the overheads which are incurred by the processes of interpretation and compilation. While compilation is a necessary step in a dynamic compilation system, much penalty could be avoided in programs exhibiting significant amounts of code with low reuse by reducing or eliminating the cost of interpretation.
Methods advocated in the past involve replacing the instruction at which a transfer should occur with a special JUMP or BRANCH instruction, but this intrusive approach changes the code, and thereby the expected result for code which also reads its own code as data, e.g., to compute a checksum to ensure program integrity. Non-intrusive approaches therefore clearly have advantages, but heretofore have also had a number of undesirable properties, such as not dealing well with code modifications occurring during program execution, with the handling of code in read-only segments, and with self-referential code.
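The checksum problem mentioned above can be illustrated with a toy example. The byte values and the opcode chosen for the patched-in JUMP are illustrative assumptions, not taken from any particular architecture:

```python
# Toy illustration of why intrusively patching a JUMP into the code
# image is visible to code that reads itself as data (e.g., a checksum
# routine). Byte values and the patched opcode are illustrative.

code = bytearray([0x10, 0x20, 0x30, 0x40])   # original instruction bytes

def checksum(image):
    """Simple one-byte additive checksum over the code image."""
    return sum(image) & 0xFF

original = checksum(code)        # checksum the program computes over itself
code[1] = 0xEB                   # dynamic compiler intrusively patches a JUMP
patched = checksum(code)
# The self-computed checksum no longer matches, so the program can
# observe the modification and misbehave as a result:
assert patched != original
```

A non-intrusive scheme avoids this by leaving the first-representation code image untouched and tracking transfer points elsewhere.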
Thus, systems employing known switch monitors suffer from excessive memory consumption if a switch monitor as described in May, "Mimic: A Fast S/370 Simulator", ACM SIGPLAN 1987 Symposium on Interpreters and Interpretive Techniques, 1987, is used, wherein a switch entry is associated with each migrant instruction address; from massive hardware requirements if all known entries are to be stored in a CAM memory structure; or from slow performance.