1. Field of the Invention
This invention relates to a dynamic binary translator, which may be used in a virtual machine monitor to virtualize a computer system or an emulator that simulates a computer system.
2. Description of the Related Art
Binary translation is a technique that allows the execution of binary codes for a first architecture (the simulated architecture) on a second architecture (the host architecture). Binary translators between different architectures are known as cross-architectural. The two architectures may, however, be identical. In this latter case, binary translation is often used to instrument a program so that running it yields additional information about its execution.
Binary translators provide a performance advantage over software interpreters. Software interpreters simulate in software the fetch-decode-execute cycle of the simulated architecture by reading each instruction one at a time and simulating its execution. Binary translators offer superior performance by taking groups of instructions (or even the entire program) and generating a corresponding sequence that executes directly on the host processor. Binary translators fall into two main categories: static and dynamic.
Static binary translators perform the translation of the original instruction sequence before the execution of the program. In their seminal paper "Binary translation" (Communications of the ACM, Volume 36, 1993), Sites, et al., give a good introduction to the topic of static binary translators. Certain translators, known as closed translation systems, require that all of the instructions that the program eventually executes be known at translation time; one example of such a translator is the binary editing ATOM system described by Alan Eustace and Amitabh Srivastava in "ATOM: A Flexible Interface for Building High Performance Program Analysis Tools," Digital WRL Technical Note 44.
Other binary translators, known as open translation systems, attempt to translate as much of the code as possible and revert to a slower software emulator for the portions that have not been translated. This is the case in the VAX-to-Alpha translator described by Sites, et al., and also in the FX!32 system from Compaq/DEC that translates x86 binaries to Alpha.
Dynamic binary translators perform the translation from an original instruction sequence to a host instruction sequence during the execution of the program. The translated code sequences are then stored in a buffer called the translation cache. The binary translation function is interleaved with the execution of the output of the binary translator. Dynamic binary translators have been used in architectural simulators such as Shade (see Cmelik and Keppel, "Shade: A Fast Instruction-Set Simulator for Execution Profiling," SIGMetrics '94) and machine simulators such as SimOS (see Witchel and Rosenblum, "Embra: Fast and Flexible Machine Simulation," ACM SIGMetrics '96). Dynamic binary translators have also been used to build virtual machine monitors (see Ebcioglu and Altman, "DAISY: Dynamic Compilation for 100% Architectural Compatibility," IBM Research Report #20538). Dynamic binary translators are also used to build fast Java Virtual Machines; in that context, they are sometimes referred to as "just-in-time compilers."
A dynamic binary translator is also used to provide cross-architectural compatibility as described in U.S. Pat. No. 5,832,205 ("Memory Controller for a Microprocessor for Detecting a Failure of Speculation on the Physical Nature of a Component Being Addressed," Kelly, et al.). In the discussion below, this system is referred to as the "Transmeta" system.
Binary translators share one common problem not found in simpler software interpreters: the execution of the translated code sequence can lead to a correct execution of the program only if the original sequence that it emulates has not been modified since its translation. Certain systems, including most static translators and some dynamic translators, assume that such modifications don't occur; these systems ignore the problem rather than solve it. More recent dynamic binary translators such as SimOS, DAISY, and the Transmeta processor, however, guarantee the coherency of the translations that have been generated and are stored in the translation cache. In effect, they solve the translation cache coherency problem, but they do so using a technique referred to below as "conflict detection." This technique has the disadvantage that it leads to poor performance in a wide range of cases. The technique of conflict detection is discussed in greater detail below.
Rather than simulating CPUs by interpreting an instruction sequence one instruction at a time, dynamic binary translators thus translate blocks of instructions into code that, when executed, emulates the execution of the original block. The translated blocks are then stored into a memory buffer for further reuse. The use of binary translation eliminates most of the overhead of software interpretation.
A basic binary translation system according to the prior art is illustrated in FIG. 1. As FIG. 1 illustrates, a conventional binary translation system typically consists of a translator 100, a chaining subsystem 110, an access module 120, and various callout routines. By way of example, two callout routines, for privilege emulation 130 and I/O 132, are illustrated.
The translator converts blocks 140 of instructions received as input from a virtual machine (VM) 142 or other emulated system into a sequence of instructions that run on the host architecture. The generated code (or emitted code) is then stored into a memory buffer known as a translation cache (TC) 150. In FIG. 1, three translated and stored code sequences are illustrated as blocks labeled A, B and C.
Callout routines are functions of the simulation system or virtual machine monitor that can be called by the code emitted by the translator. The translator inserts direct or indirect calls to these functions into the emitted code. Callout routines are used, for example, to emulate certain instructions with complicated semantics. In FIG. 1, code block A is shown as having a callout to both routines 130 and 132.
The chaining subsystem is a mechanism that allows an emitted instruction sequence to directly branch to another emitted sequence, without relying on more than one callout (the callout to the chaining subsystem itself). In FIG. 1, the chaining module 110 is shown as having inserted a branch or "patch" (symbolized by an asterisk *) between code blocks B and C within the TC 150. This is an optimization over a naive implementation in which all emitted code blocks always end with a callout to a routine that looks up the translation of the next basic block.
The idea behind binary translation is the reuse of previously generated translations. The access module 120 determines the location in the TC 150, if one is to be found, of the translation that corresponds to the start of a given instruction sequence. If no translation is found, then the access module transfers control to the translator to generate a translation.
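The interplay between the access module, the translator, and the translation cache can be sketched as follows. This is a minimal illustration only, not the design of any of the cited systems: the toy guest encoding (blocks of "add" and "mul" operations keyed by guest address), the class name ToyTranslator, and all method names are hypothetical, and host code is modeled as a Python closure rather than emitted machine instructions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Op = Tuple[str, int]                          # hypothetical guest operation, e.g. ("add", 5)
GuestBlock = Tuple[List[Op], Optional[int]]   # (operations, guest address of the next block)
Translation = Callable[[Dict[str, int]], Optional[int]]


class ToyTranslator:
    """Toy dispatch loop: look up a translation, translate on a miss, execute."""

    def __init__(self, guest_memory: Dict[int, GuestBlock]) -> None:
        self.guest_memory = guest_memory
        self.translation_cache: Dict[int, Translation] = {}
        self.translations_generated = 0

    def translate(self, addr: int) -> Translation:
        """Stand-in for the translator: turn one guest block into 'host code'."""
        ops, next_addr = self.guest_memory[addr]
        ops = list(ops)  # snapshot of the input taken at translation time

        def emitted(state: Dict[str, int]) -> Optional[int]:
            for name, operand in ops:
                if name == "add":
                    state["acc"] += operand
                elif name == "mul":
                    state["acc"] *= operand
                else:
                    raise ValueError(f"unknown guest operation: {name}")
            return next_addr

        self.translations_generated += 1
        return emitted

    def lookup(self, addr: int) -> Translation:
        """Stand-in for the access module: reuse a cached translation if present."""
        translation = self.translation_cache.get(addr)
        if translation is None:
            translation = self.translate(addr)
            self.translation_cache[addr] = translation
        return translation

    def run(self, entry: int, state: Dict[str, int], max_blocks: int = 1000) -> Dict[str, int]:
        addr: Optional[int] = entry
        executed = 0
        while addr is not None and executed < max_blocks:
            addr = self.lookup(addr)(state)
            executed += 1
        return state


guest = {0: ([("add", 1)], 1), 1: ([("mul", 2)], None)}
translator = ToyTranslator(guest)
first = translator.run(0, {"acc": 0})    # translates both blocks, acc becomes 2
second = translator.run(0, {"acc": 3})   # reuses both translations, acc becomes 8
```

In this sketch every block returns to the dispatch loop, which corresponds to the naive implementation described above; a chaining subsystem would instead let one emitted block branch directly to the next.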
This basic design of a binary translation system relies on the invariance of the code, that is, the instruction sequences, that served as the input to the binary translator. If the content of the program stored in the memory of the simulated system or virtual machine changes during the execution, then the cached translations that emulate the behavior of the modified instructions are typically discarded, effectively forcing the translator to re-translate the instruction sequence a second time.
Early binary translators such as Shade effectively ignored the problem of possible code variance, since it did not occur in the cases for which Shade was designed. More recent systems such as SimOS and DAISY, however, address and solve the problem by detecting inconsistencies and taking appropriate actions in the case of a violation of translation cache coherency.
SimOS includes a MIPS binary translator called Embra. Embra is described in "Embra: Fast and Flexible Machine Simulation," by Witchel and Rosenblum, which is cited above. The Embra simulator needs to simulate in software the memory management unit (MMU) of the simulated system. This means that the simulator must translate each data reference from a virtual address issued by the simulated processor to a physical address. The simulated physical address is then used to index into the simulated memory. Both versions use a data structure called the "quick check" to allow the code emitted by the binary translator to easily and efficiently determine 1) the physical address of a given virtual address, that is, the relocation of virtual addresses; and 2) whether the access is legal. Those skilled in the art will recognize that the quick check may include only a subset of the mappings that are currently in the simulated MMU. If an entry is not in the quick check, then the emitted (dynamically generated) code will do a callout that simulates an exception if the mapping is not in the MMU, and possibly also insert a plurality of entries in the quick check.
The Embra simulator detects conflict violations as follows: Mappings that point to pages that contain at least one byte of input to the current set of cached translations are never inserted into the quick check. All accesses to these pages rely instead on the slower callout mechanism, which, as a side effect, ensures the coherency of the translation cache by discarding conflicting translations. In effect, the quick check data structure acts as a conflict detection module 180.
In contrast to the software-based SimOS solution, the DAISY system uses specific hardware support to mark portions of the physical memory of the virtual machine that contain input to translators so that the processor cannot access them. Although it is unclear from the technical report referenced above, it seems that the proprietary hardware has a notion of memory "unit" and associates one bit for each unit of physical memory. Moreover, the size of the unit seems to be settable, for example to match a PowerPC page size (4 KB), but also smaller granularities down to 1 byte.
The Transmeta chip uses a technique similar to that found in DAISY, namely, the "T" bit of the "translation-lookaside buffer" (TLB). This is described in U.S. Pat. No. 5,832,205.
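The following sketch illustrates quick-check-based conflict detection in the spirit of the description above. It is not Embra's actual code: the class QuickCheckSim, its fields, and the 4 KB page size are assumptions made for illustration, only stores are modeled, and the choice to admit a page to the quick check once its translations have been discarded is likewise an assumption.

```python
from typing import Dict, Iterable, Set

PAGE_SIZE = 4096  # assumed page size for this sketch


class QuickCheckSim:
    """Toy model: pages holding translator input are kept out of the quick check."""

    def __init__(self, mmu: Dict[int, int]) -> None:
        self.mmu = dict(mmu)                        # simulated MMU: virtual page -> physical page
        self.quick_check: Dict[int, int] = {}       # fast-path subset of the MMU mappings
        self.code_pages: Dict[int, Set[str]] = {}   # physical page -> translations using it as input
        self.translation_cache: Dict[str, object] = {}
        self.memory: Dict[int, int] = {}            # simulated physical memory
        self.callouts = 0

    def add_translation(self, key: str, phys_pages: Iterable[int], translation: object) -> None:
        self.translation_cache[key] = translation
        for page in phys_pages:
            self.code_pages.setdefault(page, set()).add(key)
            # Evict any fast-path mapping that points at a page now holding translator input.
            for vpage in [v for v, p in self.quick_check.items() if p == page]:
                del self.quick_check[vpage]

    def store(self, vaddr: int, value: int) -> None:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        ppage = self.quick_check.get(vpage)
        if ppage is None:                           # quick-check miss: take the slow callout
            ppage = self._slow_store_callout(vpage)
        self.memory[ppage * PAGE_SIZE + offset] = value

    def _slow_store_callout(self, vpage: int) -> int:
        self.callouts += 1
        if vpage not in self.mmu:
            raise LookupError("simulated MMU exception: no mapping")
        ppage = self.mmu[vpage]
        if ppage in self.code_pages:
            # Conflict: the store hits translator input, so discard the stale translations.
            for key in self.code_pages.pop(ppage):
                self.translation_cache.pop(key, None)
        # The page holds no translator input at this point, so it may use the fast path.
        self.quick_check[vpage] = ppage
        return ppage


sim = QuickCheckSim({0: 7, 1: 8})
sim.add_translation("blockA", [8], "emitted code for A")
sim.store(0x10, 1)            # data page: one callout, then cached in the quick check
sim.store(0x20, 2)            # same page again: fast path, no callout
sim.store(PAGE_SIZE + 4, 9)   # code page: callout discards "blockA"
```

The cost of this approach is visible in the sketch: every store to a page that shares space with translated code pays for the slow callout, even when the store does not touch the translated instructions themselves.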
As will become clearer from the discussion below, the present invention also incorporates a conflict-detection mechanism. In particular, the preferred embodiment of the invention uses the hardware MMU to detect conflicts. Unlike these systems, however, it does so on conventional hardware in the context of a virtual machine monitor.
What is needed is therefore a binary translation system that more efficiently handles not only the problem of maintaining translation cache coherency, but that also more efficiently addresses and solves the problems that arise due to self-modifying code. This invention provides such an improvement.
The invention provides a system and a method for virtualizing a computer using binary translation. According to the invention, input instruction sequences are converted by binary translation into output instruction sequences that emulate the corresponding input instruction sequences. The input instruction sequences are stored in predetermined pages of a system memory. The output instruction sequences are stored in a translation cache.
The invention maintains coherency of the output instruction sequence with the input instruction sequence by selectively executing either of the following sub-steps: 1) it detects conflicts in the memory pages in which a first set of the input instruction sequences is stored and executes the corresponding output instruction sequences only in the absence of detected conflicts; or 2) it explicitly checks for code invariance, that is, for post-translation changes in a second set of input instruction sequences, by comparing a translation-time copy of each input instruction sequence with the current version of that input instruction sequence before executing the corresponding output instruction sequence.
In the preferred embodiment, the binary translation system according to the invention sets the hardware memory management unit (MMU) of the computer itself to detect memory traces on the memory pages in which the first set of the input instruction sequences is stored.
The step of checking for post-translation changes includes the following sub-steps: a translation-time copy of each input instruction sequence in the second set of instruction sequences is stored; an instruction invariance prelude is appended to each corresponding output instruction sequence and is stored in the translation cache along with that output instruction sequence; and, for each output instruction sequence for which there is an instruction invariance prelude, the instruction invariance prelude is executed before the corresponding output instruction sequence is executed. In the preferred embodiment of the invention, the sub-step of executing the instruction invariance prelude comprises the further sub-steps of comparing the corresponding translation-time copy of the input instruction sequence with the current state of that input instruction sequence, and executing the output instruction sequence only when the current state is the same as the translation-time copy.
In the preferred embodiment of the invention, storing of the translation-time copy of each input instruction sequence is done by encoding the translation-time copy as an immediate operand of compare instructions in corresponding ones of the instruction invariance preludes. Preferably, only the input instruction sequence for which the current state is different from the translation-time copy is reconverted by binary translation.
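The instruction invariance prelude can be illustrated with a small sketch. It is a simplified model under stated assumptions, not the patented implementation: the two-byte toy instruction encoding (an opcode byte followed by an immediate byte), the opcodes OP_ADD and OP_SUB, and the function names are all hypothetical. The translation-time copy is captured as constants inside the generated closure, which loosely mirrors encoding it as immediate operands of compare instructions.

```python
from typing import Callable, Dict, Tuple

OP_ADD, OP_SUB = 0x01, 0x02  # hypothetical toy opcodes

State = Dict[str, int]


def emit_body(code: bytes) -> Callable[[State], None]:
    """Stand-in for the translator: build 'host code' for a toy instruction sequence."""
    pairs = [(code[i], code[i + 1]) for i in range(0, len(code), 2)]

    def body(state: State) -> None:
        for opcode, imm in pairs:
            if opcode == OP_ADD:
                state["acc"] += imm
            elif opcode == OP_SUB:
                state["acc"] -= imm
            else:
                raise ValueError(f"unknown opcode {opcode:#x}")

    return body


def translate_with_prelude(memory: bytearray, start: int, length: int) -> Callable[[State], bool]:
    snapshot = bytes(memory[start:start + length])  # translation-time copy
    body = emit_body(snapshot)

    def translated(state: State) -> bool:
        # Instruction invariance prelude: compare current input with the stored copy.
        if bytes(memory[start:start + length]) != snapshot:
            return False            # input changed since translation; do not execute
        body(state)
        return True

    return translated


def run_block(memory: bytearray, start: int, length: int,
              cache: Dict[Tuple[int, int], Callable[[State], bool]], state: State) -> None:
    key = (start, length)
    translated = cache.get(key)
    # Retranslate only when there is no translation or its prelude reports a change.
    if translated is None or not translated(state):
        translated = translate_with_prelude(memory, start, length)
        cache[key] = translated
        translated(state)


memory = bytearray([OP_ADD, 5, OP_SUB, 2])
cache: Dict[Tuple[int, int], Callable[[State], bool]] = {}
state = {"acc": 0}
run_block(memory, 0, 4, cache, state)   # acc becomes 3
memory[1] = 10                          # the program modifies its own code
run_block(memory, 0, 4, cache, state)   # prelude fails, block is retranslated, acc becomes 11
```

Unlike page-level conflict detection, the check here runs on every execution of the block but costs nothing when unrelated data on the same page is written.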
The invention provides a unique method for efficiently processing code that is self-constant-modifying. For each output instruction sequence for which the current state of the input differs from the translation-time copy, the system also detects whether the sequence is self-constant-modifying. Thereafter, preferably only an input instruction sequence for which the current state of the operational portions of its instructions differs from the corresponding portions in the translation-time copy is reconverted by binary translation.
In the preferred embodiment of the invention, the system checks for run-time invariance of the operational instruction portion by executing a code-invariance prelude that is appended to the output instruction as stored in the translation cache. For each input instruction sequence for which the current state of the operational portions is the same as the corresponding portions in the translation-time copy, the system then executes a constant-updating prelude that is appended to the output instruction and thereby updates the modifiable constant portion of the output instruction by replacing it with the corresponding run-time constant portion of the input instruction in the current state. The system then executes the output instruction using the updated modifiable constant portion.
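The handling of self-constant-modifying code can be sketched in the same toy setting. The encoding is again an assumption made for illustration: each instruction is two bytes, where the opcode byte stands for the operational portion and the immediate byte stands for the modifiable constant portion. The class and method names are hypothetical.

```python
from typing import Dict

OP_ADD, OP_SUB = 0x01, 0x02  # hypothetical toy opcodes


class ConstantUpdatingTranslation:
    """Toy translation that tolerates changes to constants but not to opcodes."""

    def __init__(self, memory: bytearray, start: int, length: int) -> None:
        self.memory, self.start, self.length = memory, start, length
        code = bytes(memory[start:start + length])
        self.opcodes = code[0::2]            # translation-time operational portions
        self.constants = list(code[1::2])    # modifiable constant portions

    def run(self, state: Dict[str, int]) -> bool:
        current = bytes(self.memory[self.start:self.start + self.length])
        # Code-invariance prelude: only the operational portions must be unchanged.
        if current[0::2] != self.opcodes:
            return False                     # real code change; caller must retranslate
        # Constant-updating prelude: refresh constants from the current input.
        self.constants[:] = current[1::2]
        for opcode, imm in zip(self.opcodes, self.constants):
            if opcode == OP_ADD:
                state["acc"] += imm
            elif opcode == OP_SUB:
                state["acc"] -= imm
            else:
                raise ValueError(f"unknown opcode {opcode:#x}")
        return True


memory = bytearray([OP_ADD, 5, OP_SUB, 2])
translation = ConstantUpdatingTranslation(memory, 0, 4)
state = {"acc": 0}
translation.run(state)        # acc becomes 3
memory[1] = 10                # only a constant changes
translation.run(state)        # no retranslation needed, acc becomes 11
memory[0] = OP_SUB            # an opcode changes
ok = translation.run(state)   # returns False, acc stays 11
```

The point of the design is that a program which repeatedly patches an immediate value in its own instructions never triggers retranslation; only a change to what the instructions do forces the translator to run again.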
The binary translator according to the invention preferably switches between conflict detection and code-invariance checking of the input instruction sequences according to a predetermined memory page cost optimization function, which is evaluated for each memory page that contains input instructions.
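The text above does not specify the cost optimization function, so the following is purely a hypothetical illustration of how a per-page choice between the two mechanisms might be made. The statistics tracked, the cost constants, and the decision rule are all assumptions and are not taken from the invention.

```python
from dataclasses import dataclass

# Hypothetical relative costs, in arbitrary units; not taken from the source.
FAULT_COST = 2000    # assumed cost of handling one conflict on a traced page
PRELUDE_COST = 10    # assumed cost of running one invariance prelude


@dataclass
class PageStats:
    """Assumed per-page counters for a page that contains input instructions."""
    conflicts: int = 0          # conflicts detected on the page while it was traced
    block_executions: int = 0   # executions of translations whose input is on the page


def choose_mode(stats: PageStats) -> str:
    """Pick the mechanism with the lower estimated cost for this page."""
    conflict_detection_cost = stats.conflicts * FAULT_COST
    invariance_check_cost = stats.block_executions * PRELUDE_COST
    if conflict_detection_cost <= invariance_check_cost:
        return "conflict_detection"
    return "code_invariance"


quiet_page = PageStats(conflicts=0, block_executions=100)
busy_page = PageStats(conflicts=5, block_executions=100)
modes = (choose_mode(quiet_page), choose_mode(busy_page))
```

Under these assumed costs, a page that is rarely written stays under conflict detection, while a page that mixes frequently written data with code switches to code-invariance checking.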