Dynamic Binary Translation (DBT) has been widely used as a means to run applications created for one instruction-set architecture (ISA) on top of a different ISA. Given the amount of legacy software developed for PCs, which are based on the x86 ISA, much attention has been given to translating x86 to other ISAs. Recent industry trends toward both smaller ultra-mobile PCs and more powerful embedded and mobile internet devices (e.g., smart phones) are blurring the boundaries between these distinct markets. This market convergence is creating great interest in DBT from the ISAs that currently dominate the embedded and mobile-internet-device markets (e.g., ARM (ARM Holdings), MIPS (MIPS Technologies), and PowerPC (Apple-IBM-Motorola alliance)) to x86 (Intel Corporation).
Binary Translation (“BT”) is a general technique to translate binaries built for one source (“guest”) ISA to another target (“host”) ISA. Using BT, it is possible to execute application binaries built for one processor ISA on a processor with a different architecture, with no need to recompile high-level source code or rewrite assembly code. Since most legacy computer applications are available in binary formats only, BT is very attractive due to its potential to allow a processor to execute applications that are not built and available for it. Several successful BT projects have advanced the state of the art over the past decades, including Digital Equipment Corporation's (“DEC”) FX!32, Intel's IA-32 EL, Transmeta's CMS (“Code Morphing Software”), Godson-3 (MIPS architecture), and IBM's DAISY (“Dynamically Architected Instruction Set from Yorktown”). Most of these tools aim at running legacy x86 applications on processors such as Alpha (DEC), Itanium (Intel), Crusoe (Transmeta), and MIPS (MIPS Technologies).
Most of the tools mentioned above use Dynamic BT (DBT), meaning that they perform the translation on-the-fly as the application is executed, i.e., at run time. Alternatively, BT can be performed off-line, i.e., Static BT (SBT). The dynamic usage model is usually preferred because it is more general (e.g., able to handle self-modifying code) and it works transparently to the user with a simple OS change to automatically invoke the DBT for non-native binaries. The main drawback of DBT, compared to SBT, is the overhead. The cycles spent in translating and optimizing an application are cycles that could otherwise be used to actually execute the application code. Therefore, DBT tools face a trade-off between the time spent on translation/optimization and the quality of the resulting code, which in turn is reflected in the execution time of the translated code.
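The trade-off described above can be made concrete with a simple break-even calculation. The following is a minimal sketch under an assumed linear cost model (a fixed extra optimization cost, repaid by a constant per-execution saving); the function name, parameter names, and the cycle counts in the example are illustrative and do not come from any particular DBT system.

```python
from typing import Optional


def break_even_executions(extra_translation_cycles: int,
                          quick_cycles_per_run: int,
                          optimized_cycles_per_run: int) -> Optional[int]:
    """Smallest number of executions at which optimizing a block pays off.

    Assumed model (hypothetical): optimizing costs a one-time
    `extra_translation_cycles`, and each later execution then costs
    `optimized_cycles_per_run` instead of `quick_cycles_per_run`.
    Returns None when the optimized code is not faster at all.
    """
    saving = quick_cycles_per_run - optimized_cycles_per_run
    if saving <= 0:
        return None  # optimization never recovers its cost
    # Ceiling division using integers only.
    return -(-extra_translation_cycles // saving)


if __name__ == "__main__":
    # Illustrative numbers only: 10,000 extra cycles spent optimizing,
    # 50 cycles per run before and 30 cycles per run after.
    print(break_even_executions(10_000, 50, 30))
```

Under this model, a block that executes fewer times than the break-even count is better left with the quick translation, which is why rarely executed code is a poor candidate for aggressive optimization.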
The challenges faced by a DBT are highly dependent on the source and target ISAs. Recently, there has been great interest in expanding the use of the x86 ISA into the ultra-mobile and embedded market segments (e.g., Intel's Atom processor). From a user's perspective, this is very convenient because it may enable legacy PC software to run efficiently on embedded and ultra-mobile platforms. However, for x86 to be adopted in these new domains, it must also be able to execute the enormous software base available in these segments, which is mainly built for the ARM (ARM Holdings), MIPS (MIPS Technologies), and PowerPC (Apple-IBM-Motorola alliance) ISAs. For example, on future x86-based smart phones, besides potentially running PC applications, it would be beneficial to be able to download and seamlessly run ARM-based applications from, e.g., Apple's iPhone App Store. Challenges to enabling this scenario include keeping DBT's performance and energy overheads low.
Although a variety of DBT systems have been proposed, most of them follow the same basic execution flow. First, a binary file created for the source (guest) architecture is loaded into memory. Sections of this source binary are then translated into target (host) binary code. This translation is typically done “on-demand”; in other words, the source (guest) instructions are translated as the flow of control reaches them. Typically, the translation is performed at the granularity of basic blocks, which are sequences of instructions with a single entry and potentially multiple exits. Once a basic block is translated, the translation is kept in a translation cache (also called a code cache) in memory for future reuse. The most aggressive DBT systems apply several levels of optimization. Following Transmeta's CMS and other DBT systems, these optimization levels are termed “gears”. First, a very quick translation (Gear-1) is used. This gear aims at translating very quickly, at the cost of poor quality of the translated code. This trade-off tends to be ideal for rarely executed code, such as OS boot code. In Gear-1, DBTs also insert probes (counters) to detect “hot” (i.e., frequently executed) basic blocks. Once a block becomes hot, it and its correlated surrounding blocks are merged into a region. This region is then retranslated by a higher gear, which applies additional optimizations to the code. The same strategy can be repeated for an arbitrary number of gears. For instance, Transmeta's CMS uses four gears. In effect, a gear-based system ensures that the more a region of code contributes to the total runtime, the more time is spent optimizing it to produce faster code.
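The execution flow described above can be sketched as a small dispatch loop. This is a toy model under stated assumptions, not the implementation of any real DBT: the guest "ISA" (blocks that add constants to one register and then branch), the names GuestBlock, ToyDBT, translate_gear1, and translate_gear2, and the threshold of 3 are all hypothetical. For brevity, a hot block is retranslated on its own rather than being merged with surrounding blocks into a region.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

HOT_THRESHOLD = 3  # assumed value; real systems tune this


@dataclass
class GuestBlock:
    """Hypothetical guest block: add constants to a register, then branch."""
    adds: List[int]
    limit: int                  # branch condition is: reg < limit
    taken: Optional[int]        # next address if condition holds (None = halt)
    fallthrough: Optional[int]  # next address otherwise (None = halt)


Translated = Callable[[int], Tuple[int, Optional[int]]]


def translate_gear1(b: GuestBlock) -> Translated:
    """Quick translation: one host step per guest add, no optimization."""
    def run(reg: int) -> Tuple[int, Optional[int]]:
        for n in b.adds:
            reg += n
        return reg, (b.taken if reg < b.limit else b.fallthrough)
    return run


def translate_gear2(b: GuestBlock) -> Translated:
    """Slower translation: fold all adds into one constant ahead of time."""
    total = sum(b.adds)

    def run(reg: int) -> Tuple[int, Optional[int]]:
        reg += total
        return reg, (b.taken if reg < b.limit else b.fallthrough)
    return run


class ToyDBT:
    def __init__(self, guest: Dict[int, GuestBlock]) -> None:
        self.guest = guest
        self.code_cache: Dict[int, Translated] = {}  # translation cache
        self.gear: Dict[int, int] = {}               # gear used per block
        self.exec_count: Dict[int, int] = {}         # hotness probes
        self.translations = 0

    def run(self, entry: int, reg: int = 0) -> int:
        addr: Optional[int] = entry
        while addr is not None:
            count = self.exec_count.get(addr, 0) + 1
            self.exec_count[addr] = count
            if addr not in self.code_cache:
                # On-demand: translate only when control first reaches addr.
                self.code_cache[addr] = translate_gear1(self.guest[addr])
                self.gear[addr] = 1
                self.translations += 1
            elif count == HOT_THRESHOLD and self.gear[addr] == 1:
                # Block became hot: retranslate it with the higher gear.
                self.code_cache[addr] = translate_gear2(self.guest[addr])
                self.gear[addr] = 2
                self.translations += 1
            reg, addr = self.code_cache[addr](reg)
        return reg


# Guest program: block 0x100 loops until reg >= 10, then 0x200 runs once.
GUEST = {
    0x100: GuestBlock(adds=[1, 2], limit=10, taken=0x100, fallthrough=0x200),
    0x200: GuestBlock(adds=[100], limit=0, taken=None, fallthrough=None),
}

if __name__ == "__main__":
    dbt = ToyDBT(GUEST)
    print(dbt.run(0x100), dbt.gear, dbt.translations)
```

In this run, the looping block at 0x100 executes four times and is promoted to the second gear on its third execution, while the block at 0x200 executes once and stays in the quick gear, so three translations are performed in total.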
The set, number, and aggressiveness of the optimizations applied vary greatly from one DBT to another. Typical optimizations include instruction scheduling, dead-code elimination, and redundancy elimination. As with static compiler optimizations, the set of most relevant optimizations depends on the target architecture. Unlike static compiler optimizations, however, the optimizations in a DBT have access to precise runtime information, which can be used to obtain higher-quality code. The main disadvantage of DBTs compared to static compilers is a much tighter optimization-time budget.
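As an illustration of one of the optimizations named above, the following is a minimal sketch of dead-code elimination over a single straight-line block. It assumes a hypothetical intermediate form in which each instruction is a (destination, sources) pair with no side effects, and in which the set of values live at the block exit is already known; real translators must also account for memory accesses, flags, and exceptions.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical IR: (destination register, source registers), no side effects.
Instr = Tuple[str, Tuple[str, ...]]


def eliminate_dead_code(block: List[Instr], live_out: Iterable[str]) -> List[Instr]:
    """Drop instructions whose result is never used (single backward pass)."""
    live: Set[str] = set(live_out)
    kept: List[Instr] = []
    for dest, srcs in reversed(block):
        if dest in live:
            # The result is needed: keep it and mark its inputs as live.
            live.discard(dest)
            live.update(srcs)
            kept.append((dest, srcs))
        # Otherwise the result is dead at this point and the
        # instruction is skipped.
    kept.reverse()
    return kept


if __name__ == "__main__":
    block = [
        ("t1", ("a", "b")),
        ("t2", ("a", "a")),   # t2 is never read afterwards
        ("r", ("t1", "c")),
    ]
    for instr in eliminate_dead_code(block, live_out={"r"}):
        print(instr)
```

A single backward pass is enough here because the block is straight-line; it is also cheap, which matters given the tight optimization-time budget of a DBT.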