1. Field of the Invention
The present invention is generally related to register usage optimization, and more particularly related to an apparatus and method for efficiently obtaining and utilizing register usage information for register optimization during software binary translation.
2. Description of Related Art
As is known in the computer and software arts, a software program is typically optimized during development to run on a particular computer architecture. While a software program developed for an original computer architecture may run on a computer system with a new architecture, a program optimized for the old architecture will generally not run as quickly on the new architecture, if it runs at all.
Therefore, devising a way to run an existing (i.e., old) architecture binary version of a computer program on a new architecture, or to improve the performance of the computer program on the existing architecture, is important. Several techniques are used in the industry to run the binary code of an old architecture on a new architecture. Four common techniques will now be discussed, from slowest to fastest: software interpreter, microcoded emulator, binary translator, and native compiler.
A software interpreter is a program that reads instructions of the old architecture, one at a time, performing each operation in turn on a software-maintained version of the old architecture's state. Interpreters are not very fast, but they run on a wide variety of machines and can faithfully reproduce the behavior of self-modifying programs, programs that branch to data, programs that branch to a checksum of themselves, etc. Caching interpreters gain speed by retaining predecoded forms of previously interpreted instructions.
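By way of illustration only, a caching interpreter of the kind described above can be sketched as follows. The three-opcode instruction set, the register names, and the function names `decode` and `interpret` are hypothetical and are not taken from any actual architecture; the sketch merely shows a software-maintained machine state and a cache of predecoded instructions keyed by program counter.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy "old architecture": raw instructions are (opcode, register, immediate).
Instruction = Tuple[str, str, int]
State = Dict[str, int]               # software-maintained copy of the old machine's state
Decoded = Callable[[State], None]    # predecoded form of one instruction


def decode(word: Instruction) -> Decoded:
    """Decode one raw instruction into a closure that applies it to the state."""
    op, reg, imm = word
    if op == "LOADI":                # reg <- imm
        def run(state: State) -> None:
            state[reg] = imm
    elif op == "ADDI":               # reg <- reg + imm
        def run(state: State) -> None:
            state[reg] += imm
    elif op == "JNZ":                # branch to address imm if reg != 0
        def run(state: State) -> None:
            if state[reg] != 0:
                state["pc"] = imm - 1    # the main loop adds 1 afterwards
    else:
        raise ValueError(f"unknown opcode: {op}")
    return run


def interpret(program: List[Instruction],
              cache: Optional[Dict[int, Decoded]] = None) -> State:
    """Execute the program one instruction at a time, reusing predecoded forms."""
    state: State = {"pc": 0, "r0": 0, "r1": 0}
    cache = {} if cache is None else cache
    while state["pc"] < len(program):
        pc = state["pc"]
        if pc not in cache:          # decode only on first visit
            cache[pc] = decode(program[pc])
        cache[pc](state)
        state["pc"] += 1
    return state


# A loop that adds 10 to r1 three times.
PROGRAM: List[Instruction] = [
    ("LOADI", "r0", 3),
    ("LOADI", "r1", 0),
    ("ADDI", "r1", 10),
    ("ADDI", "r0", -1),
    ("JNZ", "r0", 2),
]
```

In this sketch the loop body is decoded once and then reused on each later iteration, which is the source of the speed gain. A cache of this kind would need to be invalidated if the program modified its own instructions, a case the sketch does not handle.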
A microcoded emulator operates similarly to a software interpreter, but usually with a number of key hardware assists to decode the old instructions quickly, and to hold hardware state information in registers of the micromachine. An emulator is typically faster than an interpreter, but can run only on a specific microcoded new machine. This technique cannot be used to run existing code on a reduced instruction set computer (RISC) machine, since RISC architectures do not have a microcoded hardware layer underlying the visible machine architecture.
A translated binary program is a sequence of new-architecture instructions that reproduce the behavior of an old-architecture program. Typically, much of the state information of the old machine is kept in registers in the new machine. Translated code faithfully reproduces the calling standard, implicit state, instruction side effects, branching flow, and other artifacts of the old machine. Translated programs can be much faster than ones operated upon by interpreters or emulators, but slower than native-compiled programs.
Translators can be classified as either (1) bounded translation systems, or (2) open-ended translation systems. In bounded systems, all the instructions of the old program must exist at translation time and must be translated to new instructions. This usually requires manual intervention to find 100 percent of the code. In open-ended systems, program code may be discovered, created, or modified at execution time, and translation can generally be fully automatic.
A native-compiled program is a sequence of new-architecture instructions produced by recompiling the program. Native-compiled programs usually use newer, faster calling conventions than old programs. With a well-tuned optimizing compiler, native-compiled programs can be substantially faster than any of the other choices. However, this process requires the source code, which is not always available.
Most large programs are not self-contained; they call library routines, windowing services, databases, and toolkits, for example. These programs also directly, or indirectly, invoke operating system services. In simple environments with a single dominant library, it can be sufficient to rewrite that library in native code and to interpret user programs, particularly user programs that actually spend most of their time in the library. This strategy is commonly used to run Windows and Macintosh programs under the UNIX or LINUX operating system.
One requirement for binary translation is that the behavior of the binary code must not be changed. The state of the binary execution is stored in hardware registers and in memory locations. Consequently, no useful register value may be destroyed, because doing so would cause execution errors.
On the other hand, optimizing performance, or instrumenting a procedure for profiling, usually requires additional registers. One approach is for the translator to always save the register values before, and restore them after, the part of the code that uses the additional registers.
A better solution is to analyze the binary code to discover which registers do not contain useful (i.e., live) information. However, this approach incurs severe time overhead, and sometimes the code cannot be completely analyzed. In other cases, no free registers are discovered even though a large amount of time is spent on the analysis.
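As a minimal sketch of the kind of analysis referred to above, the following scans a single straight-line block and reports the registers that are overwritten before they are read, so that their values on entry to the block are not live. The instruction representation (a destination register and a list of source registers) and the function name `dead_on_entry` are assumptions made for illustration; a real translator would also have to follow branches and calls, which is where the time overhead and the incomplete results arise.

```python
from typing import List, Set, Tuple

# Hypothetical instruction form: (destination register, [source registers]).
Instr = Tuple[str, List[str]]


def dead_on_entry(block: List[Instr]) -> Set[str]:
    """Return registers whose first access in the block is a write.

    The entry value of such a register is never read inside the block,
    so it holds no useful information at block entry.
    """
    read_first: Set[str] = set()   # registers read before any write
    dead: Set[str] = set()         # registers written before any read
    for dest, sources in block:
        for src in sources:        # sources are read before the destination is written
            if src not in dead:
                read_first.add(src)
        if dest not in read_first:
            dead.add(dest)
    return dead


BLOCK: List[Instr] = [
    ("r1", ["r2", "r3"]),   # r1 <- r2 op r3
    ("r4", ["r1"]),         # r4 <- f(r1)
    ("r2", ["r4"]),         # r2 <- f(r4); r2 was read earlier, so it is live on entry
]
```

Registers that never appear in the block are not reported, because nothing can be concluded about them without examining the code that follows. This limitation is one reason such an analysis may fail to discover a free register.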
A third approach involves an agreement between the compiler, which is responsible for generating the original binary, and the translator. The compiler is limited to using certain registers, while other registers are left available to the translator, regardless of whether the translator needs that many registers.
There are numerous things that are important to consider during a software binary translation, such as register allocation and assignment. Register allocation and assignment, for almost all computer architectures, is among the most important of all optimization techniques. One goal of optimization is to minimize the traffic between the CPU registers, which are usually few and fast to access, and whatever lies within memory. This memory includes one or more levels of cache, and main memory, which is generally much slower to access but also larger in size. The main memory and cache memory generally increase in size and decrease in speed the further removed they are from the registers.
Register allocation determines which of the values (variables, temporaries, and large constants) might be better utilized if retained within the machine registers. Register allocation is important because the registers are almost always a scarce resource. There are rarely enough of them to hold all the objects that the programmer would like them to hold. Moreover, in RISC systems, almost all operations other than data movement operate entirely on register contents and not storage. In modern complex instruction set computing (CISC) implementations, register-to-register operations are significantly faster than those that take one or two memory operands.
Heretofore, software developers have lacked an efficient apparatus and method for accomplishing notification of register usage and register optimization during code translation.
To achieve the advantages and novel features, the present invention is generally directed to an apparatus and method for efficiently accomplishing register optimization during code translation. The present invention utilizes a technique that removes the time overhead of analyzing register usage, and removes fixed constraints on the compiler's register usage. This is accomplished by making the task of finding free registers more efficient through communication between the compiler and the translator.
In the present invention, the compiler produces a bit vector for each program unit (i.e., subroutine, function, and/or procedure). Each bit in the vector represents a particular caller-saved register. A bit is set if the compiler uses the corresponding register within that program unit. During the translation, the translator examines the bit vector to very quickly determine which registers are free, and therefore can be used during the register optimization, without having to save and restore the register values.
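The register usage bit vector described above can be illustrated with the following sketch. The eight register names, their bit positions, and the function names `make_usage_vector` and `free_registers` are hypothetical; the text does not specify an encoding, so the layout shown is only one possible arrangement.

```python
from typing import Iterable, List

# Hypothetical set of caller-saved registers; bit i corresponds to CALLER_SAVED[i].
CALLER_SAVED: List[str] = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]


def make_usage_vector(used: Iterable[str]) -> int:
    """Compiler side: set one bit for each caller-saved register the program unit uses."""
    vector = 0
    for name in used:
        vector |= 1 << CALLER_SAVED.index(name)
    return vector


def free_registers(vector: int) -> List[str]:
    """Translator side: list the registers whose bits are clear."""
    return [name for bit, name in enumerate(CALLER_SAVED)
            if not (vector >> bit) & 1]


# A program unit that uses only t0 and t3.
unit_vector = make_usage_vector(["t0", "t3"])
```

Under these assumptions, the translator's check is a handful of bit tests per program unit and requires no inspection of the unit's instructions.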
In another embodiment, the software program can be further optimized by taking a logical "OR" of the bit vectors of different program units (i.e., subroutines, functions, and/or procedures), where the resulting bit vector from the logical "OR" indicates which registers are free to be used for translation when the translator provides code for more than one program unit.
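The combination of bit vectors across program units can be sketched as follows. The function name `combined_free_bits`, the four-register width, and the example vectors are assumptions for illustration only; a bit that is clear in the combined vector corresponds to a register used by none of the program units.

```python
from typing import Iterable, List


def combined_free_bits(vectors: Iterable[int], num_registers: int) -> List[int]:
    """OR the per-unit usage vectors and return the bit positions left clear."""
    combined = 0
    for vector in vectors:
        combined |= vector           # a register is in use if any unit uses it
    return [bit for bit in range(num_registers)
            if not (combined >> bit) & 1]


# Hypothetical usage vectors for two program units, four registers wide.
UNIT_A = 0b0011   # uses registers 0 and 1
UNIT_B = 0b0110   # uses registers 1 and 2
```

With these example vectors, only register 3 is unused by both units, so it is the only register the translator could use across both without saving and restoring its value.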
An advantage is that the bit vector technique is particularly useful for performance-improving translations performed at runtime. Translation performance is improved because the analysis overhead that would directly reduce performance is not incurred. In the preferred method of the present invention, because the translator may inspect the bit vector very quickly, the overhead is dramatically reduced, which results in improved runtime performance. The preferred method of the present invention utilizes a data structure (a register usage bit vector) that is a vehicle (or communication channel) between a static compiler and a binary translator. The register usage bit vector is used to simplify the identification of free registers in the main transformation phase of the translator.