1. Field of the Invention
This invention relates to computer programs in general, and in particular, to a method and related system for implementing subroutine calls and returns in the context of binary translation of instructions in an original language into instructions in a target language, which may be the same as the original language.
2. Description of the Related Art
It is a well known fact that software constitutes a large fraction of the cost of computer systems. On the surface, this may seem surprising since, once developed, software can be installed and replicated without limit. The cost, however, stems from the difficulty of initial development, and the need for ongoing maintenance in the form of customization, defect elimination, and development of upgrade versions over the lifetime of the software. To give an indication of the magnitude of software engineering costs, consider that writing what is now considered a medium-sized software package may require hundreds of man-years of investment initially; moreover, following the first deployment, ongoing maintenance may demand comparable resources.
One of the hardest challenges in software engineering, be it initial development or subsequent maintenance, is the management of complexity. In particular, preventing a change or addition to one part of a system from having unforeseen and undesirable consequences in another part of the system can require significant effort. Consider, for example, that a large software system may contain millions of lines of program code, any one of which could potentially interact with any other, and it may be seen that the potential for errors is quite large. This is particularly true since no individual could write all the code, nor could any individual be familiar with all of it, once written. Early on, software developers and researchers recognized that in order to work effectively in such environments where individuals have only partial knowledge of the full system, systematic development techniques must be followed.
Perhaps the most widely employed development technique involves the decomposition of software into subroutines, also known as subprograms, functions, procedures, or methods. A subroutine comprises a number of program statements and optional data structures to perform a given task. The subroutine logically encapsulates the individual statements, allowing them to be invoked ("called") as a group from elsewhere in the program. The effect of the subroutine invocation is to execute the statements encapsulated in the subroutine. When the last such statement completes, execution returns to the point in the program where the subroutine was invoked.
With subroutines, then, instead of solving a top-level problem directly, programmers partition it into a number of smaller problems, such that a solution to the top-level problem can be obtained by combining solutions to the smaller problems: Each smaller problem's solution is encapsulated into a subroutine, enabling the large problem's solution to be expressed as a sequence of subroutine invocations. Often, but not necessarily, the decomposition follows a hierarchical pattern in which higher-level subroutines are implemented in terms of lower-level subroutines, which in turn are implemented from even lower-level subroutines, until the point where the problems have been partitioned sufficiently that solutions can be expressed directly using primitive statements from the programming language.
The use of subroutines provides multiple advantages in software development. First, complexity is reduced locally: The number of logical steps required to solve a given problem can be kept small because the solution can be expressed in terms of higher-level operations implemented in subroutines instead of in terms of the low-level primitives defined directly by the programming language. Second, complexity is reduced globally: Because subroutines encapsulate groups of statements, programmers can often reason about the interaction of subroutines rather than the interaction of individual statements across the program. Without this encapsulation, it would be very difficult to implement large-scale software systems. Third, subroutines allow for code reuse: Once a solution to a sub-problem has been implemented and made available as a subroutine, it can be used as a building block for solving many different problems; this greatly reduces the time required to implement software, since it is not necessary to start from scratch each time. It also reduces the size of programs, since general-purpose subroutines need only be provided once even though they are used in multiple places.
For all of these reasons, and more, the use of subroutines has become fundamental to software engineering. As a result, during execution of programs written in this manner, computers will execute a large number of subroutine calls and returns.
Consider now how subroutines may be implemented on contemporary computers. In other words, consider programming language implementation.
Most of the time, programmers write software in high-level programming languages such as Cobol, Fortran, Modula-2, C, C++, or Java. All of these languages provide subroutines in some form. While the details vary in terms of both syntax and semantics (especially with respect to parameter passing), many similarities remain. In particular, all these languages provide a "last-in, first-out" (LIFO) ordering on subroutine calls and returns: the last subroutine to have been called will be the first one to return. For example, let A, B, and C denote subroutines and suppose that A calls B, and B calls C. If a "return-from-subroutine" statement is executed, it will terminate the execution of subroutine C (the one called most recently) and execution will continue in subroutine B at the point that immediately follows the invocation of C. Later, a return statement in B may terminate B's invocation and take execution back to subroutine A.
Because subroutine execution respects this LIFO order, an efficient implementation can be realized by using a push-down stack. With this well known implementation technique, a subroutine invocation, such as A calling B, is performed in two steps. First, the return address is pushed onto the stack. Second, the program counter is updated to indicate the first statement of subroutine B, that is, execution "jumps" to the beginning of B. The execution of B now proceeds without regard to where it was invoked from. Eventually, a return statement in B will be encountered or, equivalently, the last statement in B will complete. In either case, to return back to its caller, subroutine B need only perform a single step: it pops the top-most item from the stack, which will be the address to which it should return, and places this value in the program counter register. Now, instruction fetching and execution will continue from the point in the caller (A in the example) that follows the call to B.
The use of a stack provides a high degree of generality. A subroutine, such as B, can correctly return back to its caller A, even if B, during its execution, performs further subroutine invocations. For instance, if A calls B, then the stack will contain "A" when the execution of B commences. Now, if B later calls C, then the stack will contain two return addresses "A; B" where B is the most recent (top-most) item. When C returns, it will pop the topmost item from the stack (B) leaving just "A" on the stack. This is the same state as before the call of C in B, so following the invocation and completion of subroutine C, B can execute to completion and return back to A by popping the return address from the stack in the usual manner. (Merely for the sake of notational simplicity, one may equate return addresses with the caller subroutine; in actual implementations, the return addresses must indicate the precise statement within the caller to which execution should return. Often, but not necessarily, this return address will be represented as a memory address.)
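The A, B, C example can be traced with a short sketch. It is only an illustrative model, not part of any actual implementation: the helper name `call` and the `trace` list are assumptions introduced here, and return addresses are equated with the name of the caller, following the notational shortcut described above.

```python
# Minimal model of stack-based calls and returns.
# Assumption: a return address is represented by the caller's name.
stack = []
trace = []  # snapshot of the stack each time a callee starts executing

def call(caller, callee_body):
    stack.append(caller)        # step 1: push the return address
    trace.append(list(stack))   # record what the callee sees on entry
    callee_body()               # step 2: "jump" to the callee
    return stack.pop()          # return: pop the address to resume at

def C():
    pass                        # C makes no further calls

def B():
    resumed_in = call("B", C)   # B calls C; the stack holds A; B
    assert resumed_in == "B"    # C returns into B

resumed_in = call("A", B)       # A calls B; the stack holds A
print(trace)                    # [['A'], ['A', 'B']]
print(resumed_in, stack)        # A []
```

After C returns, the stack again holds only "A", which is the state B needs in order to return to A.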
The stack implementation of subroutine calls and returns also allows a subroutine to invoke itself. This ability leads to a powerful programming technique known as recursion, and makes possible elegant solutions to a number of problems. During recursive subroutine invocations, the stack will contain a repeated sequence of return addresses " . . . A; A; . . . ; A", but more general patterns involving mutual recursion between two or more different subroutines can also be handled in the stack implementation of calls and returns.
Fundamental to many programming languages and uses of subroutines are the concepts of local state and parameters. Because these concepts are well understood in the art, they will be discussed only briefly here. Most subroutines make use of local variables for carrying out their computations. In many implementations, it is desirable to allocate storage for local variables on the same stack that holds the return addresses. The stack allocation provides two advantages. First, storage is only committed to a subroutine's variables when the subroutine is active. Second, recursive subroutines can have multiple instances of these variables (one per invocation), thereby preventing awkward interference that would result if recursive invocations were to share the local variables. In addition to supporting local variables, subroutine invocation mechanisms will often also provide a mechanism for passing parameters from the caller to the called subroutine. Simplifying slightly, one may think of these parameters as a form of local variables that are initialized by the caller.
It is common to handle all these facets of subroutine invocation by using the concept of activation records (frames). An activation record is a consecutive range of storage on the stack. The activation record contains fields for the return address, the parameters and the local variables. Each subroutine invocation will push one activation record on the stack, and each subroutine return will pop one activation from the stack. Activation records appear and disappear in the same LIFO order as subroutine invocations begin and end.
Without loss of generality, and for the sake of clarity, one may think of and refer to subroutine invocations as pushing and popping simple program counters rather than full activation records. This convention is followed below.
Non-stack-based techniques also exist for implementing subroutine calls, but they have limitations that make them less desirable than the stack approach, except when special circumstances call for their use. In one alternative technique, for example, the caller subroutine writes the return address into a known location that is associated with the called subroutine. The called subroutine then performs a return by setting the program counter to the value found in that known location. Since there is only one such location per subroutine, recursion cannot be supported. In another alternative, subroutine invocations construct a "linked list" of activation records in an object heap. In this case, recursion can be supported, but the costs of heap-allocating and reclaiming the activation records tend to be higher than the costs of using a stack. Unless there are other compelling reasons for using heap allocation, the stack approach is therefore usually considered to be superior.
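The limitation of the first alternative can be seen in a small sketch. The names `return_slot`, `call` and `ret`, and the symbolic addresses "main+1" and "F+7", are hypothetical and chosen only for illustration.

```python
# One return-address slot per subroutine (no stack).
# Assumption: addresses are symbolic strings; "F" is a recursive subroutine.
return_slot = {}

def call(callee, return_address):
    return_slot[callee] = return_address   # overwrite the callee's single slot

def ret(callee):
    return return_slot[callee]             # resume wherever the slot says

call("F", "main+1")   # main calls F
call("F", "F+7")      # F calls itself, overwriting the slot
inner = ret("F")      # inner invocation returns into F: correct
outer = ret("F")      # outer invocation should return to main+1 ...
print(inner, outer)   # F+7 F+7  (the return to main has been lost)
```

The second call destroys the only record of where the first invocation should return, which is why this scheme cannot support recursion.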
Because the stack implementation offers significant advantages, it has found widespread use. In turn, and because of this frequent use, most modern CPUs implement direct support for stack-based calls and returns in their machine-code language. These low-level ("hardware") call and return instructions are designed to facilitate translation of high-level programming language subroutine invocations and returns into their low-level counterparts. While the hardware instructions do not implement the full semantics of subroutine invocations as found in many high-level languages and the translation therefore requires the use of additional instructions (for example, for passing parameters), the hardware support for calls and returns nonetheless ensures that subroutine invocations can be implemented very efficiently. Indeed, programmers have come to rely on efficient subroutine mechanisms, thereby completing the feedback cycle from programming style to hardware design and back to programming style.
There are certain situations, however, in which hardware call and return instructions cannot easily, if at all, be used directly to implement subroutine calls and returns. One such situation arises in the context of binary translation.
To understand binary translation, one must distinguish between programs in "source" form, as opposed to "binary" form. The source is the representation that programmers create, modify and extend. Binary programs, on the other hand, result from translation of source into a form optimized for execution. Usually, the binary form is densely encoded and non-textual (from which it derives its name), comprising bit-patterns that represent machine instructions. Besides the textual versus non-textual representation difference, symbolic names in source code may be replaced by absolute memory addresses in binary code, comments found in source code may be absent from binary code, and other information that is inessential for execution may be removed. The process by which binary code is obtained from source code is usually fully automatic and is known as compilation for high-level source languages and assembly for low-level machine code source languages ("assembler languages").
Consider now what would happen if the need were to arise to execute a given program on a platform different from the one for which it was originally developed. There may be several reasons why this could happen, for example, the original platform may no longer be available or economically attractive. Ordinarily, the program would need to be "ported" to the new platform. For programs written directly in the machine code of the original platform, porting may entail an almost complete rewrite of the program since the source program may be intimately tied to the original platform. Thus, the porting effort may be substantial and costly.
The situation is somewhat better for programs written in a high-level language and subsequently compiled into binary form. Often, in this case, the bulk of the source code needs only a few modifications before it can be recompiled for the new platform. On the other hand, recompilation rarely accomplishes 100% of the porting task; several things can get in the way.
Most programs depend on other software, including systems software, for performing basic tasks like file input and output, or application software libraries such as graphical user interfaces. Sometimes, these libraries are unavailable on the new platform and thus require the porting effort to extend beyond the core piece of software. In other cases, parts of the source code for the original application may have been lost, or over time may have become outdated as problems were corrected and extensions added to the software by "patching" the binary program. In yet other cases, no compiler may be available on the new platform for the source programming language. The porting effort, then, must include a source-to-source transformation, the porting of a compiler for the programming language, or a rewrite of the program. Thus, it may be appreciated that in many cases, the costs of porting a program from one platform to another may be substantial, even if secondary effects such as the need to retest and validate the software on the new platform are ignored.
In this situation, binary translation may be an attractive alternative to program-for-program porting. In a binary translation system, a piece of controlling software, namely, the binary translator, is placed between the hardware of the new platform and the binary of the program for the old platform. Stated simply, the binary translator will translate an old-platform binary program instruction-by-instruction into equivalent instructions for the new platform, in some implementations also interleaving the translation process with the execution of the resulting new-platform instructions. At the loss of some efficiency due to the binary translation process, this provides the general ability to execute old-platform binaries in an unmodified (and unported) form on the new platform.
Research prototypes as well as commercially available binary translators have been built for a number of systems, including the FX!32 translator from Compaq/DEC, which allows execution of Intel x86 binaries on an Alpha processor, and the MAE system, which allows execution of Macintosh programs on Solaris/SPARC platforms. Binary translation has also been used to allow older Macintosh 68K programs to execute on newer PowerPC Macintosh computers. Perhaps the most common use of binary translation is found in high-performance Java virtual machines, which translate Java byte-code into instructions that can be executed directly by the underlying hardware. Representative articles describing binary translation include:
"The Design of a Resourceable and Retargetable Binary Translator," Cristina Cifuentes, Mike Van Emmerik, Norman Ramsey, Proceedings of the Sixth Working Conference on Reverse Engineering, Atlanta, USA, October 1999, IEEE-CS Press, pp. 280-291;
"Compiling Java Just in Time," Timothy Cramer, Richard Friedman, Terrence Miller, David Seberger, Robert Wilson, and Mario Wolczko, IEEE Micro, May/June 1997;
"DAISY: Dynamic Compilation for 100% Architectural Compatibility," Kemal Ebcioglu and Erik R. Altman, 24th Annual International Symposium on Computer Architecture, Denver, Colorado, June 1997, pp. 26-37; and
"Binary Translation," Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson, Communications of the ACM, 36(2), February 1993.
Binary translation offers valuable capabilities even when employed within a single hardware platform, that is, when the input and output instruction set of the binary translator are identical.
Different binary translators may offer a variety of capabilities and make different assumptions about the input binary programs. For example, one class of binary translators, to which the aforementioned FX!32 and the Java translators belong, assume that the code being translated is at "user level," containing only code that executes in a restricted mode on the CPU and adhering to the (binary) application program interfaces (APIs) defined by the combination of the hardware and operating systems of the platform. Other binary translators, such as HP's Dynamo optimizing binary translator, make the further assumption that programs use no reflective operations, ruling out actions such as explicitly accessing or manipulating the return address entries on the stack by means other than performing subroutine invocations and returns.
Essentially, the more assumptions the binary translator makes, the fewer restrictions remain on how the binary program can be transformed in the translation step, allowing for higher performance. In an ideal world, assumptions would be unnecessary, and analysis could extract facts about the program being processed, thereby allowing maximal efficiency within the constraints set by the behavior of each program. The present state of the art in binary code analysis, however, provides only limited capabilities and often incurs considerable analysis costs. Thus, the differentiation between binary translators that make fewer versus more assumptions is justified, at least for the time being.
Binary translators that make no assumptions about the behavior of the translated program may be termed "unrestricted." Such unrestricted translators generally need to fully preserve the illusion that the binary program is executing on the original platform, despite the fact that binary translation is being used. When no assumptions about the behavior of the program are made, this generally requires that the binary translator should faithfully preserve all data structures in memory as they would have appeared, had the program been executing on the original platform.
Consider an unrestricted binary translator that processes a sequence of instructions in an input binary language (IL) generated by a guest system into a corresponding sequence of instructions in the output binary language (OL) of a host system. For example, the IL might be the instruction set specified by the SPARC v9 architecture (see "The SPARC Architecture Manual," David L. Weaver, Tom Germond (Eds.), PTR Prentice Hall, Englewood Cliffs, N.J., 1994), and the OL might be the instruction set specified by the Intel Pentium architecture, commonly referred to as "x86" (see "Pentium Pro Family Developer's Manual," Volumes 1-3, Intel Corporation, 1996).
Note that it is also possible for IL and OL to be the same language. For the sake of clarity and to minimize the notational burden, without loss of generality, it is assumed in the following discussion that IL and OL both refer to x86-like languages. While actual computer systems may provide instruction sets that differ in some ways, the semantics used below for IL and OL call and return instructions are representative of almost all modern instruction set architectures (ISAs); those skilled in the art will easily be able to apply the teachings of the various aspects of the invention to any given IL and OL.
Now recall the effect and possible translations of call and return instructions in the IL language. The most common form of call instruction is as follows (text after ";" is a comment):
When executed, this call instruction will:
1) Push the address R of the following instruction onto the stack.
2) Set the program counter (PC), which on x86 platforms is named %eip, to the address P.
FIG. 1 illustrates the contents of the stack before this call. FIG. 2 illustrates the contents of the stack after execution of the call instruction. Note that, in the x86 instruction set, "%esp" designates the top-of-stack pointer register and that, on the x86, stacks grow from higher toward lower addresses.
In other words, in the x86 ISA, the effect of the call instruction is to push the return address R onto the stack. Now the subroutine at address P executes, possibly making use of the stack to hold temporary data or make further calls, that is, possibly pushing additional items onto the stack (but ordinarily never allowing the top of the stack to recede back over the cell containing "R"). By the time that the subroutine at P has completed and is ready to return, the stack must have returned to the state shown in FIG. 2. To return, the subroutine executes:
which will pop the topmost element from the stack into the program counter %eip, that is, it will set %eip equal to R and update %esp so that the stack becomes as illustrated in FIG. 3.
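The effect of the call and return instructions on %esp and %eip can be modeled with a brief sketch. It is a simplified illustration only: the class name `Machine`, the sparse dictionary used for memory, and the concrete addresses chosen for P, R and the initial stack pointer are assumptions made for this example.

```python
WORD = 4  # bytes per stack cell on 32-bit x86

class Machine:
    """Toy model of the two registers involved in call and return."""
    def __init__(self, esp):
        self.memory = {}   # sparse memory: address -> stored value
        self.esp = esp     # top-of-stack pointer
        self.eip = None    # program counter

    def call(self, target, return_address):
        self.esp -= WORD                        # stack grows toward lower addresses
        self.memory[self.esp] = return_address  # 1) push R
        self.eip = target                       # 2) set the PC to P

    def ret(self):
        self.eip = self.memory[self.esp]        # set the PC to the value on top of the stack
        self.esp += WORD                        # pop that value

P, R = 0x400, 0x104          # hypothetical addresses
m = Machine(esp=0x1000)
m.call(P, R)
after_call = (m.esp, m.memory[m.esp], m.eip)
m.ret()
print([hex(v) for v in after_call])   # ['0xffc', '0x104', '0x400']
print(hex(m.esp), hex(m.eip))         # 0x1000 0x104
```

After the return, %esp has its pre-call value again and %eip holds R.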
An unrestricted binary translator will generally have to translate an IL-call instruction into a sequence of OL-instructions that have the same effect on the stack as the IL-call would have had, had it executed directly on the IL platform. For example, the call may be translated like this:
This translation of the call is very efficient: It causes only a minimal slow-down (two OL instructions versus one IL instruction) and faithfully preserves all IL state. In this and the following examples, the arrow denotes translation of an IL instruction into one or more OL instructions. Single quotes (′) are used to indicate addresses in the output (translated) domain. In the example above, the subroutine at address P in the IL domain has been translated into OL instructions that are placed at address P′.
In general, unrestricted binary translators must assume that the program being translated may inspect its own code, so the translator places the OL code at a different address than the IL code in order to keep the IL code accessible and unchanged. One way to ensure this is for the binary translator to store the OL instructions in a translation cache located in an area of memory isolated from, for example, beyond the addressable memory limits of, the original program. Moreover, OL-instruction sequences may be longer than IL-sequences, so even if no self-inspection takes place, lack of space may rule out placing OL instructions at the original IL addresses.
Note that to faithfully preserve all state in memory, including the stack, the translated instructions must push the untranslated return address "R" onto the stack: Before returning from the subroutine call, the IL program might execute instructions to inspect the value at the top of the stack. Since this stack location would contain the return address R absent binary translation, it must also contain R even with binary translation.
Consider now the translation of a return. A return has two effects. First, it sets the program counter %eip to the value at the top of the stack. Second, it pops this value by updating the stack pointer so that it points to the next higher (or lower, as the case may be) address on the stack; in other words, %esp:=%esp+1. (Note that in the case of the x86 architecture, in which each word is four bytes long, the actual instruction would be %esp:=%esp+4. Incrementing by one is assumed in this discussion merely for the sake of simplicity. The actual amount by which the stack pointer is incremented (or, equivalently, decremented, depending on the architecture) will of course depend on the architecture for which the invention is implemented.) The second effect (updating %esp) can be easily achieved in the translated domain OL.
Achieving the first effect is, however, harder, because it is necessary to set the machine's program counter to the translated return address R′. But the top of the stack, by the above translation of the call, does not contain R′, but rather R. If the system were to set %eip to R, then execution would incorrectly proceed to execute untranslated IL code after the return. The code produced by the translator for returns must therefore map the IL return address found on the top of the stack to an OL return address. This requires a translation of returns of this schematic form:
Here, "ILToOLAddress( )" is a place-holder for an OL instruction sequence that maps an IL address to an OL translated address. This instruction sequence can be long. In order to perform the mapping efficiently, one or more registers may be required. Before registers can be used, however, their current contents must be saved to memory so that after the temporary needs of the return translation have been fulfilled, the registers can be restored to the values that the binary program expects. To illustrate, assume that two registers, %eax and %ebx, and the processor status register, %flags, will be used by the return translation sequence. Mathematically, the mapping from IL to OL addresses can be represented as a set of IL/OL address pairs, with the further property that there is at most one OL address associated with any given IL address. To map an IL address to an OL address, the system locates the unique pair whose first component is the given IL address; the sought OL address is then the second component of that pair.
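The interplay between the translated call, which leaves the untranslated address R on the stack, and the translated return, which must map R to R′, can be sketched as follows. This is an illustration under stated assumptions rather than the translator's actual code: addresses are symbolic strings (with `'` standing for the prime mark), the mapping is held in a Python dictionary named `il_to_ol`, and the function names are hypothetical.

```python
# Hypothetical set of IL/OL address pairs, at most one OL address per IL address.
il_to_ol = {"P": "P'", "R": "R'"}
stack = []  # the program's own (primary) stack

def translated_call(il_return, il_target):
    stack.append(il_return)      # push the untranslated R, exactly as on the IL platform
    return il_to_ol[il_target]   # continue at P', the translation of the subroutine

def translated_return():
    il_return = stack.pop()      # the stack holds R, not R'
    return il_to_ol[il_return]   # stands in for ILToOLAddress(R)

pc = translated_call("R", "P")
top_seen_by_program = stack[-1]  # what the IL program would observe on the stack
pc_after_return = translated_return()
print(pc, top_seen_by_program, pc_after_return)   # P' R R'
```

The dictionary lookup hides the cost discussed in the text: in generated OL code the same lookup expands into a multi-instruction sequence that must also save and restore the registers it uses.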
Standard techniques for implementing mappings from one set of values (for example, IL addresses) to another set of values (for example, OL addresses), use data structures such as hash tables, binary trees, or, for small mappings, flat arrays searched linearly. The problem with all these data structures is that even though they are optimized, they are still relatively slow when used in the translation of return instructions: A single IL return instruction is converted into a sequence of OL instructions that, among other things, perform the mapping from IL to OL addresses. Because of this expansion, whereas the original IL return may execute in just a handful of CPU cycles, the OL sequence could require dozens of cycles. Moreover, since subroutine calls and returns are very frequent, the result is a noticeable slowdown when programs execute in a binary translation system.
Other techniques for implementing control-flow changes, including returns, involve variations of a technique known as inline caching, which was first introduced by Deutsch and Shiffman in "Efficient Implementation of the Smalltalk-80 System," Conference Record of the Eleventh Annual ACM Symposium on Principles of Programming Languages, pp. 297-302, Salt Lake City, Utah, 1984. According to these techniques, at the control-flow transfer site (for example, at the site of a translated return), the last translated target to which the transfer went is cached. When sufficient locality exists, such that transfers repeatedly go to the same target (or targets), these inline caches can yield very high performance. Empirical studies have indicated, however, that these techniques are prone to high miss rates when employed for returns, at least for some code. When the miss rates become too high, performance will be dominated by the slower backup strategies that handle misses, which potentially cause more performance to be lost in the "miss" cases than was gained over the conventional solution in the "hit" cases. For example, the inline caches will tend to miss when multiple callers alternate to call a subroutine, because every return would be transferring back to a different target than the previous time.
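The miss behavior with alternating callers can be made concrete with a small sketch of a single-entry inline cache at one translated return site. The class name `ReturnSite`, the dictionary `il_to_ol` standing in for the slower backup mapping, and the symbolic addresses are all assumptions for illustration.

```python
# Hypothetical IL -> OL return-address map, used as the slow backup path.
il_to_ol = {"R1": "R1'", "R2": "R2'"}

class ReturnSite:
    """One translated return site with a single-entry inline cache."""
    def __init__(self):
        self.cached_il = None
        self.cached_ol = None
        self.hits = 0
        self.misses = 0

    def resolve(self, il_return):
        if il_return == self.cached_il:       # same target as last time: fast path
            self.hits += 1
            return self.cached_ol
        self.misses += 1                      # different target: slow backup path
        self.cached_il = il_return
        self.cached_ol = il_to_ol[il_return]
        return self.cached_ol

same_caller = ReturnSite()
for il in ["R1"] * 6:                         # one caller, six calls
    same_caller.resolve(il)

alternating = ReturnSite()
for il in ["R1", "R2"] * 3:                   # two callers taking turns
    alternating.resolve(il)

print(same_caller.hits, same_caller.misses)   # 5 1
print(alternating.hits, alternating.misses)   # 0 6
```

With a single caller only the first return misses, whereas two alternating callers defeat the cache on every return.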
Yet another prior art technique involves using a second stack separate from the primary stack introduced above. To avoid confusion, one may refer to this second stack as the "shadow" stack. Commonly, shadow stacks have been employed in hardware, where they are often denoted by names like "on-chip" return stacks. In principle, however, they can also be implemented in software, which is the form described here.
A shadow stack is stored in a memory area separate from the primary stack and has its own stack pointer. The shadow stack may be of smaller capacity than the primary stack, in which case it will occasionally be unable to provide assistance when returning. When used in binary translation, the shadow stack is hidden from the program being executed using binary translation. Similarly, in hardware designs, the "on-chip" return stack is often non-architected state, meaning that there are no instructions to manipulate it directly.
Using a shadow stack, one may translate a call into the following schematic sequence:
push IL return address R on primary stack
push OL return address R′ on shadow stack
jump to entry point of the translated subroutine
Returns may be translated to:
pop IL return address R from the primary stack
pop OL return address R′ from the shadow stack
verify that the IL and OL items "match", that is, that the R′ popped from the shadow stack corresponds to the R popped from the primary stack;
jump to the OL return address R′ obtained in the second step
The underlying assumption here is that testing whether IL and OL addresses correspond to each other can be done faster than computing one (OL) from the other (IL). One way to speed up the step of verifying the correspondence between R and R′ is to push both R (IL) and R′ (OL) onto the shadow stack as a pair. Because stack operations are fast, this is an efficient way to ensure that the "correct" IL/OL pairings are available to the system. The following schematic instruction sequence can be used to accomplish this:
push IL return address R on primary stack
push OL return address R′ on shadow stack
push R on the shadow stack
jump to entry point of the translated subroutine
Returns may be translated to:
pop IL return address R from the primary stack
pop X from the shadow stack
pop X′ from the shadow stack
verify that X = R and
if so, then jump to X′, which will be the correct R′
if not, then map the IL address to the correct OL address and jump to the OL address
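The paired call and return sequences above can be sketched in software as follows. This is a minimal illustration in Python rather than any translator's actual code: the names translated_call, translated_return and IL_TO_OL, the example addresses, and the dictionary standing in for the slow IL-to-OL mapping are all assumptions made for the example.

```python
# Hypothetical slow IL -> OL mapping maintained by the translator.
# In a real system this lookup is the expensive step to be avoided.
IL_TO_OL = {0x1004: 0x9004, 0x2008: 0x9108}

primary_stack = []  # architected stack, visible to the translated program
shadow_stack = []   # hidden stack holding R' followed by R for each call


def translated_call(r, r_prime):
    """Call sequence: record R on the primary stack and (R', R) on the shadow stack."""
    primary_stack.append(r)       # push IL return address R on primary stack
    shadow_stack.append(r_prime)  # push OL return address R' on shadow stack
    shadow_stack.append(r)        # push R on the shadow stack


def translated_return():
    """Return sequence: yield (OL target, 'hit' or 'miss')."""
    r = primary_stack.pop()       # pop IL return address R
    if len(shadow_stack) >= 2:    # boundary check against underflow
        x = shadow_stack.pop()        # pop X
        x_prime = shadow_stack.pop()  # pop X'
        if x == r:
            return x_prime, "hit"     # X' is the correct R'
    # Mismatch or empty shadow stack: fall back to the slow mapping.
    return IL_TO_OL[r], "miss"
```

In this sketch a well-nested call and return pair yields a hit, whereas a program that overwrites its own return address on the primary stack causes X and R to differ, so the return falls back to the slow mapping.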
The shadow stack scheme is particularly attractive in hardware because the two push operations in the call sequence and the two pop operations in the return sequence can run in parallel. As a software technique, shadow stacks have certain disadvantages. For example, the cost of pushing items onto the shadow stack (in the call sequence) and popping items from the shadow stack (in the return sequence) can be substantial. It is usually not possible to dedicate a processor register to hold the shadow stack pointer, so this stack pointer frequently will be loaded from memory and saved back to memory. Moreover, boundary checks to prevent shadow stack overflow or underflow may also add costs.
What is needed is a system and a method that enables translation of calls and returns in a manner that a) overcomes the slowness of traditional mappings from IL to OL addresses; b) is less prone than inline caches to high miss rates; and c) permits a more efficient translation of calls than does a software implementation of a shadow stack. This invention accomplishes this.
According to the invention, subroutine calls and returns are implemented in a computer system by first converting a sequence of input language (IL) instructions of a guest system into a corresponding sequence of output language (OL) instructions of a host system, which executes the OL instructions. Conversion is preferably done by a binary translator. For each call to an IL subroutine made from an IL call site in the IL instruction sequence, a correct IL return address R is stored on a stack. A first hint index is calculated, preferably by evaluating a predetermined hint function with R as an argument, and preferably also as part of the IL-to-OL instruction conversion step. A corresponding correct OL return address R′ is stored in a return target cache at a location determined by the first hint index and the OL subroutine translation of the called IL subroutine is executed.
Upon completion of execution of the OL subroutine translation, a current value is retrieved from the stack; a second hint index is calculated by evaluating the hint function with the value retrieved from the stack as the argument; a target address is retrieved from a location in the return target cache determined by the second hint index; and execution is then continued, beginning at the target address.
In the preferred embodiment of the invention, if the target address is not the correct OL return address, then execution is transferred to a back-up return address recovery module, which reconstructs the correct OL return address using a predetermined, secondary address recovery routine.
In the most common case, in which there is a plurality of IL call sites, the system, in particular, the preferred binary translator, translates each IL call site into a corresponding OL call site and inserts a confirmation block of instructions into each OL call site. Whenever any confirmation block of instructions is executed, the value retrieved from the stack is compared with the correct IL return address corresponding to the current OL call site. If the value retrieved from the stack is equal to the correct IL return address, then execution of the OL instructions is continued following the OL call site. If, however, the value retrieved from the stack is not equal to the correct IL return address, then execution is transferred to the back-up return address recovery module.
In the preferred embodiment of the invention, the return target cache is an array that has a plurality of elements. The return target cache is preferably initialized by storing in each element the beginning address of the back-up return address recovery module.
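The call, return, confirmation and initialization steps summarized above can be modeled together in a few lines. The following Python sketch is illustrative only and makes several assumptions: the cache has 64 elements, the hint function is a bit mask, OL addresses are represented by strings, and the names RECOVERY, IL_TO_OL, SITE_IL, translated_call and translated_return are hypothetical. The dictionary IL_TO_OL stands in for the secondary address recovery routine, and SITE_IL models the IL return address that each confirmation block compares against.

```python
CACHE_SIZE = 64                # assumed number of elements (power of two)
MASK = CACHE_SIZE - 1          # assumed predetermined constant


def hint(il_address):
    """Assumed hint function: bitwise AND of the address with a constant."""
    return il_address & MASK


RECOVERY = "recovery"          # stands in for the recovery module's entry address

# Hypothetical full IL -> OL map, consulted only by the recovery path.
IL_TO_OL = {0x1004: "ol_site_a", 0x1044: "ol_site_b", 0x2010: "ol_site_c"}
# Correct IL return address known to each OL call site's confirmation block.
SITE_IL = {ol: il for il, ol in IL_TO_OL.items()}

stack = []                            # the program's stack
rtc = [RECOVERY] * CACHE_SIZE         # every element initially points at recovery


def translated_call(r, r_prime):
    stack.append(r)                   # store correct IL return address R
    rtc[hint(r)] = r_prime            # store R' at the first hint index


def translated_return():
    value = stack.pop()               # current value retrieved from the stack
    target = rtc[hint(value)]         # target address at the second hint index
    if target == RECOVERY:            # slot never written: recovery module runs
        return IL_TO_OL[value], "non-hit"
    if SITE_IL[target] == value:      # confirmation block at the OL call site
        return target, "hit"
    return IL_TO_OL[value], "non-hit"  # wrong call site reached: recover
```

In this model the addresses 0x1004 and 0x1044 share the same hint index, so a nested call from the second site overwrites the entry stored for the first. The inner return still hits, while the outer return reaches the wrong confirmation block and is redirected through recovery.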
The hint function preferably maps IL return addresses substantially uniformly over the return target cache. A particularly efficient and fast hint function used in the preferred embodiment of the invention forms a bitwise logical AND between bits of the IL return address R and a predetermined constant.
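To illustrate what uniform mapping means for a mask-based hint function, the following Python sketch counts how many return addresses land in each cache element. The cache size of 16, the mask value, the helper slot_usage, and both sets of example addresses are assumptions chosen for the illustration, not values taken from the preferred embodiment.

```python
from collections import Counter

CACHE_SIZE = 16            # assumed number of cache elements (power of two)
MASK = CACHE_SIZE - 1      # assumed predetermined constant: keeps the low 4 bits


def hint(il_return_address, mask=MASK):
    """Bitwise AND between the IL return address and a constant."""
    return il_return_address & mask


def slot_usage(addresses, mask=MASK):
    """Count how many of the given addresses map to each hint index."""
    return Counter(hint(a, mask) for a in addresses)


# Hypothetical return addresses at irregular byte offsets.
varied = [0x400000 + 5 * i for i in range(160)]
# Hypothetical return addresses that all share the same low four bits.
aligned = [0x400000 + 16 * i for i in range(160)]

print(slot_usage(varied))   # spread evenly over all 16 indices
print(slot_usage(aligned))  # every address collides at index 0
```

The second address set shows why the constant matters: if the selected bits do not vary among return addresses, every call shares a single element, so a suitable constant would presumably select bits that differ from one call site to the next.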
Let equality and inequality between the value retrieved from the stack and the correct IL return address be defined as a “hit” and a “non-hit,” respectively. The invention may also include a feature that reduces the likelihood of non-hits by adjusting the size of the return target cache according to a predetermined function of a return success measure, which measures the frequency of occurrence of hits relative to the frequency of occurrence of non-hits.
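One possible form of such a size adjustment is sketched below in Python. The policy shown is an assumption made for illustration and is not specified in the text above: hits and non-hits are counted over a fixed window of returns, and the cache is doubled, up to a limit, whenever the hit ratio in a window falls below a threshold. The class name, the default parameters, and the choice to refill the enlarged cache with the recovery address are likewise hypothetical.

```python
RECOVERY = "recovery"  # placeholder for the recovery module's entry address


class AdaptiveReturnTargetCache:
    """Return target cache whose size grows when too many returns are non-hits."""

    def __init__(self, size=4, max_size=64, window=10, min_hit_ratio=0.9):
        self.size = size                    # current number of elements (power of two)
        self.max_size = max_size            # assumed upper bound on growth
        self.window = window                # returns observed per measurement
        self.min_hit_ratio = min_hit_ratio  # assumed threshold for the success measure
        self.entries = [RECOVERY] * size
        self.hits = 0
        self.non_hits = 0

    def hint(self, il_return_address):
        # The mask follows the current size, so indices stay in range after growth.
        return il_return_address & (self.size - 1)

    def record(self, hit):
        """Record the outcome of one return and resize at the end of each window."""
        if hit:
            self.hits += 1
        else:
            self.non_hits += 1
        total = self.hits + self.non_hits
        if total < self.window:
            return
        ratio = self.hits / total           # return success measure for this window
        if ratio < self.min_hit_ratio and self.size < self.max_size:
            self.size *= 2
            # Refill with the recovery address; old entries used the old mask.
            self.entries = [RECOVERY] * self.size
        self.hits = 0
        self.non_hits = 0
```

A return path would call record(True) or record(False) after each confirmation check. Refilling the enlarged cache is safe in this model because an element holding the recovery address only costs one slow lookup.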