1. Field of the Invention
This invention relates to computer programs in general, and in particular, to a method and related system for implementing subroutine calls and returns especially in the context of a virtualized computer running on a host.
2. Description of the Related Art
It is a well known fact that software constitutes a large fraction of the cost of computer systems. On the surface, this may seem surprising since, once developed, software can be installed and replicated without limit. The cost, however, stems from the difficulty of initial development, and the need for ongoing maintenance in the form of customization, defect elimination, and development of upgrade versions over the lifetime of the software. To give an indication of the magnitude of software engineering costs, consider that writing what is now considered a medium-sized software package may require hundreds of man-years of investment initially; moreover, following the first deployment, ongoing maintenance may demand comparable resources.
One of the hardest challenges in software engineering, be it initial development or subsequent maintenance, is the management of complexity. In particular, preventing a change or addition to one part of a system from having unforeseen and undesirable consequences in another part of the system can require significant effort. Consider, for example, that a large software system may contain millions of lines of program code, any one of which could potentially interact with any other, and it may be seen that the potential for errors is quite large. This is particularly true since no individual could write all the code, nor could any individual be familiar with all of it, once written. Early on, software developers and researchers recognized that in order to work effectively in such environments where individuals have only partial knowledge of the full system, systematic development techniques must be followed.
Perhaps the most widely employed development technique involves the decomposition of software into subroutines, also known as subprograms, functions, procedures, or methods. A subroutine comprises a number of program statements and optional data structures to perform a given task. The subroutine logically encapsulates the individual statements, allowing them to be invoked (“called”) as a group from elsewhere in the program. The effect of the subroutine invocation is to execute the statements encapsulated in the subroutine. When the last such statement completes, execution returns to the point in the program where the subroutine was invoked.
With subroutines, then, instead of solving a top-level problem directly, programmers partition it into a number of smaller problems, such that a solution to the top-level problem can be obtained by combining solutions to the smaller problems: Each smaller problem's solution is encapsulated into a subroutine, enabling the large problem's solution to be expressed as a sequence of subroutine invocations. Often, but not necessarily, the decomposition follows a hierarchical pattern in which higher-level subroutines are implemented in terms of lower-level subroutines, which in turn are implemented from even lower-level subroutines, until the point where the problems have been partitioned sufficiently that solutions can be expressed directly using primitive statements from the programming language.
The use of subroutines provides multiple advantages in software development. First, complexity is reduced locally: The number of logical steps required to solve a given problem can be kept small because the solution can be expressed in terms of higher-level operations implemented in subroutines instead of in terms of the low-level primitives defined directly by the programming language. Second, complexity is reduced globally: Because subroutines encapsulate groups of statements, programmers can often reason about the interaction of subroutines rather than the interaction of individual statements across the program. Without this encapsulation, it would be very difficult to implement large-scale software systems. Third, subroutines allow for code reuse: Once a solution to a sub-problem has been implemented and made available as a subroutine, it can be used as a building block for solving many different problems; this greatly reduces the time required to implement software, since it is not necessary to start from scratch each time. It also reduces the size of programs, since general-purpose subroutines need only be provided once even though they are used in multiple places.
For all of these reasons, and more, the use of subroutines has become fundamental to software engineering. As a result, during execution of programs written in this manner, computers will execute a large number of subroutine calls and returns.
Consider now how subroutines may be implemented on contemporary computers. In other words, consider programming language implementation.
Most of the time, programmers write software in high-level programming languages such as Cobol, Fortran, Modula-2, C, C++, or Java. All of these languages provide subroutines in some form. While the details vary in terms of both syntax and semantics (especially with respect to parameter passing), many similarities remain. In particular, all these languages provide a “last-in, first-out” (LIFO) ordering on subroutine calls and returns: the last subroutine to have been called will be the first one to return. For example, let A, B, and C denote subroutines and suppose that A calls B, and B calls C. If a “return-from-subroutine” statement is executed, it will terminate the execution of subroutine C (the one called most recently) and execution will continue in subroutine B at the point that immediately follows the invocation of C. Later, a return statement in B may terminate B's invocation and take execution back to subroutine A.
Because subroutine execution respects this LIFO order, an efficient implementation can be realized by using a push-down stack. With this well known implementation technique, a subroutine invocation, such as A calling B, is performed in two steps. First, the return address is pushed onto the stack. Second, the program counter is updated to indicate the first statement of subroutine B, that is, execution “jumps” to the beginning of B. The execution of B now proceeds without regard to where it was invoked from. Eventually, a return statement in B will be encountered or, equivalently, the last statement in B will complete. In either case, to return back to its caller, subroutine B need only perform a single step: it pops the top-most item from the stack, which will be the address to which it should return, and places this value in the program counter register. Now, instruction fetching and execution will continue from the point in the caller (A in the example) that follows the call to B.
The use of a stack provides a high degree of generality. A subroutine, such as B, can correctly return back to its caller A, even if B, during its execution, performs further subroutine invocations. For instance, if A calls B, then the stack will contain “A” when the execution of B commences. Now, if B later calls C, then the stack will contain two return addresses “A; B” where B is the most recent (top-most) item. When C returns, it will pop the topmost item from the stack (B) leaving just “A” on the stack. This is the same state as before the call of C in B, so following the invocation and completion of subroutine C, B can execute to completion and return back to A by popping the return address from the stack in the usual manner. (Merely for the sake of notational simplicity, one may equate return addresses with the caller subroutine; in actual implementations, the return addresses must indicate the precise statement within the caller to which execution should return. Often, but not necessarily, this return address will be represented as a memory address.)
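The A, B, C example above can be traced with a minimal sketch. It follows the same simplification as the text, equating each return address with the name of the caller; the helper names `call` and `ret` and the `trace` list are illustrative assumptions, not part of any described system.

```python
# Sketch of LIFO return-address handling (all names are illustrative only).
stack = []   # holds return addresses; the end of the list is the top of the stack
trace = []   # snapshots of the stack contents at interesting points

def call(return_address):
    """Step 1 of a call: push the return address (the jump itself is implied)."""
    stack.append(return_address)

def ret():
    """A return: pop the top-most item, which is the address to resume at."""
    return stack.pop()

call("A")                    # A calls B
trace.append(list(stack))    # stack holds "A"
call("B")                    # B calls C
trace.append(list(stack))    # stack holds "A; B"
resume_after_c = ret()       # C returns into B
trace.append(list(stack))    # stack holds just "A" again
resume_after_b = ret()       # B returns into A
```

After the return from C, the stack is in the same state as before B called C, which is why B can still return correctly to A.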
The stack implementation of subroutine calls and returns also allows a subroutine to invoke itself. This ability leads to a powerful programming technique known as recursion, and makes possible elegant solutions to a number of problems. During recursive subroutine invocations, the stack will contain a repeated sequence of return addresses “ . . . A; A; . . . ; A”, but more general patterns involving mutual recursion between two or more different subroutines can also be handled in the stack implementation of calls and returns.
Fundamental to many programming languages and uses of subroutines are the concepts of local state and parameters. Because these concepts are well understood in the art, they are discussed only briefly here. Most subroutines make use of local variables for carrying out their computations. In many implementations, it is desirable to allocate storage for local variables on the same stack that holds the return addresses. The stack allocation provides two advantages. First, storage is only committed to a subroutine's variables when the subroutine is active. Second, recursive subroutines can have multiple instances of these variables (one per invocation), thereby preventing awkward interference that would result if recursive invocations were to share the local variables. In addition to supporting local variables, subroutine invocation mechanisms will often also provide some way to pass parameters from the caller to the called subroutine. Simplifying slightly, one may think of these parameters as a form of local variables that are initialized by the caller.
It is common to handle all these facets of subroutine invocation by using the concept of activation records (frames). An activation record is a consecutive range of storage on the stack. The activation record contains fields for the return address, the parameters and the local variables. Each subroutine invocation will push one activation record on the stack, and each subroutine return will pop one activation from the stack. Activation records appear and disappear in the same LIFO order as subroutine invocations begin and end.
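As a rough illustration of activation records, the following sketch keeps an explicit stack of records while computing a factorial recursively. The `ActivationRecord` class, its field names, and the choice of factorial are assumptions made purely for illustration; real implementations lay these fields out in machine memory.

```python
from dataclasses import dataclass, field

@dataclass
class ActivationRecord:
    """One frame: return address, parameters, and local variables."""
    return_address: str
    parameters: dict
    local_variables: dict = field(default_factory=dict)

stack = []       # explicit stack of activation records
max_depth = 0    # deepest nesting observed, for illustration

def factorial(n, return_address="main"):
    global max_depth
    # Each invocation pushes exactly one activation record.
    stack.append(ActivationRecord(return_address, {"n": n}))
    max_depth = max(max_depth, len(stack))
    frame = stack[-1]
    if n <= 1:
        frame.local_variables["result"] = 1
    else:
        # The recursive invocation gets its own record, so its local
        # "result" does not interfere with this invocation's "result".
        frame.local_variables["result"] = n * factorial(n - 1, "factorial")
    # Each return pops exactly one activation record.
    return stack.pop().local_variables["result"]

result = factorial(4)
```

Because every invocation owns a separate record, the four nested invocations each hold an independent `result` variable, and the stack is empty again once the outermost invocation returns.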
Without loss of generality, and for the sake of clarity, one may think of and refer to subroutine invocations as pushing and popping simple program counters rather than full activation records. This convention is followed below.
Non-stack-based techniques also exist for implementing subroutine calls, but they have limitations that make them less desirable than the stack approach, except when special circumstances call for their use. In one alternative technique, for example, the caller subroutine writes the return address into a known location that is associated with the called subroutine. The called subroutine then performs a return by setting the program counter to the value found in that known location. Since there is only one such location per subroutine, recursion cannot be supported. In another alternative, subroutine invocations construct a “linked list” of activation records in an object heap. In this case, recursion can be supported, but the costs of heap-allocating and reclaiming the activation records tend to be higher than the costs of using a stack. Unless there are other compelling reasons for using heap allocation, the stack approach is therefore usually considered to be superior.
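The limitation of the first alternative can be seen in a small sketch, in which each subroutine has exactly one known location for its return address. The dictionary `return_slot` and the helper names are hypothetical stand-ins for that per-subroutine location.

```python
# Sketch: one fixed return-address location per subroutine (illustrative names).
return_slot = {}

def call(callee, return_address):
    # The caller writes the return address into the callee's single slot.
    return_slot[callee] = return_address

def ret(callee):
    # The callee returns by reading its slot; the slot is not a stack.
    return return_slot[callee]

call("B", "A")       # A calls B: B's slot holds "A"
call("B", "B")       # B calls itself: the slot is overwritten with "B"
inner = ret("B")     # inner invocation returns into B, as intended
outer = ret("B")     # outer invocation should return into A, but "A" is lost
```

The second return yields "B" rather than "A", because the recursive call overwrote the only copy of the original return address.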
Because the stack implementation offers significant advantages, it has found widespread use. In turn, and because of this frequent use, most modern CPUs implement direct support for stack-based calls and returns in their machine-code language. These low-level (“hardware”) call and return instructions are designed to facilitate translation of high-level programming language subroutine invocations and returns into their low-level counterparts. While the hardware instructions do not implement the full semantics of subroutine invocations as found in many high-level languages and the translation therefore requires the use of additional instructions (for example, for passing parameters), the hardware support for calls and returns nonetheless ensures that subroutine invocations can be implemented very efficiently. Indeed, programmers have come to rely on efficient subroutine mechanisms, thereby completing the feedback cycle from programming style to hardware design and back to programming style.
There are certain situations, however, in which hardware call and return instructions cannot easily, if at all, be used directly to implement subroutine calls and returns. One such situation arises in the context of binary translation.
To understand binary translation, one must distinguish between programs in “source” form, as opposed to “binary” form. The source is the representation that programmers create, modify and extend. Binary programs, on the other hand, result from translation of source into a form optimized for execution. Usually, the binary form is densely encoded and non-textual (from which it derives its name), comprising bit-patterns that represent machine instructions. Besides the textual versus non-textual representation difference, symbolic names in source code may be replaced by absolute memory addresses in binary code, comments found in source code may be absent from binary code, and other information that is inessential for execution may be removed. The process by which binary code is obtained from source code is usually fully automatic and is known as compilation for high-level source languages and assembly for low-level machine code source languages (“assembler languages”).
Consider now what would happen if the need were to arise to execute a given program on a platform different from the one for which it was originally developed. There may be several reasons why this could happen, for example, the original platform may no longer be available or economically attractive. Ordinarily, the program would need to be “ported” to the new platform. For programs written directly in the machine code of the original platform, porting may entail an almost complete rewrite of the program since the source program may be intimately tied to the original platform. Thus, the porting effort may be substantial and costly.
The situation is somewhat better for programs written in a high-level language and subsequently compiled into binary form. Often, in this case, the bulk of the source code needs only a few modifications before it can be recompiled for the new platform. On the other hand, recompilation rarely accomplishes 100% of the porting task; several things can get in the way.
Most programs depend on other software, including systems software, for performing basic tasks like file input and output, or application software libraries such as graphical user interfaces. Sometimes, these libraries are unavailable on the new platform and thus require the porting effort to extend beyond the core piece of software. In other cases, parts of the source code for the original application may have been lost, or over time may have become outdated as problems were corrected and extensions added to the software by “patching” the binary program. In yet other cases, no compiler may be available on the new platform for the source programming language. The porting effort, then, must include a source-to-source transformation, the porting of a compiler for the programming language, or a rewrite of the program. Thus, it may be appreciated that, in many cases, the costs of porting a program from one platform to another may be substantial, even if secondary effects such as the need to retest and validate the software on the new platform are ignored.
In this situation, binary translation may be an attractive alternative to program-for-program porting. In a binary translation system, a piece of controlling software, namely, the binary translator, is placed between the hardware of the new platform and the binary of the program for the old platform. Stated simply, the binary translator will translate an old-platform binary program instruction-by-instruction into equivalent instructions for the new platform, in some implementations also interleaving the translation process with the execution of the resulting new-platform instructions. At the loss of some efficiency due to the binary translation process, this provides the general ability to execute old-platform binaries in an unmodified (and unported) form on the new platform.
Research prototypes as well as commercially available binary translators have been built for a number of systems, including the FX!32 translator from Compaq/DEC, which allows execution of Intel x86 binaries on an Alpha processor, and the MAE system, which allows execution of Macintosh programs on Solaris/SPARC platforms. Binary translation has also been used to allow older Macintosh 68K programs to execute on newer PowerPC Macintosh computers. Perhaps the most common use of binary translation is found in high-performance Java virtual machines, which translate Java byte-code into instructions that can be executed directly by the underlying hardware. Representative articles describing binary translation include:
“The Design of a Resourcable and Retargetable Binary Translator,” Cristina Cifuentes, Mike Van Emmerik, Norman Ramsey, Proceedings of the Sixth Working Conference on Reverse Engineering, Atlanta, USA, October 1999, IEEE-CS Press, pp. 280-291;
“Compiling Java Just in Time,” Timothy Cramer, Richard Friedman, Terrence Miller, David Seberger, Robert Wilson, and Mario Wolczko, IEEE Micro, May/June 1997;
“DAISY: Dynamic Compilation for 100% Architectural Compatibility,” Kemal Ebcioglu and Erik R. Altman, 24th Annual International Symposium on Computer Architecture, Denver, Colo., June 1997, pp. 26-37; and
“Binary Translation,” Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson, Communications of the ACM, 36(2), February 1993.
Binary translation offers valuable capabilities even when employed within a single hardware platform, that is, when the input and output instruction set of the binary translator are identical.
Different binary translators may offer a variety of capabilities and make different assumptions about the input binary programs. For example, one class of binary translators, to which the aforementioned FX!32 and the Java translators belong, assumes that the code being translated is at “user level,” containing only code that executes in a restricted mode on the CPU and adhering to the (binary) application program interfaces (APIs) defined by the combination of the hardware and operating systems of the platform. Other binary translators, such as HP's Dynamo optimizing binary translator, make the further assumption that programs use no reflective operations, ruling out actions such as explicitly accessing or manipulating the return address entries on the stack by means other than performing subroutine invocations and returns.
Essentially, the more assumptions the binary translator makes, the fewer restrictions remain on how the binary program can be transformed in the translation step, allowing for higher performance. In an ideal world, assumptions would be unnecessary, and analysis could extract facts about the program being processed, thereby allowing maximal efficiency within the constraints set by the behavior of each program. The present state of the art in binary code analysis, however, provides only limited capabilities and often incurs considerable analysis costs. Thus, the differentiation between binary translators that make fewer versus more assumptions is justified, at least for the time being.
Binary translators that make no assumptions about the behavior of the translated program may be termed “unrestricted.” Such unrestricted translators generally need to fully preserve the illusion that the binary program is executing on the original platform, despite the fact that binary translation is being used. When no assumptions about the behavior of the program are made, this generally requires that the binary translator should faithfully preserve all data structures in memory as they would have appeared, had the program been executing on the original platform.
Consider an unrestricted binary translator that processes a sequence of instructions in an input binary language (IL) generated by a guest system into a corresponding sequence of instructions in the output binary language (OL) of a host system. For example, the IL might be the instruction set specified by the SPARC v9 architecture (see “The SPARC Architecture Manual,” David L. Weaver, Tom Germond (Eds.), PTR Prentice Hall, Englewood Cliffs, N.J., 1994), and the OL might be the instruction set specified by the Intel Pentium architecture, commonly referred to as “x86” (see “Pentium Pro Family Developer's Manual,” Volume 1-3. Intel Corporation, 1996).
Note that it is also possible, although not essential to this invention, for the IL and OL to be the same language. In other words, the IL and OL may express the same or substantially the same instruction set; moreover, either the IL or OL might be a subset of the other. For the sake of clarity and to minimize the notational burden, without loss of generality, it is assumed in the following discussion that IL and OL both refer to x86-like languages. While actual computer systems may provide instruction sets that differ in some ways, the semantics used below for IL and OL call and return instructions are representative of almost all modern instruction set architectures (ISAs); those skilled in the art will easily be able to apply the teachings of the various aspects of the invention to any given IL and OL.
Now recall the effect and possible translations of call and return instructions in the IL language. The most common form of call instruction is as follows (text after a semicolon “;” is a comment):
call P          ;call the subroutine that begins at address P
R: <some IL     ;instruction following call is at address R. This is the
   instruction> ;instruction to be executed after return from the call to P
When executed, this call instruction will:
1) Push the address R of the following instruction onto the stack.
2) Set the program counter (PC), which on x86 platforms is named %eip, to the address P.
FIG. 1 illustrates the contents of the stack before this call. FIG. 2 illustrates the contents of the stack after execution of the call instruction. In the x86 instruction set, “%esp” designates the top-of-stack pointer register and stacks grow from higher toward lower addresses.
In other words, in the x86 ISA, the effect of the call instruction is to push the return address R onto the stack. Now the subroutine at address P executes, possibly making use of the stack to hold temporary data or make further calls, that is, possibly pushing additional items onto the stack (but ordinarily never allowing the top of the stack to recede back over the cell containing “R”). By the time that the subroutine at P has completed and is ready to return, the stack must have returned to the state shown in FIG. 2. To return, the subroutine P executes:
ret             ;return to the caller of this subroutine
This will pop the topmost element from the stack into the program counter %eip, that is, it will set %eip equal to R and update %esp so that the stack becomes as illustrated in FIG. 3.
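The effect of the call and return instructions on %esp, %eip, and the stack can be modeled with a short sketch. The `Machine` class and the concrete addresses chosen for P, R, and the initial stack pointer are arbitrary assumptions for illustration; the model assumes four-byte words and a stack that grows toward lower addresses, as on x86.

```python
WORD = 4  # bytes per stack cell on 32-bit x86

class Machine:
    """Toy model of the registers and memory touched by call and ret."""
    def __init__(self, esp):
        self.esp = esp      # top-of-stack pointer
        self.eip = 0        # program counter
        self.memory = {}    # sparse memory: address -> word

    def call(self, target, return_address):
        # 1) Push the return address (the stack grows toward lower addresses).
        self.esp -= WORD
        self.memory[self.esp] = return_address
        # 2) Set the program counter to the subroutine's address.
        self.eip = target

    def ret(self):
        # Pop the top-most word into the program counter.
        self.eip = self.memory[self.esp]
        self.esp += WORD

P, R = 0x2000, 0x1005       # hypothetical addresses
m = Machine(esp=0x8000)     # hypothetical initial stack pointer
m.call(P, R)
after_call = (m.esp, m.memory[m.esp], m.eip)
m.ret()
after_ret = (m.esp, m.eip)
```

After the call, the top-of-stack cell holds R and execution is at P; after the return, the stack pointer is back at its original value and execution resumes at R.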
An unrestricted binary translator will generally have to translate an IL-call instruction into a sequence of OL-instructions that have the same effect on the stack as the IL-call would have had, had it executed directly on the IL platform. For example, the call may be translated like this:
call P   →   push R
R:           jmp P′
This translation of the call is very efficient: It causes only a minimal slow-down (two OL instructions versus one IL instruction) and faithfully preserves all IL state. In this and the following examples, the arrow denotes translation of an IL instruction into one or more OL instructions. Single quotes (′) are used to indicate addresses in the output (translated) domain. In the example above, the subroutine at address P in the IL domain has been translated into OL instructions that are placed at address P′.
In general, unrestricted binary translators must assume that the program being translated may inspect its own code, so the translator places the OL code at a different address than the IL code in order to keep the IL code accessible and unchanged. One way to ensure this is for the binary translator to store the OL instructions in a translation cache located in an area of memory that is isolated from the original program, for example, one that lies beyond the program's addressable memory limits. Moreover, OL-instruction sequences may be longer than IL-sequences, so even if no self-inspection takes place, lack of space may rule out placing OL instructions at the original IL addresses.
Note that to faithfully preserve all state in memory, including the stack, the translated instructions must push the untranslated return address “R” onto the stack: Before returning from the subroutine call, the IL program might execute instructions to inspect the value at the top of the stack. Since this stack location would contain the return address R absent binary translation, it must also contain R even with binary translation.
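The translation of a call into “push R; jmp P′” may be sketched as follows. The tuple encoding of OL instructions, the `il_to_ol` dictionary, and the concrete addresses are assumptions for illustration only; the point is that the stack receives the untranslated address R while control transfers into the translated domain.

```python
def translate_call(il_target, il_return, il_to_ol):
    """Emit the two OL operations that replace one IL call instruction."""
    return [
        ("push", il_return),            # push the untranslated return address R
        ("jmp", il_to_ol[il_target]),   # jump to the translated subroutine P'
    ]

def run(ol_ops, stack):
    """Tiny interpreter for the two OL operations used in this sketch."""
    pc = None
    for op, operand in ol_ops:
        if op == "push":
            stack.append(operand)
        elif op == "jmp":
            pc = operand
    return pc

il_to_ol = {0x2000: 0x90002000}   # hypothetical P -> P' mapping
stack = []
ops = translate_call(0x2000, 0x1005, il_to_ol)
pc = run(ops, stack)
```

A program that later inspects the top of its stack sees 0x1005 (the IL address R), exactly as it would without binary translation, even though execution continues at the OL address P′.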
Consider now the translation of a return. A return has two effects: First, it sets the program counter %eip to the value at the top of the stack; second, it pops this value by updating the stack pointer so that it points to the next higher (or lower, as the case may be) address on the stack. In other words, %esp := %esp+1. In the case of the x86 architecture, in which each word is four bytes long, the actual instruction would be %esp := %esp+4. Incrementing by one is assumed in this discussion merely for the sake of simplicity. The actual amount by which the stack pointer is incremented (or, equivalently, decremented, depending on the architecture) will of course depend on the architecture for which the invention is implemented. The second effect (updating %esp) can be easily achieved in the translated domain OL.
Achieving the first effect is, however, harder, because it is necessary to set the machine's program counter to the translated return address R′. But the top of the stack, by the above translation of the call, does not contain R′, but rather R. If the system were to set % eip to R, then execution would incorrectly proceed to execute untranslated IL code after the return. The code produced by the translator for returns must therefore map the IL return address found on the top of the stack to an OL return address. This requires a translation of returns of this schematic form:
ret --->  save scratch registers %eax, %ebx, %flags
          pop %eax                      ;pop IL return address
          %eax := ILToOLAddress(%eax)
          store (Return_Target), %eax
          restore scratch registers %eax, %ebx, %flags
          jmp (Return_Target)
Here, “ILToOLAddress( )” is a place-holder for an OL instruction sequence that maps an IL address to an OL translated address. This instruction sequence can be long. In order to perform the mapping efficiently, one or more registers may be required. Before registers can be used, however, their current contents must be saved to memory so that after the temporary needs of the return translation have been fulfilled, the registers can be restored to the values that the binary program expects. To illustrate, assume that two registers, %eax and %ebx, and the processor status register, %flags, will be used by the return translation sequence. Mathematically, the mapping from IL to OL addresses can be represented as a set of IL/OL address pairs, with the further property that there is at most one OL address associated with any given IL address. To map an IL address to an OL address, the system locates the unique pair whose first component is that IL address; the sought OL address is then the second component of that pair.
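The schematic return translation can be mimicked in a short sketch. Here a Python dictionary stands in for the set of IL/OL address pairs (a dictionary enforces at most one OL address per IL address); the register dictionary, the function name `translated_ret`, and the addresses are illustrative assumptions rather than any actual translator's code.

```python
# IL -> OL address pairs; hypothetical addresses.
il_to_ol = {0x1005: 0x90001005, 0x1100: 0x90001100}

def translated_ret(stack, registers):
    """Follow the schematic OL sequence emitted for one IL ret."""
    saved = dict(registers)                        # save scratch registers
    registers["eax"] = stack.pop()                 # pop IL return address
    registers["eax"] = il_to_ol[registers["eax"]]  # %eax := ILToOLAddress(%eax)
    return_target = registers["eax"]               # store (Return_Target), %eax
    registers.update(saved)                        # restore scratch registers
    return return_target                           # jmp (Return_Target)

stack = [0x1005]
registers = {"eax": 7, "ebx": 9, "flags": 0}
target = translated_ret(stack, registers)
```

The lookup is a single expression here, but in generated OL code it expands into an instruction sequence, which, together with saving and restoring the registers, is where the cost of this approach arises.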
Standard techniques for implementing mappings from one set of values (for example, IL addresses) to another set of values (for example, OL addresses), use data structures such as hash tables, binary trees, or—for small mappings—flat arrays searched linearly. The problem with all these data structures is that even though they are optimized, they are still relatively slow when used in the translation of return instructions: A single IL return instruction is converted into a sequence of OL instructions that, among other things, perform the mapping from IL to OL addresses. Because of this expansion, whereas the original IL return may execute in just a handful of CPU cycles, the OL sequence could require dozens of cycles. Moreover, since subroutine calls and returns are very frequent, the result is a noticeable slowdown when programs execute in a binary translation system.
Other techniques for implementing control-flow changes, including returns, involve variations of a technique known as inline caching, which was first introduced by Deutsch and Shiffman in “Efficient Implementation of the Smalltalk-80 System,” Conference Record of the Eleventh Annual ACM Symposium on Principles of Programming Languages, pp. 297-302, Salt Lake City, Utah, 1984. According to these techniques, at the control-flow transfer site (for example, at the site of a translated return), the last translated target to which the transfer went is cached. When sufficient locality exists, such that transfers repeatedly go to the same target (or targets), these inline caches can yield very high performance.
Empirical studies have indicated, however, that these techniques are prone to high miss rates when employed for returns, at least for some code. When the miss rates become too high, performance will be dominated by the slower backup strategies that handle misses, which potentially cause more performance to be lost in the “miss” cases than was gained over the conventional solution in the “hit” cases. For example, inline caches will tend to miss when multiple callers alternate to call a subroutine, because every return would be transferring back to a different target than the previous time.
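The alternating-caller behavior can be reproduced with a minimal sketch of a one-entry inline cache at a single translated return site. The class name, the hit and miss counters, and the addresses are assumptions for illustration; the slow path is reduced to a dictionary lookup.

```python
class InlineCachedReturn:
    """One translated return site that caches its last IL/OL target pair."""
    def __init__(self, il_to_ol):
        self.il_to_ol = il_to_ol   # backup mapping used on a miss
        self.cached_il = None
        self.cached_ol = None
        self.hits = 0
        self.misses = 0

    def resolve(self, il_return):
        if il_return == self.cached_il:
            self.hits += 1                   # fast path: reuse cached target
            return self.cached_ol
        self.misses += 1                     # slow path: full mapping
        self.cached_il = il_return
        self.cached_ol = self.il_to_ol[il_return]
        return self.cached_ol

mapping = {0x1005: 0x90001005, 0x1100: 0x90001100}   # hypothetical addresses

same_caller = InlineCachedReturn(mapping)
for _ in range(6):
    same_caller.resolve(0x1005)              # one caller, six returns

alternating = InlineCachedReturn(mapping)
for i in range(6):
    alternating.resolve(0x1005 if i % 2 == 0 else 0x1100)   # two callers alternate
```

With a single caller the site misses once and then hits five times; with two alternating callers all six returns miss, since each return targets a different address than the one cached by the previous return.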
Instead of an inline cache, some systems use a hash table to return from subroutine calls. A significant drawback of this solution is that the code sequence needed to probe the table is often even longer than the code sequence required to deal with inline caching.
Yet another prior art technique involves using a second stack separate from the primary stack introduced above. To avoid confusion, one may refer to this second stack as the “shadow” stack. Commonly, shadow stacks have been employed in hardware, where they are often denoted by names like “on-chip” return stacks. In principle, however, they can also be implemented in software, which is the form described here.
A shadow stack is stored in a memory area separate from the primary stack and has its own stack pointer. The shadow stack may be of smaller capacity than the primary stack, in which case it will occasionally be unable to provide assistance when returning. When used in binary translation, the shadow stack is hidden from the program being executed using binary translation. Similarly, in hardware designs, the “on-chip” return stack is often non-architected state, meaning that there are no instructions to manipulate it directly.
Using a shadow stack, one may translate a call into the following schematic sequence:
push IL return address R on primary stack
push OL return address R′ on shadow stack
jump to entry point of the translated subroutine
Returns may be translated to:
pop IL return address R from the primary stack
pop OL return address R′ from the shadow stack
verify that the IL and OL items “match”, that is, that the R′ popped from the shadow stack corresponds to the R popped from the primary stack
jump to the OL return address R′ obtained in the second step
The underlying assumption here is that testing whether IL and OL addresses correspond to each other can be done faster than computing one (OL) from the other (IL). One way to support fast verification of the correspondence between R and R′ is to push both R (IL) and R′ (OL) onto the shadow stack as a pair. Because stack operations are fast, this is an efficient way to ensure that the “correct” IL/OL pairings are available to the system. The following schematic instruction sequence can be used to accomplish this:
push IL return address R on primary stack
push OL return address R′ on shadow stack
push R on the shadow stack
jump to entry point of the translated subroutine
Returns may be translated to:
pop IL return address R from the primary stack
pop X from the shadow stack
pop X′ from the shadow stack
verify that X=R and
    if so, then jump to X′, which will be the correct R′
    if not, then map the IL return address R to the correct OL address and jump to the OL address R′
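The paired call and return sequences above can be modelled in a few lines. The following is only an illustrative sketch, not code from any of the cited systems: addresses are plain integers, and the names IL_TO_OL, translated_call and translated_return are hypothetical, with IL_TO_OL standing in for the slower IL-to-OL mapping that serves as the back-up path.

```python
# Illustrative model only: addresses are plain integers, and IL_TO_OL stands in
# for the translator's (slow) IL-to-OL address mapping. All names are hypothetical.

IL_TO_OL = {0x1004: 0x9000, 0x2008: 0x9100}  # assumed mapping R -> R'

primary_stack = []   # visible to the translated program
shadow_stack = []    # hidden from the translated program


def translated_call(il_ret, ol_ret):
    """Call sequence: push R on the primary stack, then R' and R on the shadow stack."""
    primary_stack.append(il_ret)
    shadow_stack.append(ol_ret)
    shadow_stack.append(il_ret)
    # The jump to the translated subroutine is not modelled.


def translated_return():
    """Return sequence: gives (OL target, True if the shadow stack supplied it)."""
    r = primary_stack.pop()
    if len(shadow_stack) >= 2:
        x = shadow_stack.pop()        # X, expected to equal R
        x_prime = shadow_stack.pop()  # X', the OL address paired with X
        if x == r:
            return x_prime, True      # fast path: jump to X'
    return IL_TO_OL[r], False         # slow path: map R to R'
```

In this model, a call followed directly by a return takes the fast path. If the program overwrites its return address on the primary stack between the call and the return, the comparison fails and the return falls back to the mapping.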
The shadow stack scheme is particularly attractive in hardware because the shadow stack push and pop operations in the call and return sequences can run in parallel with the push and pop sequences on the regular stack. As a software technique, shadow stacks have certain disadvantages. For example, the cost of pushing items onto the shadow stack (in the call sequence) and popping items from the shadow stack (in the return sequence) can be substantial. It is usually not possible to dedicate a processor register to hold the shadow stack pointer, so this stack pointer frequently will be loaded from memory and saved back to memory. Moreover, boundary checks to prevent shadow stack overflow or underflow may also add costs.
Co-pending U.S. patent application Ser. No. 09/668,091, “Method and System for Implementing Subroutine Calls and Returns in Binary Translation Sub-systems of Computers,” (the '091 application) filed 22 Sep. 2000 by the inventor of the present application, discloses an improvement on the existing techniques described above and avoids many of their drawbacks.
Central to the scheme in the '091 application is a data structure—for example, an array of 64 cells—that makes up a return target cache (rtc). The contents of the return target cache array are OL return addresses; hashed IL return addresses are used to compute an index into the rtc. The translated code for a call stores a value into rtc[.] as a “hint” to help the return launch block “find” the right target. Expressed in typical opcode (pseudo-assembly code) form, a typical call to a procedure P, where the call's return address is R, is translated as follows in the '091 scheme:
IL          OL
call P →    push R
R:          rtc[R & 63] := R′
            jump P′
As before, P′ denotes the OL address corresponding to the IL address P. The expression R & 63 represents the bitwise AND of R and 63 and operates as a hash function. Because R is a compile-time constant when translating the call, the expression R & 63 can also be evaluated at compile time.
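As a small worked illustration of the hash R & 63, the sketch below computes the rtc index for two sample addresses. The constant RTC_SIZE, the function rtc_index and the sample addresses are hypothetical and chosen only for the example.

```python
RTC_SIZE = 64  # number of cells, as in the 64-cell example above


def rtc_index(il_return_address):
    """Hash an IL return address to an rtc index with a bitwise AND."""
    return il_return_address & (RTC_SIZE - 1)


a = 0x08048A13   # sample IL return address (hypothetical)
b = a + 64       # differs from a only above the low six bits

print(rtc_index(a), rtc_index(b))  # both addresses select the same cell
```

Because only the low six bits of the address are kept, any two return addresses that differ by a multiple of 64 select the same cell. This is why the value stored in the rtc can serve only as a hint.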
In the '091 system, the launch block (instruction sequence) that would be executed whenever a return instruction is encountered, expressed in x86 instructions, would be as follows:
Launch block:
IL          OL
ret →       save scratch registers %eax, %ebx and %flags
            pop %eax                    ; pop IL return address R
            mov %ebx, %eax              ; copy R to %ebx
            and %ebx, 63                ; %ebx = R & 63 (hash function)
            jmp (rtc_Base + c·%ebx)     ; jump to hinted target
where rtc_Base is the base address of the rtc (see FIG. 5) and the constant c is a scale factor applied to account for the array elements being more than one byte apart in address. Such address scaling is well understood in the art of translating higher level language constructs such as arrays (for example, rtc[R & 63]:=R′) into machine instructions.
The final jump in the above code directs execution to a confirmation sequence, which is associated with the call site. The launch block provides the return address in %eax, so the sequence of instructions comprising the confirm block simply verifies that it has the “right” place (since the confirm block is associated with a call site, “R” is a constant value):
IL          OL
call P →    push R                      ; push IL return address
R:          store (RTC), R′             ; set return target cache hint
            jmp P′                      ; jump to translated routine
            R′: cmp %eax, R             ; return to right place?
            jne Miss/Failure            ; jump to “back-up” code if wrong
            restore %eax, %ebx, %eflags
            . . .                       ; continue in translated code
            . . .
where the three instructions from push to jmp, inclusive, comprise the translated call and the instructions from cmp to restore, inclusive, comprise the confirm block proper.
Here, as is well known, “cmp” and “jne” are the x86 instructions for “compare” and “jump if not equal”; the other instructions are immediately obvious to those skilled in the art.
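The interplay of the translated call, the launch block and the confirm block can be modelled as follows. This is a minimal sketch under stated assumptions rather than the implementation of the '091 application: Python functions stand in for OL code addresses, register saving and restoring is omitted, and the names make_call_site, launch_block, miss_handler and IL_TO_OL are hypothetical.

```python
RTC_SIZE = 64
IL_TO_OL = {}  # hypothetical back-up mapping R -> R'


def miss_handler(r):
    """Back-up path: find the OL address the slow way."""
    return IL_TO_OL[r], "miss"


rtc = [miss_handler] * RTC_SIZE  # every entry initially points to the miss handler
stack = []                       # the program's (primary) stack


def make_call_site(il_ret, ol_ret):
    """Build the translated call for a call site whose return address R is constant."""
    IL_TO_OL[il_ret] = ol_ret

    def confirm(r):
        # Confirm block at R': compare the popped address with this site's R.
        if r != il_ret:
            return miss_handler(r)   # jne Miss/Failure
        return ol_ret, "hit"         # continue in translated code

    def call():
        stack.append(il_ret)                      # push R
        rtc[il_ret & (RTC_SIZE - 1)] = confirm    # rtc[R & 63] := R'
        # The jump to the translated routine P' is not modelled.

    return call


def launch_block():
    """Executed for a return: pop R, hash it, and jump to the hinted target."""
    r = stack.pop()
    return rtc[r & (RTC_SIZE - 1)](r)
```

For example, consider two call sites created with make_call_site(0x1013, 0x9000) and make_call_site(0x1053, 0x9100), whose return addresses hash to the same cell. If the second call is nested inside the first, the inner return is confirmed as a hit. The outer return then reaches the inner site's confirm block, fails the comparison, and is resolved by the miss handler.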
The mechanism described in the '091 application has certain other details and preferred features such as initializing rtc[·] so that all entries point to a miss handler, but the central idea is that a moderately sized array can be used to connect calls and returns. The IL instruction stream, and therefore the behavior-preserving OL instruction stream that results from translation, will “carry” the value needed (the IL return address R) to hash into the array from procedure entry to return on its stack. The above instruction sequence is more compact than what would be possible using a shadow stack to carry OL return addresses.
Although an improvement over the prior art, the mechanism described in the '091 application may not be optimal for every application. In particular, because it computes a hash of the return address, it requires a hash computation in the launch block. Moreover, it assumes the availability and use of two scratch registers (for example, %eax and %ebx), although this requirement may be relaxed in architectures other than the x86. Below, the term “mechanism for hashing return destination addresses” refers to the mechanism for returning from subroutines disclosed in the '091 application.
Increasing the relative rate of hits is not the only important consideration when designing a system for implementing subroutine calls in the context of binary translation. In some binary translation implementations, for example, the capacity of the translation cache (TC) may be limited. In such systems, it is important to choose a translation for calls and returns that can be expressed using short OL instruction sequences to avoid using too much of the TC space. In each of the prior art systems mentioned above, the designer must therefore also try to ensure that the OL instruction sequences generated by the binary translator are chosen so as to optimize TC space usage. This optimization will depend on the given OL instruction set architecture.
What is needed is a system and a method that enables translation of calls and returns in a manner that a) overcomes the slowness of traditional mappings from IL to OL addresses; b) is less prone than inline caches to high miss rates; c) preferably generates less code so that less of the translation cache is needed; and d) permits a more efficient translation of calls than does a software implementation of a shadow stack.
In some applications, it will be faster or otherwise preferable to reduce the path length of the return mechanism, to avoid the need for a hash computation in the launch block, and perhaps to make do with a single scratch register. It would therefore be good to have an alternate mechanism with these improvements, which still meets the needs a)-d) mentioned above. The present invention provides such a mechanism.