Prior art systems for improving the performance of applications (i.e. run time optimizations) may be based on one of several approaches. One approach is to provide compilers known as "optimizers" or "optimizing compilers" which serve to improve the run time of a given application while it is executing. In prior art approach, the compiler software or runtime optimizer takes complete control of the existing binary. The runtime optimizer then determines which code is used most often (i.e. the "hot code"), translates the hot code, and emulates the remaining code. When branches are encountered in the emulated code, the target address is checked to see if there is a translation. If there is a translation, the system will execute the translated code instead of the original code.
In the prior art approaches to the running of an application on a computer, the binary of a computer application is not actually executed. This is because the emulator controls the process, such that the emulator actually emulates the existing binary and checks on each control transfer instruction (branch, call, return, etc.). However, this approach is problematic because the system needs to translate a lot of code to get the requisite speed. If the system has certain code and does not translate it, it will emulate that code, in which case a prior art system will spend too much time on the given task.
Such a system is expressed in FIG. 4. FIG. 4 depicts prior art run-time optimizer 130. The control loop 131 begins execution of a block of program code via emulation performed by the profiling emulator 132. The profiling aspect of emulator 132 allows the control loop 131 to track the number of times the particular block of code has been executed via emulation. Note that a run-time optimization system is different from a run-time binary translation system, in that the latter is usually for architecture migration while the former is to decrease execution time. The run-time optimization system is using the emulator 132 for profiling in order to guide optimizations, i.e. the code is running on its native system. After a predetermined number of executions via emulation, the control loop 131 designates the block of code as hot code, and thus, desirable for optimization. The control loop 131 then activates trace selector 133 to form a trace and then translate/optimize the formed trace. The trace selector 133 forms a trace of the instructions that comprise the block of code by following the instructions in the block. When a branch instruction is encountered, the trace selector makes a prediction as to whether the branch is taken or falls through. If the selector decides the branch is mostly taken, then the trace is formed by extending the code from the branch target block. If the selector decides not to take the branch, then the branch falls through, and the trace continues within the fall through block. The trace terminates at a backward branch predicted to be taken or when the trace becomes sufficiently large. After the trace is completed, the code is rewritten with machine dependent and machine independent optimizations. The optimized code is then placed into the code cache 134. The next time the control loop 131 encounters a condition to execute this block of code, then the control loop 131 will execute the code in the code cache 134 and not emulate the code via emulator 132.
As shown in FIG. 5, if the target of a branch which is taken to exit trace 1, as shown by branch instruction 141, then control is returned to the run-time system optimizer 130 and to control loop 131, which determines if the target resides in the code cache. If the target resides in code cache, then the control loop 131 modifies the target of the branch instruction 141 to be the trace 2 in 142 in code cache as shown by branch instruction 143. This modification is called backpatching. Thus, if the exit of the trace is already translated, then the branch is backpatched such that a subsequent execution will directly branch to the new trace without returning to the control loop. Backpatching increases the speed of execution of the code, because returning to the control loop significantly slows down execution time.
A particular problem with the prior art optimizer is that it cannot backpatch an indirect branch. The optimizer cannot backpatch an indirect branch because the target address is unknown. The target address is typically in a register or memory location and not written directly in code. Thus, the optimizer will shift control back to the control loop 141 to determine whether the target address has been translated, which is expensive in terms of time. The prior art has attempted to minimize this problem by inlining a code sequence to search a smaller lookup table in the optimized traces, however, these mechanism still incur high overhead. Without such a lookup, the branch will switch execution back to the original binary, and not return to trace execution. Moreover, if small table lookup fails then the execution will shift control back to the control loop, as described above. Examples of indirect branches are return branches and switch branches.
Another problem with the prior art optimizer is that it attempts to translate any code that is deemed hot based on a small threshold. A higher threshold would incur too much overhead while a lower threshold would end up with traces in poor quality, i.e. too many traces, and traces with early exits. Thus, it is rather difficult to come up with an ideal threshold value. Note that emulation time overhead is such that a higher threshold would require the code to be emulated much longer before they get a chance to be translated. Without a translation, the execution of the trace would be performed by software simulation or emulation. However, given that emulation is necessarily slow, all hot code is usually translated but this can present a problem, some traces are very difficult to translate. For example, it is difficult to translate a trace with branches in the delay slot of another branch. The requirement of translating all hot code therefore increases the translation time and complexity.
Hence, there are several drawbacks to this approach. This type of prior art system necessitates emulation of the code regions of the application that have not been translated, or otherwise requires the translation of every region the very first time it is encountered. This adds significant overhead because both emulation and translation are slow. Moreover, a system cannot effectively optimize every trace identified as "hot." Some traces in the prior art optimizer cannot be optimized to run better than the original sequence. Moreover, optimization of some traces results in a reduction of performance after the optimization because the approach of the prior art must handle indirect branches efficiently in the optimized traces. For example, in PA 8000 systems made by the Hewlett-Packard Corporation of Palo Alto, Calif. (USA), procedure return branches are usually mispredicted by hardware. An effectively selected trace, which successfully predicted the target of a return branch could save misprediction cost, say about 5 cycles. However, for other architectures, such as the implementation of the new IA-64 architecture, where return branches will often by correctly predicted by return branch stack, the overhead of handling return branches would cost more than a correctly predicted branch. Hence, the approach in prior art is more suitable for architectures with expensive return branches, but not suitable for architectures with inexpensive return branches.
The second problem with the prior art concerns the necessity for yet another table lookup. When a prior art type system handles indirect branches, there are several possibilities for resolving the branch. One possibility is to proceed back to the control loop, where a system will necessarily take a big hit in cycles because such a process takes many cycles just to complete the lookup on the table of hot code. As similarly detailed above, in order to minimize such a cost in cycles, software systems often use something like a small lookup table called a software cache 24 (FIG. 2). A smaller table is used so that the system need not go back to the control loop. This routine essentially locates a smaller table to see whether the system can garner the hit from the small table. However, even with such a provision, the prior art system still pays, on the average, approximately 10 cycles even if there is a hit. If there is no hit however, then 10 cycles will be wasted, and the prior art system still needs to go back and locate a big table for lookup. Such is the overhead associated with indirect branches.
Yet another problem is associated with the prior art approach. The prior art systems necessarily translate too much code. This is problematic because systems already have a lot of space taken by the existing binary and this necessitates the production of a lot of translated code. The particular system of the prior art further teaches the use of an emulator is necessary for the handling of a branch because a system needs to know whether there is a translation for the target address. Hence, it is necessary to have a specific element in control, so that every time an address is handled, the system can look into the existing mapping table to see whether there is translation. Using an emulator tailored for the purpose of emulating the code and figuring out which code is "hot" can actually serve two purposes. One for control in determining whether the system should continue emulating, or whether it should jump to the new code cache and execute the improved code. The second purpose is to profile (i.e., to figure out whether a given segment of code is "hot" or relatively unused). However, profiling in the prior art becomes problematic because some code segment executes say, 100,000,000 times, while other segment's code executes only a couple thousand times, thereby making it difficult to categorize whether the code which executes only 2,000 times as "not hot" simply because it executed "only" a few thousand times when compared to the multitude of code out there which might execute many millions of times. However, in trying to determine out whether the code is "hot" or not, a system cannot afford to consider just this type of relevant frequency.