This application relates in general to run-time optimizers, and in specific to a mechanism for instruction profiling and trace selection.
A run-time optimizer is an adaptive software system that transparently optimizes applications at run-time. The optimizer rewrites the binary code of an application on-the-fly to achieve a higher execution efficiency.
FIG. 3 depicts prior art run time optimizer 300. The control loop 310 begins execution of a block of program code, via emulation performed by the profiling emulator 320. The profiling aspect of emulator 320 allows the control loop 310 to track the number of time the particular block of code has been executed via emulation. Note that a run time optimization system is different from a run time binary translation system, in that the latter is for architecture migration, while the former is to decrease execution time. The run time optimization system is using the emulator 320 for profiling in order to guide optimizations, i.e. the code is running on its native system. After a predetermined number of executions via emulation, the control loop 310 designates the block of code as hot code, and desirable for optimization. The control loop 310 then activates trace selector 330 to translate the block of code. The trace selector 330 forms a trace of the instructions that comprise the block of code by following the instructions in the block. When a branch instruction is encountered, the trace selector makes a prediction as to whether the branch is taken or falls through. If the selector decides the branch is mostly taken, then the trace is formed by extending the code from the branch target block. If the selector decides not to take the branch, then the branch falls through, and the trace continues within the fall through block. The trace terminates at a backward branch predicted to take or when the trace becomes sufficiently large. After the trace is formed, the newly formed trace is optimized for the current processor. The optimized code is then placed into the code cache 340. The next time the control loop 310 encounters a condition to execute this block of code, then the control loop 310 will execute the code in the code cache 340 and not emulate the code via emulator 320.
A problem with FIG. 3 is that an emulator is required to perform profiling, i.e. the emulated code is used to determine which code is hot. Emulation is very slow; usually 50-200 times slower than native execution speed. Consequently, there is a large time penalty for determining which code is hot. Moreover, the quality of optimization is often determined by the quality of the selected trace. Branch mis-prediction in trace construction is costly, for example predicting a branch not to be taken means the remainder of the block code is traced and optimized, and if mis-predicted then that tracing and optimizing of the code subsequent to the branch is wasted. Branch mis-prediction can be minimized by maintaining a long history of branching outcomes, which is formed by continually emulating the code block. Thus, the prior art run time optimization system (RTOS) either incurs a time penalty from emulation to build a good history, or incurs a time penalty from branch mis-prediction.
Another problem with the prior art RTOS is that it attempts to translate any code that is deemed hot based on a small threshold. This problem is referred to as complex and less reliable. There are some traces that are difficult to translate, for example, it is difficult to translate a trace with a branch in the delay slot of another branch, but without a translation, the execution of the trace would be performed by software simulation or emulation. Since emulation is slow, all hot code is translated. The requirement of translating all hot code, including all the difficult traces, increases the translation time and complexity. With this software based approach, it is rather difficult to come up with an ideal threshold value. A higher threshold would incur too much overhead while a lower threshold would end up with traces in poor quality, i.e. too many traces, and traces with early exits. Note that emulation time overhead is such that a higher threshold would require the code to be emulated much longer before they get a chance to be translated.
Consequently, it is difficult for the prior art RTOS to have long cache lines to hold traces. Long cache lines are desirable because the longer the cache line, the higher the cache hit rate. However, this is difficult for the prior art RTS because traces are either inaccurately formed or require a large amount of overhead for profiling. Moreover, the traces that are formed are not reliable because of branch mis-predictions. For example, if the overhead for software based profiling takes 5% of execution time, then the run-time optimization must gain at least 5% of performance in order to break even.
The prior art RTOS has been described in terms of a pure software approach. However, another prior art approach of generating traces is in pure hardware form. However, this approach requires a great deal of complexity in the hardware to form the traces. Particularly, the hardware approach requires an additional cache for holding the trace, i.e. a trace cache. Refer to E. Rotenberg, S. Bennett, and J. E. Smith, xe2x80x9cTrace Cache: A Low Latency Approach to High-Bandwidth Instruction Fetch,xe2x80x9d Proc. Int""l Symp. MicroArchitcture, IEEE CS Press, Los Alamitos, Calif., 1996, which is incorporated by reference.
Therefore, there is a need in the art for RTOS that does not require emulation for profiling, produces reliable traces, and uses hot code to form traces.
These and other objects, features and technical advantages are achieved by a system and method which provides fast profiling and effective trace selection with some micro-architecture support. The instruction execution frequency and branch prediction information are collected at essentially no execution time cost. Therefore, the profiling and trace selection can have a much higher threshold and will select traces with a higher quality.
The inventive mechanism partitions the work between hardware and software. The hardware automatically detects which code is executed very frequently, e.g. which code is hot code. Hardware is better suited to this task than software, because software would require more overhead in making the determination, while hardware incurs essentially zero overhead. Moreover, since hardware executes the branch instructions, the hardware also keeps the branch history information to do branch prediction. When the hardware determines that a section or block of code is hot, e.g. hot code, the hardware sends a signal to the software, which is maintained in firmware. This signal informs the software and lets the software decide which trace to select. Because the trace latching is done by the software, it has more freedom and can handle more complex cases. Furthermore, the software may add some optimizations to hot code, and has the capability to form longer traces and better traces. Software is better suited for these tasks than hardware, because the software has more flexibility, less implementation complexity and is less expensive than hardware.
Therefore, it is a technical advantage of the present invention to have a hardware processor identify frequently used code and use software embedded in firmware predict and select traces from the frequently used code.
It is another technical advantage of the present invention to use the hardware for frequently used code identification and profiling, as hardware requires less overhead in making the determinations than software.
It is a further technical advantage of the present invention to use the software for trace prediction and selection, as software has more flexibility, less complexity and is less expensive than hardware.
It is a still further technical advantage of the present invention to use the instruction cache to maintain counters and branch information for the instructions stored in the instruction cache.
It is a still further technical advantage of the present invention to have the trace prediction and selection software be triggered by the counters in the instruction cache.
It is a still further technical advantage of the present invention to have the trace prediction and selection software make branch predictions based upon the branch history information.
It is a still further technical advantage of the present invention to have the trace prediction and selection software store the trace in trace memory.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and the specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.