This application relates in general to run-time optimizers, and in particular to a hardware-embedded run-time optimizer.
A run-time optimizer is an adaptive software system that transparently optimizes applications at run-time. The optimizer rewrites the binary code of an application on-the-fly to achieve a higher execution efficiency.
FIG. 4 depicts prior art run-time optimizer 30. The control loop 31 begins execution of a block of program code via emulation performed by the profiling emulator 32. The profiling aspect of emulator 32 allows the control loop 31 to track the number of times the particular block of code has been executed via emulation. Note that a run-time optimization system is different from a run-time binary translation system, in that the latter is for architecture migration while the former is to decrease execution time. The run-time optimization system is using the emulator 32 for profiling in order to guide optimizations, i.e. the code is running on its native system. After a predetermined number of executions via emulation, the control loop 31 designates the block of code as hot code, and desirable for optimization. The control loop 31 then activates trace selector 33 to translate the block of code. The trace selector 33 forms a trace of the instructions that comprise the block of code by following the instructions in the block. When a branch instruction is encountered, the trace selector makes a prediction as to whether the branch is taken or falls through. If the selector decides the branch is mostly taken, then the trace is formed by extending the code from the branch target block. If the selector decides not to take the branch, then the branch falls through, and the trace continues within the fall through block. The trace terminates at a backward branch predicted to be taken or when the trace becomes sufficiently large. After the trace is completed, the code is rewritten with machine dependent and machine independent optimizations. The optimized code is then placed into the code cache 34. The next time the control loop 31 encounters a condition to execute this block of code, then the control loop 31 will execute the code in the code cache 34 and not emulate the code via emulator 32.
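The behavior of the prior art control loop described above can be illustrated with a minimal sketch. The class name, the threshold value of 3, and the string stand-in for an optimized trace are illustrative assumptions and do not appear in FIG. 4; only the count-then-translate flow is taken from the description.

```python
from collections import defaultdict

HOT_THRESHOLD = 3  # assumed value; the text only says "a predetermined number"


class PriorArtControlLoop:
    """Toy model of control loop 31: emulate, count, then translate hot blocks."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.exec_counts = defaultdict(int)  # profile gathered by the emulator
        self.code_cache = {}                 # block address -> optimized trace

    def run_block(self, addr):
        """Return how the block at `addr` was executed on this visit."""
        if addr in self.code_cache:
            return "code_cache"              # optimized trace, no emulation
        self.exec_counts[addr] += 1          # slow path: profiling emulation
        if self.exec_counts[addr] >= self.threshold:
            # Block is now hot: a real system would select and optimize a
            # trace here; a string stands in for the optimized code.
            self.code_cache[addr] = f"optimized trace for {addr:#x}"
        return "emulated"
```

In this sketch every execution before the block becomes hot pays the emulation cost, which is the penalty the following paragraphs discuss.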
As shown in FIG. 5, when a branch is taken to exit trace 1, as shown by branch instruction 41, control is returned to the run-time system (RTS) 30 and to control loop 31, which determines whether the target of the branch resides in the code cache. If the target resides in the code cache, then the control loop 31 modifies the target of the branch instruction 41 to be trace 2 42 in the code cache, as shown by branch instruction 43. This modification is called backpatching. Thus, if the exit of the trace is already translated, then the branch is backpatched such that a subsequent execution will branch directly to the new trace without returning to the control loop. Backpatching increases the speed of execution of the code, as returning to the RTS significantly slows down execution.
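Backpatching as described above can be sketched as follows. The `ExitBranch` class, the `take_exit` function, and the `"control_loop"` sentinel are hypothetical names chosen for illustration; the sketch only models the redirection of an exit branch once its target is found in the code cache.

```python
class ExitBranch:
    """A trace-exit branch; it initially returns to the control loop."""

    def __init__(self, original_target):
        self.original_target = original_target  # address in the original code
        self.target = "control_loop"            # not yet backpatched


def take_exit(branch, code_cache):
    """Follow an exit branch.

    Returns (destination_trace, entered_control_loop). The destination is
    None when the target has not been translated yet.
    """
    if branch.target != "control_loop":
        return branch.target, False             # direct trace-to-trace jump
    # Back in the control loop: look up the original target in the code cache.
    trace = code_cache.get(branch.original_target)
    if trace is not None:
        branch.target = trace                   # backpatch the branch
    return trace, True
```

The first exit through a branch still enters the control loop; only subsequent exits benefit from the patched target.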
A problem with the system of FIG. 4 is that an emulator is required to perform profiling, i.e. the emulated code is used to determine which code is hot. Emulation is very slow, usually 50-200 times slower than native execution speed. Consequently, there is a large time penalty for determining which code is hot. Moreover, the quality of optimization is often determined by the quality of the selected trace. Poor trace selection can be costly. For example, predicting a branch not to be taken means the remainder of the code block is traced and optimized, and if the branch is mispredicted, then the tracing and optimizing of the code subsequent to the branch is wasted. Branch misprediction can be minimized by maintaining a long history of branching outcomes, which is formed by continually emulating the code block. Thus, the prior art RTS either incurs a time penalty from emulation to build a good history or incurs a time penalty from branch misprediction.
Another problem with the prior art RTS is that it cannot backpatch an indirect branch. The RTS cannot backpatch an indirect branch because the target address is unknown. The target address is typically held in a register or memory location and is not written directly in the code. Thus, the RTS will shift control back to the control loop 31 to determine whether the target address has been translated, which is expensive in terms of time. The prior art has attempted to minimize this problem by inlining a code sequence to search a smaller lookup table in the optimized traces; however, these mechanisms still incur high overhead. Examples of indirect branches are return branches and switch branches.
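The prior art handling of an indirect branch can be sketched as a two-level lookup. The function name, the dictionaries, and the path labels below are illustrative assumptions; the sketch shows only that the target is resolved at run time and that a miss in the inlined table falls back to the control loop.

```python
def resolve_indirect_branch(target_addr, inline_table, code_cache):
    """Resolve an indirect branch whose target is known only at run time.

    Returns (trace, path). `trace` is None when the target is untranslated.
    """
    # Fast path: the small lookup table inlined in the optimized trace.
    trace = inline_table.get(target_addr)
    if trace is not None:
        return trace, "inline_lookup"
    # Miss: control returns to the control loop, which searches the code cache.
    return code_cache.get(target_addr), "control_loop"
```

Because the target cannot be patched into the branch itself, some lookup is performed on every execution of the branch, which is the overhead noted above.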
A further problem with the prior art RTS is that it attempts to translate any code that is deemed hot based on a small threshold. This makes the prior art RTS complex and less reliable. Some traces are very difficult to translate, but, without a translation, the trace would be executed by software simulation or emulation. Since emulation is slow, all hot code is translated. For example, it is difficult to translate a trace with branches in the delay slot of another branch. The requirement of translating all hot code increases the translation time and complexity.
A further problem with the prior art RTS is that it will handle only user code and not operating system (OS) code. This is because the RTS is layered between the user application and the OS, and thus will not handle privileged instructions and addressing modes. In the prior art, the RTS is attached to user processes. Since the prior art RTS cannot be attached to the OS, it does not handle OS code.
Therefore, there is a need in the art for an RTS that does not require emulation for profiling, can handle indirect branches without returning control to a control loop, can refuse translation of difficult code, and will handle OS code.
These and other objects, features and technical advantages are achieved by a system and method which embeds the control loop in hardware and, thus, does not require emulation for profiling, can handle indirect branches, will not translate difficult code, and will handle OS code. The inventive run-time optimization system (RTOS) places the control loop in the hardware and the translation/optimization components in the firmware, which are both below the OS level. Hence, the OS code can also be optimization candidates.
The inventive RTOS handles execution profiling and transfers execution to optimized traces automatically. This allows code to run at faster native speed instead of slower emulation. Since the code is running faster, the threshold for selecting a hot trace can be set much higher than in the prior art. This also avoids generating traces for relatively infrequent code paths. Moreover, a higher threshold enables the selection of better traces. Thus, when a processor desires to execute a block of instructions, the processor first examines the Icache to determine whether the block is present. If not, the block is moved from memory to the Icache. When the code is first moved into the Icache, a threshold value is set into a counter associated with the particular instruction or instruction bundle (a group of instructions that can be issued together in the same cycle) of the Icache. Each time the instruction or instruction bundle is executed and retired, the counter is decremented by one. When the counter reaches zero, a trap is generated and the instruction (or instruction bundle) is designated as hot code.
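The counter mechanism described above can be modeled in software for illustration, although in the inventive RTOS it is performed by hardware. The class and method names are hypothetical, and the behavior after the counter reaches zero (no further traps for that bundle) is an assumption, since the text does not specify it.

```python
class ICacheCounters:
    """Toy model of per-bundle hot-code counters kept alongside the Icache."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counters = {}  # bundle address -> remaining executions before trap

    def fill(self, addr):
        """Bundle moved from memory into the Icache: load the threshold."""
        self.counters[addr] = self.threshold

    def retire(self, addr):
        """Bundle executed and retired; return True if a trap is generated."""
        if addr not in self.counters:
            self.fill(addr)          # Icache miss: bring the bundle in first
        if self.counters[addr] == 0:
            return False             # assumed: already designated hot, no new trap
        self.counters[addr] -= 1     # decrement by one on each retirement
        return self.counters[addr] == 0
```

With a threshold of 3, for example, the third retirement of a bundle generates the trap that designates it as hot code.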
After the trap is generated to firmware, a trace selector forms a trace of the hot code. The trace is followed to determine the location of the target, i.e., the next instruction. The Icache maintains branch history information for the instructions in each cache line. This branch history is used to determine whether a branch should be predicted (and thus treated) as taken or as falling through. If the branch is predicted to fall through, then the subsequent instruction bundle is the next instruction. If the branch is predicted to be taken, the target instruction is the next instruction. After the trace is completed, it is optimized and stored into a trace memory portion of the physical memory. The mapping of the starting address of the original trace to the location of the optimized trace in Trace Memory (TM) is maintained in the IP-to-TM Table. The instruction fetch unit consults the IP-to-TM Table to decide whether execution should continue with an optimized trace in the TM. There is an IP-to-TM cache in the instruction fetch unit to speed up access to the IP-to-TM Table. The processor consults the IP-to-TM cache prior to examining the Icache. Therefore, upon subsequent execution of this code, the processor examines the IP-to-TM cache, which then points to the trace memory location. Thus, the code in the trace memory is executed instead of the original binary code. Note that if the code has not been optimized, the processor will execute the original code in the Icache. Note also that instructions from the TM are moved into the Icache before execution, not just the original code.
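The trace selection and fetch redirection described above can be sketched as follows. All names are hypothetical. The encoding of branch history as a taken fraction, the 0.5 cutoff, the `max_len` limit, and the rule of ending the trace at a backward branch predicted taken are assumptions; the last is borrowed from the prior art trace selector described earlier, since the text does not state a termination rule for the inventive selector.

```python
def select_trace(start, bundles, taken_history, max_len=8):
    """Follow predicted control flow from `start`; return the bundle addresses.

    bundles:       addr -> (fall_through_addr, branch_target or None)
    taken_history: addr -> fraction of executions in which the branch was taken
    """
    trace, addr = [], start
    while addr in bundles and len(trace) < max_len:
        trace.append(addr)
        fall_through, target = bundles[addr]
        taken = target is not None and taken_history.get(addr, 0.0) > 0.5
        if taken and target <= addr:
            break                    # backward branch predicted taken ends the trace
        addr = target if taken else fall_through
    return trace


def fetch(ip, ip_to_tm, trace_memory, original_code):
    """Consult the IP-to-TM mapping first, then fall back to the original code."""
    tm_addr = ip_to_tm.get(ip)
    if tm_addr is not None:
        return trace_memory[tm_addr]  # optimized trace exists for this IP
    return original_code[ip]          # not optimized: run the original code
```

In this sketch a single dictionary stands in for both the IP-to-TM Table and its cache; the fall-back path corresponds to executing non-optimized code natively from the Icache.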
Since the inventive RTOS uses hardware-managed trace address mapping, the complexity of backpatching may be eliminated. This also avoids reserving registers for passing arguments in the trampoline code used in backpatching, which may introduce a register spilling penalty. Furthermore, the inventive RTOS can significantly reduce the cost of handling indirect branches. Since the non-optimized code runs at native speeds, the indirect branch is allowed to execute, which returns control to native code. Note that a hardware (or processor) table lookup is significantly faster than a software (or emulator) table lookup. For example, a search of the IP-to-TM cache may require one cycle, whereas a software lookup of a table would require from 10 to 1000 cycles. The software lookup is expensive because the current architecture states must be saved before returning to the software RTS (Run-Time System).
The inventive RTOS uses hardware to directly process non-trace code, which significantly improves the reliability of the dynamic optimizer. The dynamic translator can choose not to translate some difficult traces and leave them unchanged in the original code, since this code will be executed at native speeds. Therefore, the time penalty for code that is not optimized is much lower than with the prior art software emulation.
Specifically, in the prior art RTS, a decision not to optimize code meant that the code would be executed at emulator speeds. Note that in the prior art, all hot code is optimized. Furthermore, reliability is improved because less code needs to be translated, and thus fewer problems from translation are introduced into the program application.
The inventive RTOS is controlled at the processor and at the firmware level, which is below the OS. Therefore, the inventive RTOS can handle OS code.
Therefore, it is a technical advantage of the present invention to have the run-time optimization system (RTOS) embedded into the hardware.
It is another technical advantage of the present invention that the embedded RTOS does not require software emulation for code profiling to determine hot code.
It is a further technical advantage of the present invention that the embedded RTOS can substantially reduce the cost of handling indirect branches.
It is a further technical advantage of the present invention that the embedded RTOS can elect not to translate difficult code, and run such code at native speeds.
It is a further technical advantage of the present invention that the embedded RTOS can handle OS code in addition to user application code.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and the specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.