This application relates in general to run-time optimizers, and in specific to hardware embedded run-time optimizers.
A run-time optimizer is an adaptive software system that transparently optimizes applications at run-time. The optimizer rewrites the binary code of an application on-the-fly to achieve a higher execution efficiency.
FIG. 2 depicts prior art run time optimizer 20. The control loop 21 begins execution of a block of program code, via emulation performed by the profiling emulator 22. The profiling aspect of emulator 22 allows the control loop 21 to track the number of times the particular block of code has been executed via emulation. Note that a run time optimization system is different from a run time binary translation system, in that the latter is for architecture migration, while the former is to decrease execution time. The run time optimization system is using the emulator 22 for profiling in order to guide optimizations, i.e. the code is running on its native system. After a predetermined number of executions via emulation, the control loop 21 designates the block of code as hot code, and desirable for optimization. The control loop 21 then activates trace selector 23 to translate the block of code. The trace selector 23 forms a trace of the instructions that comprise the block of code by following the instructions in the block. When a branch instruction is encountered, the trace selector makes a prediction as to whether the branch is taken or falls through. If the selector decides the branch is mostly taken, then the trace is formed by extending the code from the branch target block. If the selector decides not to take the branch, then the branch falls through, and the trace continues within the fall through block. The trace terminates at a backward branch predicted to take or when the trace becomes sufficiently large. After the trace is completed, the code is rewritten with machine dependent and machine independent optimizations. The optimized code is then placed into the code cache 24. The next time the control loop 21 encounters a condition to execute this block of code, then the control loop 21 will execute the code in the code cache 24 and not emulate the code via emulator 22.
As shown in FIG. 3, if the target of a branch which is taken to exit trace 1, as shown by branch instruction 31, then control is returned to the run time system RTS 20 and to control loop 21, which determines if the target resides in the code cache. If the target resides in code cache, then the control loop 21 modifies the target of the branch instruction 31 to be the trace 2 in code cache as shown by branch instruction 33. This modification is called backpatching. Thus, if the exit of the trace is already translated, then the branch is backpatched such that a subsequent execution will directly branch to the new trace without returning to the control loop. Backpatching increases the speed of execution of the code, as returning to the RTS significantly slows down execution time.
A problem with the prior art RTS is that it cannot backpatch an indirect branch. The RTS cannot backpatch an indirect branch because the target address is unknown. The target address is typically in a register or memory location, and not written directly in code. Thus, the RTS will shift control back to the control loop 21 to determine whether the target address has been translated, which is expensive in terms of time. The prior art has attempted to minimize this problem by inlining a code sequence to search a smaller look up table in the optimized traces, however, these mechanism still incur high overhead. Moreover, if small table lookup fails then the RTS will shift control back to the control loop, as described above. Examples of indirect branches are return branches and switch branches. This software approach adds an additional 10-100 s of cycles to the processing time.
Therefore, there is a need in the art for a RTS that can handle indirect branches without returning control to a control loop.
These and other objects, features and technical advantages are achieved by a system and method that uses a table to map branch targets that is built into the hardware as cache. Thus, when a fetch instruction is initiated, the IP-to-TM cache is examined to determine whether the branch target instruction has been optimized and placed into the trace memory. If there is a match with the P-to-TM cache, then the code in the trace is executed.
The P-to-TM cache is in the instruction fetch unit. This cache maps branch targets to optimized traces. This cache is examined in parallel with Instruction Translation Lookup Buffer (ITLB), if a match is found, the control will transfer to the optimized code. Otherwise, the execution control continues on the original code. This cache can significantly speed up the process of mapping a branch target to an optimized trace. Protection information will be included in the cache to enforce various protection needs typically served by ITLB.
The inventive mechanism eliminates the need to add a table look up to access separate pipeline stages, and thus increases the speed of the pipeline steps. This eliminates the need to use a software table look up and special handling of indirect branches.
When instructions are executed, the instructions typically have a virtual address. This virtual address needs to be translated into a physical memory address, particularly when assessing caches like the instruction cache or data cache. This translation is usually done by the ITLB. Thus, every time an instruction needs to be fetched, the ITLB needs to be examined. The inventive mechanism uses this requirement to perform a parallel lookup in the IP-to-TM cache.
The inventive mechanism first determines if the target address is a trace address, if so then the IP-to-TM cache does not need to be examined, and the instruction can be fetched directly from the trace memory. If the target address is virtual address in original code address, then the mechanism examines both the IP-to-TM cache and ITLB. If the address matches an entry in the IP-to-TM cache, then the instruction is fetched from trace memory. Note that if the IP-to-TM cache hits, then the ITLB should also hit, however, the ITLB hit is ignored. If the IP-to-TM cache misses, and the ITLB hits, then the instruction is fetched from original code and executed from there. If both the IP-to-TM cache and ITLB miss, then the mechanism invokes a hardware walker to load the correct translation into the ITLB.
Therefore, it is a technical advantage of the present invention to have the mapping from a branch to a trace stored in the IP-to-TM cache that is embedded into the hardware.
It is further technical advantage of the present invention that the IP-to-TM cache handles indirect branches much more efficiently than a software approach.
It is a further technical advantage of the present invention that the IP-to-TM cache and the ITLB are examined in parallel.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and the specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.