Recently, a new microprocessor was developed which combines a simple but very fast host processor (called a “morph host”) and software (referred to as “code morphing software”) to execute application programs designed for a processor different from the morph host processor. The morph host processor executes the code morphing software, which dynamically translates the application programs into host processor instructions able to accomplish the purpose of the original software. As the instructions are translated, they are stored in a translation buffer where they may be accessed and executed without further translation. Although the initial translation of a program is slow, once a program is translated, many of the steps normally required for hardware to execute it are eliminated. The new microprocessor has proven able to execute translated “target” instructions at a rate equivalent to that attained by the “target” processor for which the programs were designed.
In order to be able to run programs designed for other processors at a rapid rate, the morph host processor includes a number of hardware enhancements. One of these enhancements is a gated store buffer which holds memory stores generated during execution of a sequence of translated host instructions. A second enhancement is a set of host registers (in addition to normal working registers) which hold the state of the target processor at the beginning of any sequence of target instructions being translated. Sequences of target instructions spanning known states of the target processor are translated into host instructions and executed. In one embodiment, if the translated instructions execute without raising an exception, the memory stores held in the gated store buffer are committed to memory; and the registers holding the target state are updated to the target state at the point at which the sequence completed executing. This is referred to as a “commit” operation.
If an exception occurs during the execution of the sequence of host instructions, processing stops; the side effects of the attempted execution may be discarded; and execution may be returned (“rolled back”) to the beginning of the sequence of target instructions, at which point a known state of the target processor exists. This allows very rapid and accurate handling of exceptions, a result which has never been accomplished by the prior art.
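The commit and rollback operations described above may be illustrated by a minimal software sketch. The class `HostState`, its fields, and the function `run_sequence` are hypothetical names chosen for illustration only; the actual mechanism is implemented in hardware, and this model assumes only that stores are buffered and that a saved copy of the target registers exists.

```python
class HostState:
    """Toy model of working registers, saved target registers, and a gated store buffer."""

    def __init__(self, registers, memory):
        self.working = dict(registers)    # registers modified during a sequence
        self.committed = dict(registers)  # target state at the last commit
        self.memory = dict(memory)        # memory visible outside the sequence
        self.store_buffer = []            # pending (address, value) stores

    def store(self, address, value):
        # Stores are held in the buffer and do not reach memory until commit.
        self.store_buffer.append((address, value))

    def commit(self):
        # Drain buffered stores to memory in order, then save the register state.
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()
        self.committed = dict(self.working)

    def rollback(self):
        # Discard buffered stores and restore the last committed registers.
        self.store_buffer.clear()
        self.working = dict(self.committed)


def run_sequence(state, sequence):
    """Run one translated sequence; commit on success, roll back on an exception."""
    try:
        sequence(state)
    except Exception:
        state.rollback()
        return False
    state.commit()
    return True
```

In this sketch, a sequence that raises an exception leaves both memory and the saved registers exactly as they were at the start of the sequence, which is the property on which the speculation discussed in this section relies.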
Speculation is a term applied to methods for attempting to execute a process even though it is not known with absolute certainty that the process will execute without error. Rather than taking the steps necessary to provide absolute certainty, speculative execution attempts to execute those processes which will very likely execute without error. The presumption is that the total time required for those speculative executions which succeed, together with any fix up required by those which do not succeed, will be less than the time required to assure that all processes attempted will surely succeed.
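The trade-off just described may be expressed as a simple expected-time comparison. The following sketch is an illustrative model only; the function name, its parameters, and the assumption that every attempt pays the speculative cost while only failed attempts pay the fix-up cost are not taken from the description above.

```python
def speculation_pays_off(p_success, t_speculative, t_fixup, t_guaranteed):
    """Return True when the expected speculative time beats the guaranteed time.

    Assumed model: every attempt costs t_speculative; a failed attempt
    (probability 1 - p_success) additionally costs t_fixup.
    """
    expected = t_speculative + (1.0 - p_success) * t_fixup
    return expected < t_guaranteed
```

Under this model, a speculation which succeeds 95 percent of the time, costs 10 time units per attempt, and requires 100 units of fix up on failure is worthwhile against a guaranteed approach costing 20 units; the same speculation succeeding only half of the time is not.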
It will be noted that the method by which the new microprocessor handles translations by buffering their side effects until execution has been completed enables very rapid execution by speculating that translations will be correct. The availability of this method using the same gated store buffer circuitry and saved register state for rapidly and efficiently handling host level exceptions and faults allows the new microprocessor to speculate on the outcome of other operations.
For example, many processors (including embodiments of the new microprocessor) include a plurality of execution units which are capable of functioning in parallel. In order to make use of multiple functional units and pipelined functional units as well as to mask operation latency, independent operations are reordered and scheduled. Such processors often utilize a scheduler to reorder instructions so that sequences may more efficiently utilize the units. To find a sufficient pool of independent operations, the scheduler must consider operations from multiple basic blocks, which means that sequences which include branch operations must be scheduled. Because branch operations are frequent (approximately one in every six operations), if scheduling is limited to operations between branches, there are not enough independent operations to fully utilize the fine-grain parallelism inherent in pipelined (RISC) or multi-functional unit (superscalar, VLIW) processors.
By utilizing a software scheduler to reorder the naively translated instructions before executing those instruction sequences and by taking advantage of the hardware support for rollback and commit, the new microprocessor is able to accomplish more aggressive reordering than has been attempted by the prior art. When such a reordered sequence of instructions executes to produce a correct result, the reordered sequence may be committed to the translation buffer and target state may be updated. If the reordered sequence generates an exception while executing, then the state of the processor may be rolled back to target state at the beginning of the sequence and a more conservative approach taken in translating the sequence.
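The policy of attempting an aggressively reordered sequence and falling back to a more conservative one after a rollback may be sketched as follows. All names here (`aggressive`, `conservative`, `execute_with_fallback`, and the registers `a`, `b`, and `r`) are hypothetical, and a copied dictionary stands in for the hardware-saved target state.

```python
def aggressive(regs):
    # Reordered form: the divide is hoisted above the zero check that
    # guarded it in the original order, so it may fault.
    quotient = regs["a"] // regs["b"]  # raises ZeroDivisionError when b == 0
    regs["r"] = quotient if regs["b"] != 0 else 0


def conservative(regs):
    # Original order: the divide executes only on the path that needs it.
    if regs["b"] != 0:
        regs["r"] = regs["a"] // regs["b"]
    else:
        regs["r"] = 0


def execute_with_fallback(target_state, fast, safe):
    """Try the reordered sequence; on an exception roll back and run the safe one."""
    snapshot = dict(target_state)  # known target state at the start of the sequence
    working = dict(snapshot)
    try:
        fast(working)
        path = "aggressive"
    except ArithmeticError:
        working = dict(snapshot)   # rollback: discard all partial effects
        safe(working)
        path = "conservative"
    target_state.update(working)   # commit
    return path
```

With a nonzero divisor the reordered sequence commits directly; with a zero divisor the hoisted divide faults, its effects are discarded, and the conservative sequence produces the result.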
Schedulers have always found reordering sequences of instructions which include branch operations to be difficult. For example, if a sequence of instructions includes a branch, and one path is usually taken at the branch, then the sequence of instructions including that path may be reordered to run more rapidly on the presumption that that path will be taken. Such reordering may move an operation from a point following a branch to a point just before the branch in order to utilize a processor execution unit which would otherwise not be utilized during that period. Moving such an instruction may have no effect other than to speed operations if the presumed path is followed. However, moving the instruction may cause problems if the presumed path is not followed. For example, the ordering may cause a change in a register value for use in the presumed path following the branch; if another path is taken, the value may be incorrect on that path. There are many other instances of problems generated by reordering operations around branches.
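The register hazard described above may be illustrated with a small sketch. The register names `r1` and `r2` and both functions are hypothetical; the sketch assumes only that the operation moved above the branch overwrites a register which the less frequently taken path still reads.

```python
def original(regs, take_common):
    """Original order: r1 is overwritten only on the commonly taken path."""
    if take_common:
        regs["r1"] = regs["r2"] + 1  # operation following the branch
        return regs["r1"] * 2
    return regs["r1"]                # other path reads the old value of r1


def reordered(regs, take_common):
    """Reordered: the write to r1 is moved above the branch."""
    regs["r1"] = regs["r2"] + 1      # now executes on both paths
    if take_common:
        return regs["r1"] * 2
    return regs["r1"]                # other path now sees the overwritten r1
```

Both functions agree when the presumed path is taken, but they return different values when the other path is taken, because the moved operation has changed `r1` before the branch is resolved.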
The prior art has typically taken care of problems of this sort by using less aggressive speculation over shorter sequences of operations, by renaming operations which have been reordered to eliminate value changes, and by providing “compensation” code to repair errors which may be caused by the reordering which has been done. All of these approaches optimize the common path at the expense of less frequently utilized execution paths.
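Two of the prior art remedies named above, renaming and compensation code, may be sketched for a single operation moved above a branch. The function names, the registers `r1` and `r2`, and the temporary `t0` are hypothetical and serve only to illustrate the general techniques.

```python
def reordered_with_renaming(regs, take_common):
    """Renaming: the moved operation writes a fresh temporary instead of r1."""
    regs["t0"] = regs["r2"] + 1  # moved above the branch, r1 left untouched
    if take_common:
        regs["r1"] = regs["t0"]  # extra copy on the common path
        return regs["r1"] * 2
    return regs["r1"]            # other path still sees the original r1


def reordered_with_compensation(regs, take_common):
    """Compensation: the other path repairs the effect of the moved operation."""
    saved = regs["r1"]
    regs["r1"] = regs["r2"] + 1  # moved above the branch, overwrites r1
    if take_common:
        return regs["r1"] * 2
    regs["r1"] = saved           # compensation code on the other path
    return regs["r1"]
```

In both sketches the less frequently taken path returns the original value of `r1`, but each remedy adds work: a copy on the common path in the first case, and a save and a repair in the second.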
It is desirable to provide a new method of more aggressively reordering and scheduling operations in sequences including branch operations while eliminating errors and accelerating the speed of a microprocessor.
Moreover, branch operations are themselves often a bottleneck because they both restrict scheduling and consume instruction issue bandwidth. It is desirable to provide methods for scheduling which eliminate many of the branches normally encountered.
Branches are not the only difficulty in optimizing sequences of instructions; similar problems occur because optimized sequences may be interrupted during execution by processes which affect the outcome of execution of the optimized sequence. For example, it may be desirable to optimize a sequence of instructions providing a loop by removing an invariant from the loop. A value stored at a memory address may be loaded each time the loop iterates, so that removing the load and performing it once before the loop significantly shortens the overall execution process so long as the value loaded remains constant. However, if the optimized loop is interrupted by an independent process such as a direct memory access (DMA) which writes a new value to the memory address read by the removed operation, then the results produced by the optimized loop will be incorrect. Similarly, a loop may store to a memory address on each iteration. If only the store on the last loop iteration is used by the process, then the store may be removed from the loop and placed in an epilogue to the loop. However, if the optimized loop is interrupted by an independent process which reads the value at the memory address stored to by the removed operation, then the independent process will read an incorrect value. For this reason, prior art processes have been unable to optimize sequences of instructions by removing an invariant operation from a loop where the underlying memory is volatile.
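The load case described above may be illustrated with a short sketch. The function names are hypothetical, and the dictionary `dma_writes`, which maps an iteration number to a value written by an independent process just before that iteration, is an assumed stand-in for a DMA transfer interrupting the loop.

```python
def sum_with_reload(memory, address, iterations, dma_writes):
    """Unoptimized loop: the value is loaded from memory on every iteration."""
    total = 0
    for i in range(iterations):
        if i in dma_writes:
            memory[address] = dma_writes[i]  # independent write during the loop
        total += memory[address]             # load performed inside the loop
    return total


def sum_with_hoisted_load(memory, address, iterations, dma_writes):
    """Optimized loop: the load is performed once, before the loop."""
    value = memory[address]                  # load removed from the loop
    total = 0
    for i in range(iterations):
        if i in dma_writes:
            memory[address] = dma_writes[i]  # independent write during the loop
        total += value                       # stale if memory has changed
    return total
```

With no intervening write the two loops produce the same sum. When the value at the address is changed partway through the loop, the unoptimized loop observes the new value while the optimized loop continues to use the stale one, which is the incorrect result described above.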
It is desirable to provide a new method for optimizing sequences of instructions by removing an invariant operation from a loop where the underlying memory is volatile.