Recently, a new microprocessor was developed which combines a simple but very fast host processor (called a “morph host”) and software (referred to as “code morphing software”) to execute application programs designed for a “target” processor having an instruction set different than the instruction set of the morph host processor. The morph host processor executes the code morphing software to translate the application programs into morph host processor instructions which accomplish the purpose of the original target software. As the target instructions are translated, the new host instructions are both executed and stored in a translation buffer where they may be accessed without further translation. Although the initial translation of a program is slow, once translated, many of the steps normally required for hardware to execute a program are eliminated. The new microprocessor has demonstrated that a simple fast processor designed to expend little power is able to execute translated “target” instructions at a rate equivalent to that of the “target” processor for which the programs were designed.
In order to be able to run programs designed for other processors at a rapid rate, the morph host processor includes a number of hardware enhancements. One of these enhancements is a gated store buffer which resides between the host processor and the translation buffer. A second enhancement is a set of host registers (in addition to normal working registers) which hold the known state of the target processor existing prior to any sequence of target instructions being translated. Memory stores generated while a sequence of translated morph host instructions executes are held in the gated store buffer. If the morph host instructions execute without raising an exception, the target state at the beginning of the sequence of instructions is updated to the target state at the point at which the sequence completed and the memory stores are committed to memory.
On the other hand, if an exception is raised during execution of the morph host instructions, execution stops, the host processor rolls back operation to the last point at which target state was known to be correct, and execution proceeds from that point utilizing a process (an interpreter in one embodiment) which accomplishes step-by-step translation of each of the target instructions. This process essentially single steps through the execution of target instructions. As each target instruction is translated and executed, the state of the target processor is brought up to date. The process continues during the translation and execution of the remainder of the sequence of target instructions until the exception recurs. When the exception recurs, target state will be correct for handling the exception. The use of these hardware enhancements with the rollback process allows exceptions to be accurately handled while dynamic translation of target instructions is taking place. The improved processor is described in detail in U.S. Pat. No. 5,958,061, entitled Combining Hardware And Software to Provide An Improved Microprocessor, R. Cmelik et al., issued Feb. 29, 2000, and assigned to the assignee of the present invention.
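The commit and rollback behavior described above may be illustrated by a minimal software sketch. The sketch is only a toy model under stated assumptions: the class `GatedStoreBuffer`, the function `run_translation`, the register name `eax`, and the two sample operations are hypothetical names introduced for illustration, and a Python `ArithmeticError` stands in for a host exception. It does not represent the hardware of the referenced patent.

```python
class GatedStoreBuffer:
    """Toy model: stores are held until commit, or discarded on rollback."""

    def __init__(self, memory):
        self.memory = memory   # committed memory (dict: address -> value)
        self.pending = []      # uncommitted stores, in program order

    def store(self, address, value):
        self.pending.append((address, value))

    def commit(self):
        # Release every held store to memory, oldest first.
        for address, value in self.pending:
            self.memory[address] = value
        self.pending.clear()

    def rollback(self):
        # Discard every held store; memory is left untouched.
        self.pending.clear()


def run_translation(ops, target_state, memory):
    """Run a translated sequence; True if committed, False if rolled back."""
    buffer = GatedStoreBuffer(memory)
    working = dict(target_state)   # working registers; target_state stays known-good
    try:
        for op in ops:
            op(working, buffer)
    except ArithmeticError:        # stand-in for a host exception (assumption)
        buffer.rollback()
        return False
    target_state.clear()
    target_state.update(working)   # update the known target state
    buffer.commit()                # and commit the gated stores
    return True


def increment_and_store(regs, buf):    # hypothetical translated operation
    regs["eax"] += 1
    buf.store(0x10, regs["eax"])


def faulting_op(regs, buf):            # hypothetical operation that raises
    buf.store(0x20, 99)
    regs["eax"] = 1 // 0               # ZeroDivisionError is an ArithmeticError


memory = {}
target_state = {"eax": 1}
first = run_translation([increment_and_store], target_state, memory)
second = run_translation([increment_and_store, faulting_op], target_state, memory)
```

In this sketch the first sequence commits, while the second is rolled back, leaving both `target_state` and `memory` as they were after the first commit. A fuller model would then re-execute the rolled-back target instructions one at a time, as the text describes.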
A problem which has occurred with the new processor relates to the execution of floating point operations translated from instructions originally programmed for a target processor. Floating point processors execute some mathematical operations quite rapidly. For example, multiplication of floating point values requires simply adding exponents consisting of zeroes and ones and multiplying the mantissas by shifting a binary point. On the other hand, addition of mantissas requires a pre-normalization step of aligning binary points, an addition, and finally a post-normalization step of realigning the binary point. Consequently, most floating point operations require a number of clock cycles and are therefore somewhat slow. In fact, all operations other than square root and division require four clock cycles to execute utilizing the new microprocessor. Division and square root operations take an indeterminate amount of time and may require halting the operations of the processor until they complete.
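The three steps of floating point addition mentioned above (pre-normalization, addition, post-normalization) may be sketched as follows. This is an illustrative model only, assuming positive values represented as an integer mantissa `m` and exponent `e` with value `m * 2**e`, an 8-bit mantissa width, and truncation in place of the rounding a real unit would perform. The function name `fp_add` and its parameters are hypothetical.

```python
def fp_add(m1, e1, m2, e2, bits=8):
    """Add two positive values m * 2**e held as (mantissa, exponent) pairs."""
    # Pre-normalization: align binary points by shifting the operand with
    # the smaller exponent to the right. Bits shifted out are lost.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)

    # Addition of the aligned mantissas.
    mantissa = m1 + m2
    exponent = e1

    # Post-normalization: realign so the mantissa fits in `bits` bits again.
    while mantissa >= (1 << bits):
        mantissa >>= 1
        exponent += 1
    return mantissa, exponent


same_exponent = fp_add(192, 0, 128, 0)   # 192 + 128 = 320 = 160 * 2**1
mixed_exponent = fp_add(128, 4, 128, 0)  # 2048 + 128 = 2176 = 136 * 2**4
```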
Because floating point operations require a number of clock cycles to execute, most modern floating point processors (including the floating point processor unit of the new microprocessor) pipeline floating point operations. Pipelining executes a number of floating point operations in parallel and usually starts a new floating point operation on each succeeding clock cycle. The effect of running operations in parallel which start on sequential clocks is to produce one floating point result for each clock cycle during most sequences of floating point operations.
Modern floating point processors not only pipeline operations but also attempt to reorder floating point operations to attain even greater speed. However, floating point operations are difficult to reorder. Not only do floating point processors produce a numerical result as output for each operation, they also typically provide a number of status bits which indicate whether the result should raise an exception. These status bits indicate whether an operation caused an overflow or an underflow, whether an operation was invalid, whether an operand was not in a normal number format (i.e., was “denormal”), whether the operation attempted a divide by zero, and whether the result is inexact. Each of these conditions could require exceptional handling in order for the result to be correct. A user may arm or disarm individual exceptions to produce the results desired. The precise exceptions are defined by the IEEE 754 floating point standard.
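The notions of status bits and of arming or disarming individual exceptions may be illustrated, by analogy only, with the Python standard library `decimal` module. That module implements decimal rather than binary arithmetic and is not the floating point unit discussed here, but its context exposes comparable status flags and per-exception traps. The precision and exponent limits chosen below are arbitrary assumptions.

```python
from decimal import Context, Decimal, DivisionByZero, Inexact, Overflow

# A context with every exception disarmed (no traps enabled).
ctx = Context(prec=9, Emax=99, Emin=-99, traps=[])

# Disarmed: each operation still delivers a result, and sets a status flag.
third = ctx.divide(Decimal(1), Decimal(3))           # cannot be represented exactly
inexact_flag = bool(ctx.flags[Inexact])

big = ctx.multiply(Decimal("1E60"), Decimal("1E60"))  # exceeds Emax
overflow_flag = bool(ctx.flags[Overflow])

inf = ctx.divide(Decimal(1), Decimal(0))             # result is Infinity
div_zero_flag = bool(ctx.flags[DivisionByZero])

# Armed: the same divide by zero now raises instead of returning a result.
ctx.traps[DivisionByZero] = True
try:
    ctx.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True
```

The flags in this sketch remain set until explicitly cleared (for example with `ctx.clear_flags()`), so they accumulate over a sequence of operations rather than describing only the most recent one.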
When translating target instructions designed for execution by a target processor, it is necessary to provide instructions which produce the same results as would the target processor. For example, if the target instructions are designed to be executed by an Intel X86 processor, then the translated instructions should produce the same results as would be produced by an X86 processor. The early Intel X86 processors (more particularly, the X87 floating point unit) handled floating point operations one at a time and generated both a result and status bits for that result immediately after each individual floating point operation. X86 processors have continued to function in this manner.
Consequently, when translating X86 floating point instructions, the new processor must provide correct status bits for each result as that result issues.
Providing correct status bits with each result as the result issues is especially difficult when pipelining floating point operations since the status bits for a floating point operation are not known until the floating point operation completes, typically four cycles after commencing. The prior art has found no solution to the problem of producing accurate status bits with each result produced other than to terminate pipelining of floating point operations and handle floating point operations one at a time.
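The difficulty may be made concrete with a small timing sketch. It assumes, for illustration only, that one operation issues per clock cycle, that operation i issues on cycle i, and that its status becomes known four cycles after it commences, as stated above. The function name `ops_issued_before_status_known` is hypothetical.

```python
LATENCY = 4  # cycles from commencing an operation to knowing its status


def ops_issued_before_status_known(op_index, num_ops, latency=LATENCY):
    """Later operations already issued when the status of op_index is known.

    Assumes operation i issues on cycle i, one operation per cycle.
    """
    status_cycle = op_index + latency
    last_issued = min(status_cycle - 1, num_ops - 1)
    return list(range(op_index + 1, last_issued + 1))


early = ops_issued_before_status_known(0, 10)   # operations 1, 2 and 3
late = ops_issued_before_status_known(8, 10)    # only operation 9 remains
```

Under these assumptions, by the time the status of one operation is known, up to three later operations have already entered the pipeline, so the status cannot be delivered together with the result without either waiting or some further mechanism.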
Providing correct status bits with each result while pipelining operations in the new processor is difficult not only because of the delay in generating status bits; the condition of the status bits also complicates the handling of floating point operations which have been reordered to a position in a sequence of operations at which state is to be committed by the new processor. In order to function correctly, the status bits must be correct not only for those floating point operations which have executed in their normal order but also for those floating point operations which have been reordered before state including the status bits can be committed.
Although the prior art has not been able to provide correct status bits without stopping the pipeline, there have been different solutions for terminating the pipeline. For example, the Alpha processor designed by Digital Equipment Corporation simply ignores the problem of issuing correct status together with the result of a floating point operation in order to run floating point operations at a speed attainable by pipelining. However, a programmer may insert commands into a program to be executed by an Alpha processor which select sequences of floating point operations which are to produce precise floating point status. When a program reaches a command inserted by a programmer to materialize precise floating point state, the processor stalls and drains its pipeline (finishes executing floating point instructions in flight) so that after the pipeline is drained, the floating point status corresponds to all previously executed floating point instructions. Exceptions, if pending and enabled, are raised at this point; and only after the exceptions have been handled can subsequent floating point instructions start to execute.
In a situation in which exceptions must be raised precisely after any floating point instruction, each floating point instruction must be followed by the special commands, effectively disabling the pipelining and reordering of floating point instructions. These commands allow a programmer to decide which floating point operations should execute accurately even though very slowly. However, since a programmer will not necessarily understand where status exceptions may be raised by floating point operations, long sequences of operations may often have to be selected for this slow mode of operation.
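The cost of draining the pipeline may be estimated with a simple model. The sketch below assumes a four cycle latency, one operation issued per cycle between drains, no additional cost for the draining command itself, and no exceptions actually raised. The function name `cycles_with_barriers` and its parameters are hypothetical.

```python
def cycles_with_barriers(num_ops, ops_per_barrier, latency=4):
    """Cycles to run num_ops when the pipeline drains every ops_per_barrier ops."""
    if ops_per_barrier < 1:
        raise ValueError("ops_per_barrier must be at least 1")
    total = 0
    remaining = num_ops
    while remaining > 0:
        group = min(remaining, ops_per_barrier)
        # Fill the pipeline once, then obtain one result per cycle,
        # and wait for the last result before starting the next group.
        total += latency + group - 1
        remaining -= group
    return total


fully_pipelined = cycles_with_barriers(100, 100)   # no drain until the end
drain_every_ten = cycles_with_barriers(100, 10)
drain_every_op = cycles_with_barriers(100, 1)      # precise after every operation
```

Under these assumptions, 100 operations take 103 cycles when fully pipelined, 130 cycles when drained after every tenth operation, and 400 cycles when drained after every operation, which is the same as not pipelining at all.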
Intel Corporation takes a different approach which it calls safe instruction recognition. Modern Intel X86 processors pipeline floating point operations but utilize complex circuitry for evaluating floating point numbers prior to executing any floating point operation to determine whether those numbers might produce results giving rise to the exceptions denoted by the status bits. For each set of floating point numbers utilized in an operation, a decision is made (1) that these numbers certainly will not generate an exception and thus may be processed using pipelining or (2) that it is not certain that the numbers will not generate an exception so that the pipeline must be stalled and the operations processed one by one. The approach allows pipelining but requires a significant increase in circuitry to pre-evaluate floating point operands and operations and slows operations through its conservative approach.
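The kind of conservative pre-evaluation described above may be sketched in software for a single case, the multiplication of two double precision values. The sketch is not Intel's circuitry; it is a hypothetical illustration which screens only for overflow and underflow by inspecting operand exponents, and which treats every zero, denormal, infinite, or NaN operand as unsafe. The function name `is_safe_multiply` is an assumption.

```python
import math
import sys


def is_safe_multiply(x, y):
    """True only if x * y certainly cannot overflow or underflow."""
    for value in (x, y):
        if (not math.isfinite(value) or value == 0.0
                or abs(value) < sys.float_info.min):
            return False   # special, zero, or denormal operand: slow path

    # frexp gives value = m * 2**e with 0.5 <= |m| < 1, so the product
    # lies in the range [2**(ex + ey - 2), 2**(ex + ey)).
    _, ex = math.frexp(x)
    _, ey = math.frexp(y)
    total = ex + ey
    upper = sys.float_info.max_exp - 1   # even after rounding, stays finite
    lower = sys.float_info.min_exp + 1   # stays at or above the smallest normal
    return lower <= total <= upper


ordinary = is_safe_multiply(1.5, 2.0)            # pipelined path
overflowing = is_safe_multiply(1e200, 1e200)     # slow path, would overflow
borderline = is_safe_multiply(1e154, 1e154)      # slow path, yet result is finite
```

The last case shows the cost of the conservative decision: the product of `1e154` and `1e154` is finite, but the exponent check alone cannot prove it, so the operation is sent down the slow path anyway.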
Neither of these approaches provides an optimum result which allows a floating point processor to execute as rapidly as possible utilizing full pipelining techniques while assuring that correct status for each individual floating point operation is produced.
It is desirable to improve the operational speed of the improved microprocessor by increasing the speed of floating point operations.