The present invention relates to digital computer systems, and more particularly but not by way of limitation, to methods and an apparatus for processing instructions in such systems.
Microprocessors exist that implement a reduced instruction set computing (RISC) instruction set architecture (ISA) and an independent complex instruction set computing (CISC) ISA by emulating the CISC instructions with instructions native to the RISC instruction set. Instructions from the CISC ISA are called “macroinstructions.” Instructions from the RISC ISA are called “microinstructions.”
Streaming Single-Instruction Multiple-Data Extensions (SSEs) have been developed to enhance the instruction set of the latest generation of certain computer architectures, for example the IA-32 architecture. The SSEs include a new set of registers, new floating point data types, and new instructions. Specifically, the SSEs comprise eight 128-bit Single-Instruction Multiple-Data (SIMD) floating point registers (XMM0 through XMM7) that can be used to perform calculations and operations on floating point data. These XMM registers are shown in FIG. 1A. Each 128-bit floating point register contains four packed 32-bit single precision (SP) floating point (FP) numbers. The structure of the packed 32-bit SP FP numbers is illustrated in the example of FIG. 1B, where four 32-bit SP FP numbers (numbered 0 through 3) are shown as if stored in the XMM2 SSE register. In architectures designed to support the SSEs, such as its native architecture, a single instruction of the SSE instruction set operates in parallel on the four 32-bit SP FP numbers in a particular XMM register.
The SSEs also include a status and control register called the MXCSR register. The format of the MXCSR is illustrated in the example of FIG. 1C. The MXCSR register may be used to selectively mask or unmask exceptions. Specifically, bits 7-12 of the MXCSR register may be used by a programmer to selectively mask or unmask a particular exception. Masked exceptions are those exceptions that a programmer wishes to be handled automatically by the processor, which may provide a default response. Unmasked exceptions, on the other hand, are those exceptions that the programmer wishes to be handled by invocation of an interrupt or an operating system handler. This invocation of the handler transfers control to the operating system (such as Windows by Microsoft), where the problem may be corrected or the program terminated.
The MXCSR register may also be used to keep track of the status of exception flags. Bits 0-5 of the MXCSR register indicate whether any of six exceptions have occurred in the execution of an SSE instruction. Those exceptions include the following: invalid operation (I), divide-by-zero (Z), denormal operation (D), numeric overflow (O), numeric underflow (U), or inexact result (P). The flags are “sticky,” meaning that once they are set, they are not cleared by any subsequent SSE instruction, even if one is performed without exception. The status flags can only be cleared by a special instruction, usually issued from the operating system.
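By way of illustration only, the mask and flag fields described above may be modeled in a few lines of Python. The sketch below is not part of the described apparatus. The helper names are hypothetical, and the ordering of the individual bits within bits 0-5 and bits 7-12 is taken from the Intel manual rather than from the text above, which gives only the bit ranges.

```python
# Flag bit positions within MXCSR bits 0-5. The per-bit order is an
# assumption drawn from the Intel manual; the text gives only the range.
FLAG_BITS = {"I": 0, "D": 1, "Z": 2, "O": 3, "U": 4, "P": 5}
MASK_SHIFT = 7  # the mask bit for an exception sits 7 bits above its flag bit


def is_masked(mxcsr, exc):
    """Return True if exception `exc` is masked (mask bit in bits 7-12 set)."""
    return bool((mxcsr >> (FLAG_BITS[exc] + MASK_SHIFT)) & 1)


def raise_flags(mxcsr, excs):
    """Sticky update: OR the new flags in; existing flags are never cleared."""
    for exc in excs:
        mxcsr |= 1 << FLAG_BITS[exc]
    return mxcsr


def flags_set(mxcsr):
    """Return the set of exception letters whose flag bit is currently set."""
    return {exc for exc, bit in FLAG_BITS.items() if (mxcsr >> bit) & 1}
```

In this model, a register value of 0x1F80 has all six exceptions masked and no flags set. A later instruction that raises no exception leaves earlier flags in place, which is the sticky behavior described above.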
The exception flags of FIG. 1C are the result of a bitwise logical-OR operation on all four of the 32-bit SP FP operations that are performed on a particular 128-bit XMM register (one operation on each of the four 32-bit SP FP numbers). Thus, if an exception occurs as to any one of the four 32-bit SP FP numbers, the exception flag for that particular type of exception will be raised, indicating some type of problem has occurred in the system. The invalid operation (I), divide-by-zero (Z), and denormal operation (D) exceptions are pre-computation exceptions, meaning that they are detected before any arithmetic or logical operations occur. That is, they can be detected without doing any computations. The other three exceptions, numeric overflow (O), numeric underflow (U), and inexact result (P), are post-computation exceptions, meaning that they are detected after the operations have been performed. It is possible for an operation performed on a sub-operand (i.e., one of the four operands in a 128-bit XMM register) to raise multiple flags.
SSEs have the following rules for exceptions:
1. When an unmasked exception occurs, the processor executing the instruction will not change the contents of the XMM register. In other words, results will not be committed or stored until it is known that no unmasked exceptions have occurred with respect to any of the four 32-bit SP FP numbers.
2. If there is a masked exception, all exception flags are updated.
3. In the case of unmasked pre-computation exceptions, all flags relating to pre-computation exceptions, whether masked or unmasked, will be updated. However, no subsequent computations are permitted, meaning that no post-execution exceptions can or will occur. This, of course, means that no post-execution exception flags will change or be updated.
4. In the case of unmasked post-computation exceptions, all post-execution conditions, whether masked or unmasked, will be updated, as will all pre-computation exceptions. Any pre-computation exceptions will be masked exceptions only because, if the pre-computation exception was unmasked, under Rule No. 3 above, no further computations would have been permitted.
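The four rules above may be summarized, for purposes of illustration only, by the following Python sketch. It is a simplified behavioral model under stated assumptions, not the processor's implementation. The function name and its inputs are hypothetical: the per-element pre-computation and post-computation exceptions are supplied as data rather than detected from operands.

```python
PRE = {"I", "Z", "D"}    # pre-computation exceptions
POST = {"O", "U", "P"}   # post-computation exceptions


def resolve(lane_pre, lane_post, unmasked):
    """Apply the four rules to one SSE instruction.

    lane_pre / lane_post: one set of exception letters per 32-bit element.
    unmasked: the set of exception letters that are unmasked.
    Returns (flags_to_update, commit_result).
    """
    # Flags are the logical OR over all four elements.
    pre = set().union(*lane_pre)
    if pre & unmasked:
        # Rule 3: update pre-computation flags only; no computation occurs,
        # so no post-computation flag can change. Rule 1: nothing is committed.
        return pre, False
    post = set().union(*lane_post)
    # Rules 2 and 4: all raised flags are updated.
    # Rule 1: the result is committed only if no unmasked exception occurred.
    return pre | post, not (post & unmasked)
```

For example, with every exception masked, a denormal operand in one element and an inexact result in another yield the flags D and P and a committed result. If divide-by-zero is unmasked and occurs in any one element, only the Z flag is updated and nothing is committed.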
Further information regarding streaming SIMD extensions may be found in the Intel Architecture Software Developer's Manual (1999), Volumes 1 through 3, Intel Order Numbers 243190, 243191, 243192, which are hereby incorporated by reference.
In many architectures, provisions have not been made for the SSE instructions. In these non-native architectures, the eight 128-bit floating point XMM registers capable of containing four 32-bit SP FP numbers are not available. In some non-native architectures, the eight 128-bit XMM registers may be mapped onto 16 floating point registers (e.g., IA-64 registers) that may be less than 128 bits and more than 64 bits wide. Specifically, some architectures use 82-bit registers to hold two 32-bit SP FP numbers (the bits in excess of 64 may be used for the special encoding used to indicate that the register holds SIMD-type 32-bit SP FP numbers). An example is shown in FIG. 1D. Note that the four 32-bit SP FP numbers 0-3 stored in the XMM2 register of the SSE native environment (FIG. 1B) are now stored in two 82-bit registers, XMM2_Low and XMM2_High, containing the “low half of the XMM2 register” and the “high half of the XMM2 register,” respectively. This makes parallel execution of an operation on each of the four 32-bit SP FP numbers difficult.
Thus, in this non-native environment, the SSE instructions must be executed by emulation. Specifically, operations may first be performed on two of the four 32-bit SP FP numbers (in parallel) and then be performed on the remaining two 32-bit SP FP numbers (again, in parallel). Operations may alternatively be performed on only one or at least three of the 32-bit SP FP numbers. For example, an operation may be performed on the operands in the low half, XMM2_Low, and then on the high half, XMM2_High. However, given the SSE rules for handling exceptions and updating exception flags, problems arise when emulating SSE instructions in this partially parallel, partially sequential manner. For example, consider a set of instructions being performed on the low half and high half of FIG. 1D:
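The division of one 128-bit XMM value into a low half and a high half may be pictured with the following Python sketch, offered as an illustration only. The function names are hypothetical. The sketch assumes that numbers 0 and 1 occupy the low half and numbers 2 and 3 the high half, and it does not model the 82-bit register format or its special encoding bits.

```python
import struct


def split_xmm(values):
    """Split four SP FP numbers into (low, high), two numbers per half."""
    if len(values) != 4:
        raise ValueError("an XMM register holds exactly four SP FP numbers")
    raw = struct.pack("<4f", *values)     # 16 bytes = 128 bits
    low = struct.unpack("<2f", raw[:8])   # numbers 0 and 1 (low 64 bits)
    high = struct.unpack("<2f", raw[8:])  # numbers 2 and 3 (high 64 bits)
    return low, high


def join_xmm(low, high):
    """Reassemble the four-number view from the two halves."""
    return tuple(low) + tuple(high)
```

Because each half lives in its own register in this model, an operation on all four numbers becomes two separate register operations, which is the source of the difficulty noted above.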
XMM2:=OP(XMM3, XMM4)
emulated by
XMM2_Low:=OP(XMM3_Low, XMM4_Low)
XMM2_High:=OP(XMM3_High, XMM4_High)
Assume that the first instruction is executed without an unmasked exception as to the operands in the low halves, XMM3_Low and XMM4_Low. The results of this operation are then properly committed in XMM2_Low. Assume now that execution of the second instruction on the high halves results in a pre-computation unmasked exception. According to the SSE rules, no subsequent operations are then to be performed on any of the four 32-bit SP FP numbers because of that pre-computation unmasked exception. Here, however, the results of the operation on the low halves have already been committed to register XMM2_Low, in violation of the SSE rules. This corrupts the data in XMM2_Low and cannot be allowed to happen.
Prior machines have solved this problem by implementing a “back-off” mechanism that allowed them to speculatively change architectural state when the first microinstruction completed, then “undo” the change if the second microinstruction had an exception. This back-off mechanism may be hard to implement in certain machines, especially in machines that do not implement register renaming. In many systems, the use of a back-off or undo operation is difficult or limited, for various reasons.
One way of successfully emulating the SSEs and preventing this rule violation is to use a “shadow” register mechanism. In a shadow register mechanism, the results of a previous, successful operation on the low halves are physically stored in a shadow register. In this case, in the example above, when the exception is detected on the high halves, the results previously stored in the shadow register for the previous operation on the low halves may be re-stored, that is, an “undo” operation on the low halves is performed. The shadow register mechanism, however, is relatively complex. In most systems, there must be at least 16 registers available for storing the results of a previous operation on the low halves, and each must be capable of holding two 32-bit SP FP numbers. Additionally, when an “undo” operation is required, it must be determined which of the shadow registers the desired results are in. This mechanism consumes valuable register space that could otherwise be used more efficiently. Furthermore, a relatively complicated system of pointers and virtual maps is required to store the previous maps.
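The hazard just described may be reproduced, for illustration only, with the following Python sketch. It is a toy model, not any actual machine. The register file is a dictionary, OP is assumed to be a division, and a zero divisor stands in for an unmasked pre-computation exception; all of these choices, and all names, are hypothetical.

```python
class UnmaskedException(Exception):
    """Stands in for an unmasked pre-computation exception."""


def op_half(a, b):
    # Hypothetical OP: element-wise division. A zero divisor is treated as an
    # unmasked divide-by-zero (Z), detected before any computation.
    if any(y == 0.0 for y in b):
        raise UnmaskedException("Z")
    return tuple(x / y for x, y in zip(a, b))


def naive_emulate(regs, dst, src1, src2):
    # The low half is committed before the high half has been examined.
    regs[dst + "_Low"] = op_half(regs[src1 + "_Low"], regs[src2 + "_Low"])
    regs[dst + "_High"] = op_half(regs[src1 + "_High"], regs[src2 + "_High"])


regs = {
    "XMM2_Low": (9.0, 9.0), "XMM2_High": (9.0, 9.0),
    "XMM3_Low": (8.0, 6.0), "XMM3_High": (1.0, 1.0),
    "XMM4_Low": (2.0, 2.0), "XMM4_High": (1.0, 0.0),  # zero divisor in high half
}
try:
    naive_emulate(regs, "XMM2", "XMM3", "XMM4")
except UnmaskedException:
    pass  # by this point XMM2_Low has already been overwritten
```

After the exception, XMM2_High still holds its old contents while XMM2_Low holds the new quotients, so the destination register is in a state that a native SSE implementation would never produce.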
Another way to emulate a particular SSE instruction is to provide a back-off register mechanism. One skilled in the art will recognize that this technique may require a plurality of registers, a multiplexor and a de-multiplexor combination, various other hardware, and a new set of instructions. All of these increase cost and reduce efficiency.
Yet another way to emulate a particular SSE instruction is to execute the instruction with respect to each of the four 32-bit SP FP numbers in the SSE XMM register one at a time and store the results of each execution in temporary registers. When the instruction has been executed with respect to the fourth 32-bit SP FP number, and no unmasked exceptions have occurred, the results may then be committed to the appropriate architectural location and the exception flags updated. This method of emulation requires the addition of a relatively complex micro-code sequence and the use of hardware that could otherwise be used more efficiently, not to mention the number of clock cycles it consumes in executing an instruction four times before results can be committed.
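The one-at-a-time approach described above may be sketched in Python as follows, for illustration only. The function and parameter names are hypothetical, the exception test is supplied by the caller, and exception flag handling is omitted for brevity.

```python
def emulate_one_at_a_time(dst, a, b, op, faults):
    """Execute `op` on each of the four elements in turn.

    dst: current contents of the destination (four numbers).
    faults(x, y): returns True if the element pair raises an unmasked exception.
    Returns (new_destination_contents, committed).
    """
    temps = []        # temporary registers holding uncommitted results
    faulted = False
    for x, y in zip(a, b):   # four separate passes, one per element
        if faults(x, y):
            faulted = True
        else:
            temps.append(op(x, y))
    if faulted:
        return dst, False    # destination left unchanged
    return tuple(temps), True  # commit only after the fourth pass
```

The sketch makes the cost visible: the loop body runs four times and the temporaries must all be held before a single commit can take place.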
Clearly, there exists a need for a method and an apparatus for emulating the SSE instruction set (and other instruction sets) that makes efficient use of existing hardware and that consumes relatively few clock cycles. Additionally, there exists a need for a method and apparatus for determining whether certain problems may occur in the execution of a series of instructions without committing the results of those instructions.
The present invention is a method for processing instructions by decomposing a macroinstruction into at least two microinstructions, executing the microinstructions in parallel on two separate functional units, and linking the microinstructions such that they appear as though they were executed as a single functional unit. The present invention operates by determining whether certain exceptions occur in either of the functional units, according to the SSE rules for exceptions. If an exception does occur in any of the linked microinstructions, then the execution of each of those microinstructions is canceled. This avoids the necessity of a back-off or undo mechanism.
The present invention is also a computer system for processing software instructions, having a processor with a floating point unit, a ROM, and floating point registers. The processor is configured to decompose a macroinstruction into at least two microinstructions, to execute those microinstructions in parallel, and to link those microinstructions such that they appear to execute as a single functional unit. The processor also is capable of identifying and treating exceptions without the use of a back-off or undo mechanism.
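The effect of linking may be pictured, for illustration only, with the following Python sketch. It models only the outcome stated above, namely that an exception in either half cancels both microinstructions; it does not represent the hardware by which the linking is achieved. All names are hypothetical, the exception test is supplied by the caller, and only pre-computation exceptions are modeled.

```python
def execute_linked(regs, dst, src1, src2, op, faults):
    """Run the low and high microinstructions as one linked pair.

    Returns True if both halves were committed, False if both were canceled.
    """
    halves = ("_Low", "_High")
    # Each functional unit examines its own operands; no state is written yet.
    fault_signals = [faults(regs[src1 + h], regs[src2 + h]) for h in halves]
    if any(fault_signals):
        return False  # an exception in either unit cancels both microinstructions
    results = {h: op(regs[src1 + h], regs[src2 + h]) for h in halves}
    for h in halves:
        regs[dst + h] = results[h]  # both halves commit together
    return True


def div_pairs(a, b):
    """Hypothetical OP: element-wise division of two pairs."""
    return tuple(x / y for x, y in zip(a, b))


def has_zero_divisor(a, b):
    """Hypothetical exception test: any zero divisor in the pair."""
    return any(y == 0.0 for y in b)
```

With a zero divisor in the high half only, the call returns False and neither half of the destination is modified, so no undo step is needed in this model.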