1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to floating point units within microprocessors.
2. Description of the Related Art
Most microprocessors must support multiple data types. For example, x86-compatible microprocessors must execute two types of instructions: one set defined to operate on integer data types and another set defined to operate on floating point data types. In contrast with integers, floating point numbers have fractional components and are typically represented in exponent-significand format. For example, the values 2.15×10^3 and −10.5 are floating point numbers while the numbers −1, 0, and 7 are integers. The term “floating point” is derived from the fact that there is no fixed number of digits before or after the decimal point, i.e., the decimal point can float. Using the same number of bits, the floating point format can represent numbers within a much larger range than integer format. For example, a 32-bit signed integer can represent the integers between −2^31 and 2^31−1 (using two's complement format). In contrast, a 32-bit (“single precision”) floating point number as defined by the Institute of Electrical and Electronics Engineers (IEEE) Standard 754 has a range (in normalized format) from 2^−126 to 2^127×(2−2^−23) in both positive and negative numbers.
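By way of illustration only, the two ranges quoted above may be checked numerically. The following Python sketch is not part of the described microprocessor; the names INT32_MIN, INT32_MAX, FLOAT32_MIN_NORMAL, FLOAT32_MAX, and float32_from_bits are hypothetical and chosen for this example.

```python
import struct

# Range of a 32-bit two's complement signed integer.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

# Smallest and largest normalized single precision magnitudes.
FLOAT32_MIN_NORMAL = 2.0 ** -126
FLOAT32_MAX = 2.0 ** 127 * (2 - 2.0 ** -23)


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# 0x7F7FFFFF is the largest finite single precision bit pattern and
# 0x00800000 is the smallest normalized one.
largest = float32_from_bits(0x7F7FFFFF)
smallest = float32_from_bits(0x00800000)
```

The comparison shows that the same 32 bits cover roughly ±2.1×10^9 as an integer but magnitudes up to about 3.4×10^38 as a floating point value.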
FIG. 1 illustrates an exemplary format for an 8-bit integer 100. As the figure illustrates, negative integers are represented using the two's complement format 106. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant 104 of one is then added to the least significant bit (LSB).
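The negation procedure described above (invert all bits, then add one) may be sketched as follows for an 8-bit value. The function name negate_8bit is hypothetical and used only for this illustration.

```python
def negate_8bit(value: int) -> int:
    """Negate an 8-bit two's complement value given as an integer 0..255."""
    ones_complement = ~value & 0xFF       # invert every bit (one's complement)
    return (ones_complement + 1) & 0xFF   # add one at the LSB, keep 8 bits


# 5 is 00000101; its negation, -5, is 11111011 in two's complement.
minus_five = negate_8bit(0b00000101)
```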
FIG. 2 shows an exemplary format for a floating point value. Value 110 is a 32-bit (single precision) floating point number. Value 110 is represented by a significand 112 (23 bits), a biased exponent 114 (8 bits), and a sign bit 116. The base for the floating point number (2 in this case) is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is most common. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. A number in this form is said to be “normalized”. In order to save space, in some formats the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number.
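As a non-limiting illustration of the single precision layout described above, the following Python sketch splits a value into its sign bit, biased exponent, and stored significand bits, and then rebuilds a normalized value with the implied integer bit restored. The function names are hypothetical. The exponent bias of 127 is taken from IEEE Standard 754 and is not stated in the text above.

```python
import struct


def decode_float32(value: float):
    """Return (sign, biased_exponent, fraction) of a single precision value."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF             # 23 stored significand bits
    return sign, biased_exponent, fraction


def rebuild_normalized(sign: int, biased_exponent: int, fraction: int) -> float:
    """Rebuild a normalized value; the integer bit is implied, not stored."""
    significand = 1 + fraction / 2 ** 23
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)


fields = decode_float32(-10.5)  # -10.5 is -1.0101b x 2^3
```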
Floating point values may also be represented in 64-bit (double precision) or 80-bit (extended precision) format. As with the single precision format, a double precision format value is represented by a significand (52 bits), a biased exponent (11 bits), and a sign bit. An extended precision format value is represented by a significand (64 bits), a biased exponent (15 bits), and a sign bit. However, unlike the other formats, the significand in extended precision includes an explicit integer bit. Additional information regarding floating point number formats may be obtained in IEEE Standard 754.
The recent increased demand for graphics-intensive applications (e.g., 3D games and virtual reality programs) has placed greater emphasis on a microprocessor's floating point performance. Given the vast amount of software available for x86 microprocessors, there is particularly high demand for x86-compatible microprocessors having high performance floating point units. Thus, microprocessor designers are continually seeking new ways to improve the floating point performance of x86-compatible microprocessors. While some x86 floating point instructions perform arithmetic (e.g., FADD which adds two floating point numbers), other floating point instructions perform logic functions. For example, the instruction FCOM performs a comparison of two real values. Other examples of x86 floating point instructions that perform compares are FTST (compares top of stack with zero) and FICOM (compare integer). Still other x86 floating point instructions perform control functions. For example, the instruction FSTSW stores the floating point unit's architectural status word to a specified destination (e.g., memory or the integer register AX).
One technique used by microprocessor designers to improve the performance of all floating point instructions is pipelining. In a pipelined microprocessor, the microprocessor begins executing a second instruction before the first has been completed. Thus, several instructions are in the pipeline simultaneously, each at a different processing stage. The pipeline is divided into a number of pipeline stages, and each stage can execute its operation concurrently with the other stages. When a stage completes an operation, it passes the result to the next stage in the pipeline and fetches the next operation from the preceding stage. The final results of each instruction emerge at the end of the pipeline in rapid succession.
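The benefit of overlapping instructions may be illustrated with a simple cycle count. The following Python sketch assumes an idealized pipeline in which every stage takes one cycle and no stalls occur; these assumptions, and the function names, are introduced here for illustration only.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Each instruction passes through every stage before the next begins."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int) -> int:
    """The first result appears after `stages` cycles, then one per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


# Four instructions through a six-stage pipeline.
serial = unpipelined_cycles(4, 6)
overlapped = pipelined_cycles(4, 6)
```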
Another popular technique used to improve floating point performance is out-of-order execution. Out-of-order execution involves reordering the instructions being executed (to the extent allowed by dependencies) so as to keep as many of the microprocessor's floating point execution units as busy as possible. As used herein, a microprocessor may have a number of execution units (also called functional units), each optimized to perform a particular task or set of tasks. For example, one execution unit may be optimized to perform integer addition, while another execution unit may be configured to perform floating point addition.
Typical pipeline stages in a modern microprocessor include fetching, decoding, address generation, scheduling, execution, and retiring. Fetching entails loading the instruction from the instruction cache. Decoding involves examining the fetched instruction to determine how large it is, whether or not it requires an access to memory to read data for execution, etc. Address generation involves calculating memory addresses for instructions that access memory. Scheduling involves the task of determining which instructions are available to be executed and then conveying those instructions and their associated data to the appropriate execution units. The execution stage actually executes the instructions based on information provided by the earlier stages. After the instruction is executed, the results produced are written back either to an internal register or the system memory during the retire stage.
While pipelining produces significant improvements in performance, it has some limitations. In particular, certain instructions in certain floating point implementations are unable to be scheduled until all previous instructions have completed execution and have been retired (i.e., committed to the processor's architectural state). One such instruction is FSTSW (floating point store status word). The FSTSW instruction is configured to access the floating point unit's architectural floating-point status word. As a result, the FSTSW instruction may be referred to as a “bottom executing” instruction because it is not scheduled for execution until all preceding instructions have been executed and retired. Furthermore, instructions occurring after the FSTSW instruction may not be scheduled until after the FSTSW instruction has been scheduled. These problems may be exacerbated when two FSTSW instructions occur near each other in the instruction stream.
Thus, an efficient method for rapidly executing FSTSW-type instructions is desired. In modern x86 floating point software, a significant percentage of FSTSW occurrences are immediately preceded by a floating point compare instruction, e.g., FCOM (floating point compare), FTST (compares top of stack with zero), or FICOM (compare integer instruction). Thus, an efficient method for rapidly executing FSTSW instructions when preceded by floating point compare instructions is particularly desirable.
The problems outlined above may at least in part be solved by a microprocessor configured to rapidly execute FSTSW-type instructions that are immediately preceded by FCOM-type instructions. The microprocessor may improve execution by adding a temporary destination register to FCOM-type instructions. As used herein, “FSTSW-type” instructions include all store status word variants (e.g., FSTSW, FNSTSW, etc.). In addition, “FCOM-type” instructions are used herein to mean any floating point instructions that perform a comparison operation. For example, FCOM (compare real), FCOMP (compare real), FCOMPP (compare real), FICOM (compare integer), FICOMP (compare integer), FTST (test), FUCOM (unordered compare real), FUCOMP (unordered compare real), FUCOMPP (unordered compare real), and FXAM (examine real) are all x86 floating point instructions that perform comparison operations.
Furthermore, the term “immediately follows” is used herein to mean that there is no intervening floating point instruction that can change the floating point unit's status word. For example, in the following code sequence, the FSTSW-type instruction is said to immediately follow the FCOM instruction:
FCOM [MEM]
DEC AX
FSTSW BX
The FSTSW-type instruction is said to immediately follow the FCOM [MEM] instruction because the DEC AX instruction will not be conveyed to the floating point unit and thus will have no effect on the floating point status word.
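The “immediately follows” test may be sketched in software as a scan over an instruction stream. The following Python sketch is an illustration only and does not describe the detection logic of any embodiment. It assumes, as simplifications introduced here, that any mnemonic beginning with “F” is a floating point instruction, that every floating point instruction other than an FCOM-type instruction may change the status word, and that an FSTSW-type instruction does not itself qualify a later FSTSW-type instruction. All function and variable names are hypothetical.

```python
FCOM_TYPE = {"FCOM", "FCOMP", "FCOMPP", "FICOM", "FICOMP",
             "FTST", "FUCOM", "FUCOMP", "FUCOMPP", "FXAM"}
FSTSW_TYPE = {"FSTSW", "FNSTSW"}


def is_floating_point(mnemonic: str) -> bool:
    # Assumption for this sketch: x87 mnemonics start with "F".
    return mnemonic.startswith("F")


def fast_store_indices(program):
    """Return indexes of FSTSW-type lines that immediately follow an FCOM-type line."""
    result = []
    last_fp_was_compare = False
    for index, line in enumerate(program):
        mnemonic = line.split()[0].upper()
        if not is_floating_point(mnemonic):
            continue  # integer instructions are not conveyed to the FPU
        if mnemonic in FSTSW_TYPE:
            if last_fp_was_compare:
                result.append(index)
            last_fp_was_compare = False  # conservative choice for this sketch
        else:
            last_fp_was_compare = mnemonic in FCOM_TYPE
    return result


example = ["FCOM [MEM]", "DEC AX", "FSTSW BX"]
candidates = fast_store_indices(example)
```

In the example sequence, the DEC AX instruction is skipped, so the FSTSW BX instruction at index 2 is reported as a candidate.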
In one embodiment, the FCOM-type instructions write their status flags to the floating point architectural status word register (their normal destination) and the temporary destination register. If an FSTSW-type instruction immediately follows the FCOM-type instruction, the FSTSW-type instruction is converted into a special instruction called an FSTSWEF instruction (i.e., a fast store status word instruction) that is configured to use the temporary register as a source in lieu of the architectural floating point status word register. In some configurations, the temporary register may store only the CC-bits (condition code bits) portion of the floating point status word. Advantageously, FSTSWEF instructions may be scheduled for execution as soon as the temporary storage register written to by the FCOM-type instruction becomes valid.
Generally speaking, in one embodiment a microprocessor configured to rapidly execute FSTSW-type instructions preceded by FCOM-type instructions will include an instruction cache configured to store instructions (both floating point and integer) and a floating point unit. The floating point unit is coupled to receive floating point instructions from the instruction cache. The floating point unit or the microprocessor may include a means for detecting FSTSW-type instructions (e.g., logic or a state machine) that immediately follow FCOM-type instructions. The floating point unit may also have a temporary storage register configured to store results from the floating point compare type instructions. The floating point unit may also have a means for translating FSTSW-type instructions into FSTSWEF instructions, e.g., a rename unit. The floating point unit may also include a scheduler configured to schedule FSTSW-type instructions to execute only after all older floating point instructions have been retired. The scheduler may similarly be configured to schedule FSTSWEF instructions to execute only after the temporary register (e.g., an Ftemp register) becomes valid.
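The two scheduling conditions described above may be contrasted in a short sketch. The following Python code is a software model for illustration only, not a description of the scheduler hardware. The FpOp record, the can_schedule function, and the ftemp_valid flag are hypothetical names, and dependency checks for all other instructions are omitted.

```python
from dataclasses import dataclass


@dataclass
class FpOp:
    mnemonic: str
    retired: bool = False


def can_schedule(op_index: int, window, ftemp_valid: bool) -> bool:
    """Decide whether window[op_index] may be scheduled (oldest op first)."""
    op = window[op_index]
    if op.mnemonic == "FSTSWEF":
        # Fast variant: wait only for the temporary register to become valid.
        return ftemp_valid
    if op.mnemonic in ("FSTSW", "FNSTSW"):
        # Bottom executing: wait for every older op to retire.
        return all(older.retired for older in window[:op_index])
    return True  # other ops: ordinary dependency checks omitted in this sketch


slow = [FpOp("FADD"), FpOp("FCOM", retired=True), FpOp("FSTSW")]
fast = [FpOp("FADD"), FpOp("FCOM"), FpOp("FSTSWEF")]
```

With the older FADD still in flight, the FSTSW in `slow` must wait, whereas the FSTSWEF in `fast` may be scheduled as soon as `ftemp_valid` is true.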
In some embodiments, the microprocessor may also include an architectural register and one or more execution pipelines. The architectural register is configured to store an architectural floating point status word (FPSW) for the floating point unit, and the execution pipelines are coupled to receive floating point instructions from the scheduler. The pipelines use the temporary register as a source for the condition code (CC) bits for any FSTSWEF instructions and the architectural register as a source for any FSTSW-type instructions. The top of stack (TOS) portion of the FPSW may be read from a third source. For example, the TOS may be copied from a register renaming unit into an unused field of the FSTSWEF instruction and then read from the field when needed. In some embodiments, FSTSWEF instructions may obtain the exception bits portion of the FPSW (i.e., bits 0 through 6 of the x86 FPSW) from the architectural FPSW register. Some implementations may assume these bits are valid.
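The three sources described above may be combined as in the following sketch. The bit positions used (condition code bits C0 through C2 at bits 8 through 10, C3 at bit 14, and the top of stack field at bits 11 through 13) follow the x86 floating point status word layout. The function name and parameter names are hypothetical. The text above does not say where bits 7 and 15 come from, so this sketch leaves them clear as an assumption.

```python
CC_MASK = 0x4700    # C3 (bit 14) and C2..C0 (bits 10..8)
TOS_SHIFT = 11      # top of stack field occupies bits 11..13
EXC_MASK = 0x007F   # exception bits portion, bits 0..6


def fstswef_result(ftemp_cc: int, tos_field: int, arch_fpsw: int) -> int:
    """Assemble the value stored by an FSTSWEF from its three sources."""
    cc_bits = ftemp_cc & CC_MASK              # from the temporary register
    tos_bits = (tos_field & 0x7) << TOS_SHIFT # from the instruction's field
    exc_bits = arch_fpsw & EXC_MASK           # from the architectural FPSW
    return cc_bits | tos_bits | exc_bits


# C3 set in the temporary register, TOS of 7, two exception bits pending.
word = fstswef_result(ftemp_cc=0x4000, tos_field=7, arch_fpsw=0x0021)
```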
Subsequent instructions that clear one or more of the sticky bits (i.e., bits 0 through 5 of the x86 FPSW) are executed in a serial fashion (i.e., after the FSTSWEF). However, subsequent instructions that set one or more of the sticky bits may be configured to cause a trap to a microcode routine (i.e., a trap handler). The trap may cause an abort which invalidates all instructions younger than the FSTSWEF, and the trap handler may then reinitiate execution beginning with the next sequential instruction.
In addition to the architectural registers and pipelines, the microprocessor may also include a memory (e.g., a ROM) configured to store a trap handling routine that is invoked when an instruction older than an FSTSWEF instruction changes one or more of the architectural status bits after the FSTSWEF instruction has been scheduled or executed.
A method for rapidly executing FSTSW instructions in a microprocessor is also contemplated. In one embodiment, the method comprises storing the results of FCOM-type instructions to a temporary destination register. In addition, FSTSW-type instructions that immediately follow FCOM-type instructions are transformed into FSTSWEF instructions that utilize the temporary destination register as a source register.
In some embodiments, FSTSW-type instructions are scheduled for execution only after all older floating point instructions have been retired. FSTSWEF instructions, however, may be scheduled after the temporary register is valid. The method may further comprise trapping to a microcode routine when any floating point instruction that is older than a speculatively executed FSTSWEF instruction completes execution (and alters one or more of the sticky bits in the FPSW) before the speculatively executed FSTSWEF instruction is retired.
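The trap condition described above may be expressed as a simple predicate. The following Python sketch is illustrative only. It treats bits 0 through 5 of the FPSW as the sticky bits and compares the status word before and after an older instruction completes. The function name and its parameters are hypothetical and do not describe the microcode routine itself.

```python
STICKY_MASK = 0x003F  # sticky exception bits, bits 0..5 of the x86 FPSW


def needs_trap(fpsw_before: int, fpsw_after: int,
               fstswef_executed: bool, fstswef_retired: bool) -> bool:
    """True when an older op altered sticky bits under a speculative FSTSWEF."""
    sticky_changed = ((fpsw_before ^ fpsw_after) & STICKY_MASK) != 0
    speculative = fstswef_executed and not fstswef_retired
    return speculative and sticky_changed


# An older instruction sets bit 0 after the FSTSWEF has already executed.
trap = needs_trap(0x0000, 0x0001, fstswef_executed=True, fstswef_retired=False)
```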
A computer system configured to rapidly execute FSTSW-type instructions immediately preceded by FCOM-type instructions is also contemplated. In one embodiment, the computer system may comprise a system memory, a communications device for transmitting and receiving data across a network, and one or more microprocessors coupled to the memory and the communications device. The microprocessors may advantageously be configured as described above.