1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to the execution of floating point compare operations in a floating point coprocessor.
2. Description of the Related Art
Most microprocessors must support multiple data types. For example, x86-compatible microprocessors must execute two types of instructions: one set defined to operate on integer data types, and a second set defined to operate on floating point data types. In contrast with integers, floating point numbers have fractional components and are typically represented in exponent-significand format. For example, the values 2.15×10^3 and −10.5 are floating point numbers while the numbers −1, 0, and 7 are integers. The term "floating point" is derived from the fact that there is no fixed number of digits before or after the decimal point, i.e., the decimal point can float. Using the same number of bits, the floating point format can represent numbers within a much larger range than integer format. For example, a 32-bit signed integer can represent the integers between −2^31 and 2^31−1 (using two's complement format). In contrast, a 32-bit ("single precision") floating point number as defined by the Institute of Electrical and Electronics Engineers (IEEE) Standard 754 has a range (in normalized format) from 2^−126 to 2^127×(2−2^−23) in both positive and negative numbers.
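The two ranges quoted above can be checked numerically. The following is a minimal sketch in Python, not part of the original disclosure; the constant and function names are hypothetical and chosen only for illustration.

```python
import struct

# Range of a 32-bit two's complement signed integer.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

# Smallest and largest positive normalized single precision magnitudes.
FLOAT32_MIN_NORMAL = 2.0 ** -126
FLOAT32_MAX = 2.0 ** 127 * (2 - 2.0 ** -23)


def roundtrips_as_float32(x):
    """Return True if x is unchanged after being packed into 4 bytes."""
    return struct.unpack("<f", struct.pack("<f", x))[0] == x


if __name__ == "__main__":
    print(INT32_MIN, INT32_MAX)
    print(FLOAT32_MIN_NORMAL, FLOAT32_MAX)
    print(roundtrips_as_float32(FLOAT32_MAX))
```

Both floating point limits survive the 4-byte round trip, which indicates that they are exactly representable in single precision.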
FIG. 1 illustrates an exemplary format for an 8-bit integer 100. As the figure illustrates, negative integers are represented using the two's complement format 106. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant 104 of one is then added to the least significant bit (LSB).
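The negation procedure described above (invert all bits, then add one) can be sketched as follows. This is an illustrative Python model of an 8-bit value; the function names are hypothetical.

```python
def negate_8bit(pattern):
    """Negate an 8-bit two's complement bit pattern."""
    ones_complement = ~pattern & 0xFF      # invert all eight bits
    return (ones_complement + 1) & 0xFF    # add one at the LSB, keep 8 bits


def to_signed_8bit(pattern):
    """Interpret an 8-bit pattern as a two's complement signed integer."""
    return pattern - 256 if pattern & 0x80 else pattern


if __name__ == "__main__":
    pattern = negate_8bit(5)
    print(format(pattern, "08b"), to_signed_8bit(pattern))
```

Note that the pattern 10000000 (the most negative value) negates to itself, since +128 is not representable in eight bits.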
FIG. 2 shows an exemplary format for a floating point value. Value 110 is a 32-bit (single precision) floating point number. Value 110 is represented by a significand 112 (23 bits), a biased exponent 114 (8 bits), and a sign bit 116. The base for the floating point number (2 in this case) is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is most common. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. A number in this form is said to be "normalized". In order to save space, in some formats the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number.
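The field layout described above can be illustrated by splitting a single precision value into its sign, biased exponent, and stored significand bits, and then rebuilding the value. This is a minimal Python sketch; the function names are hypothetical, and the exponent bias of 127 is taken from IEEE Standard 754 rather than from the text above.

```python
import struct


def decode_single(x):
    """Split x (as a 32-bit float) into (sign, biased exponent, fraction)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF             # 23 stored significand bits
    return sign, biased_exponent, fraction


def value_of_normalized(sign, biased_exponent, fraction):
    """Rebuild a normalized value, restoring the implied integer bit."""
    significand = 1 + fraction / 2 ** 23
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)


if __name__ == "__main__":
    fields = decode_single(-10.5)
    print(fields, value_of_normalized(*fields))
```

For −10.5 = −1.3125×2^3, the biased exponent is 3 + 127 = 130 and only the fractional part 0.3125 of the significand is stored.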
Floating point values may also be represented in 64-bit (double precision) or 80-bit (extended precision) format. As with the single precision format, a double precision format value is represented by a significand (52 bits), a biased exponent (11 bits), and a sign bit. An extended precision format value is represented by a significand (64 bits), a biased exponent (15 bits), and a sign bit. However, unlike the other formats, the significand in extended precision includes an explicit integer bit. Additional information regarding floating point number formats may be obtained in IEEE Standard 754.
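The three formats can be summarized in a small table and checked for consistency. The following Python sketch uses the field widths stated above; the dictionary layout and function names are hypothetical, and the bias formula 2^(e−1)−1 comes from IEEE Standard 754.

```python
# Field widths in bits for each format described above.
FORMATS = {
    "single": {"sign": 1, "exponent": 8, "significand": 23,
               "explicit_integer_bit": False},
    "double": {"sign": 1, "exponent": 11, "significand": 52,
               "explicit_integer_bit": False},
    "extended": {"sign": 1, "exponent": 15, "significand": 64,
                 "explicit_integer_bit": True},
}


def total_bits(name):
    """Storage width of the format."""
    fmt = FORMATS[name]
    return fmt["sign"] + fmt["exponent"] + fmt["significand"]


def precision_bits(name):
    """Significand precision, counting an implied integer bit if present."""
    fmt = FORMATS[name]
    implied = 0 if fmt["explicit_integer_bit"] else 1
    return fmt["significand"] + implied


def exponent_bias(name):
    """Bias added to the true exponent before it is stored."""
    return 2 ** (FORMATS[name]["exponent"] - 1) - 1


if __name__ == "__main__":
    for name in FORMATS:
        print(name, total_bits(name), precision_bits(name), exponent_bias(name))
```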
The recent increased demand for graphics-intensive applications (e.g., 3D games and virtual reality programs) has placed greater emphasis on a microprocessor's floating point performance. Given the vast amount of software available for x86 microprocessors, there is particularly high demand for x86-compatible microprocessors having high performance floating point units. Thus, microprocessor designers are continually seeking new ways to improve the floating point performance of x86-compatible microprocessors.
One technique used by microprocessor designers to improve the performance of all floating point instructions is pipelining. In a pipelined microprocessor, the microprocessor begins executing a second instruction before the first has been completed. Thus, several instructions are in the pipeline simultaneously, each at a different processing stage. The pipeline is divided into a number of pipeline stages, and each stage can execute its operation concurrently with the other stages. When a stage completes an operation, it passes the result to the next stage in the pipeline and fetches the next operation from the preceding stage. The final results of each instruction emerge at the end of the pipeline in rapid succession.
Another popular technique used to improve floating point performance is out-of-order execution. Out-of-order execution involves reordering the instructions being executed (to the extent allowed by dependencies) so as to keep as many of the microprocessor's floating point execution units as busy as possible. As used herein, a microprocessor may have a number of execution units (also called functional units), each optimized to perform a particular task or set of tasks. For example, one execution unit may be optimized to perform integer addition, while another execution unit may be configured to perform floating point addition.
Typical pipeline stages in a modern microprocessor include fetching, decoding, address generation, scheduling, execution, and retiring. Fetching entails loading the instruction from the instruction cache. Decoding involves examining the fetched instruction to determine how large it is, whether or not it requires an access to memory to read data for execution, etc. Address generation involves calculating memory addresses for instructions that access memory. Scheduling involves the task of determining which instructions are available to be executed and then conveying those instructions and their associated data to the appropriate execution units. The execution stage actually executes the instructions based on information provided by the earlier stages. After the instruction is executed, the results produced are written back either to an internal register or the system memory during the retire stage.
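The benefit of overlapping these stages can be illustrated with a toy cycle count. The Python sketch below is a simplification, assuming that every stage takes exactly one cycle and that no stalls occur; the names are hypothetical.

```python
STAGES = ["fetch", "decode", "address generation",
          "schedule", "execute", "retire"]


def unpipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Each instruction finishes all stages before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """A new instruction enters the pipeline every cycle (no stalls)."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def pipeline_occupancy(num_instructions):
    """For each cycle, list (instruction index, stage) pairs in flight."""
    timeline = []
    for cycle in range(pipelined_cycles(num_instructions)):
        active = [(i, STAGES[cycle - i])
                  for i in range(num_instructions)
                  if 0 <= cycle - i < len(STAGES)]
        timeline.append(active)
    return timeline


if __name__ == "__main__":
    print(unpipelined_cycles(10), pipelined_cycles(10))
    for cycle, active in enumerate(pipeline_occupancy(3)):
        print(cycle, active)
```

Under these assumptions, ten instructions need 60 cycles without pipelining and 15 cycles with it. A stall, such as the flag read-back delay discussed below, adds directly to the pipelined count.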
Yet another method used by designers to improve performance and simplify the design of the microprocessor is to logically separate the floating point portions of the microprocessor from the integer portions. In this configuration, the floating point portions of the microprocessor are referred to as a floating point coprocessor or floating point unit (FPU), even though it is typically implemented on the same silicon substrate as the microprocessor. If a floating point instruction is detected by the microprocessor, the instruction is handed off to the floating point coprocessor for execution. The coprocessor then executes the instruction independently from the rest of the microprocessor. Since the floating point coprocessor has its own set of registers, this technique works well for most floating point instructions. However, there are some floating point instructions that interface with the integer side of the microprocessor. For example, the instructions FCOMI, FCOMIP, FUCOMI, and FUCOMIP (collectively referred to herein as "FCOMI-type" instructions) perform floating point compare operations and then set certain integer flags (i.e., flags in the integer EFLAGS register). After executing FCOMI-type instructions, the coprocessor is configured to convey the results to the integer portions of the microprocessor for storage in the EFLAGS register. The extra step of conveying the results to the integer side for storage normally does not inhibit performance significantly.
However, in many cases FCOMI-type instructions are followed by FCMOV (floating point conditional move) type instructions. As used herein, FCMOV-type instructions include all floating point instructions that perform conditional moves based upon one or more integer flags. FCMOV-type instructions test the flags in the EFLAGS register and then perform a move operation if a specified test condition is true (e.g., if the zero flag is set). The FCMOV-type instruction is dependent upon the results of the FCOMI-type instruction, and therefore cannot be scheduled to execute until the desired flags are read back from the integer EFLAGS register.
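The dependency between the two instruction types can be modeled in a few lines. The Python sketch below is illustrative only: the flag encodings and FCMOV test conditions follow the published x86 instruction set definitions rather than the text above, and the function names are hypothetical.

```python
import math


def fcomi_flags(st0, src):
    """Model the ZF/PF/CF result of comparing ST(0) with a source operand."""
    if math.isnan(st0) or math.isnan(src):
        return {"ZF": 1, "PF": 1, "CF": 1}   # unordered
    if st0 > src:
        return {"ZF": 0, "PF": 0, "CF": 0}
    if st0 < src:
        return {"ZF": 0, "PF": 0, "CF": 1}
    return {"ZF": 1, "PF": 0, "CF": 0}       # equal


# Test condition evaluated by each FCMOV variant.
FCMOV_CONDITIONS = {
    "FCMOVB": lambda f: f["CF"] == 1,
    "FCMOVE": lambda f: f["ZF"] == 1,
    "FCMOVBE": lambda f: f["CF"] == 1 or f["ZF"] == 1,
    "FCMOVU": lambda f: f["PF"] == 1,
    "FCMOVNB": lambda f: f["CF"] == 0,
    "FCMOVNE": lambda f: f["ZF"] == 0,
    "FCMOVNBE": lambda f: f["CF"] == 0 and f["ZF"] == 0,
    "FCMOVNU": lambda f: f["PF"] == 0,
}


def fcmov(mnemonic, flags, st0, sti):
    """Return the new ST(0): ST(i) if the condition holds, else unchanged."""
    return sti if FCMOV_CONDITIONS[mnemonic](flags) else st0


if __name__ == "__main__":
    flags = fcomi_flags(1.0, 2.0)
    # Select the larger of the two values without a branch.
    print(flags, fcmov("FCMOVB", flags, 1.0, 2.0))
```

Because `fcmov` cannot be evaluated until `flags` is available, any delay in delivering the flags delays the conditional move.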
In some current coprocessor implementations, executing an FCOMI-type instruction followed by an FCMOV-type instruction creates a significant delay as the flags are sent to the integer side and then read back. For example, some coprocessor configurations are limited to passing values using the microprocessor's instruction cache or data cache. This involves the time-consuming process of clearing out a line in the cache, storing the information, signaling its availability to the coprocessor, and then reading the flags into the coprocessor. This delay in scheduling the FCMOV-type instruction may stall the coprocessor pipeline and may negatively impact performance.
Since modern compilers often generate x86 code that contains FCOMI-type/FCMOV-type instruction sequences to avoid branches, accelerating this code sequence is particularly desirable. Thus, an efficient system and method for rapidly executing FCOMI-type/FCMOV-type instruction sequences is needed.
The problems outlined above may at least in part be solved by a microprocessor configured to rapidly execute FCOMI-type instructions that are immediately followed by FCMOV-type instructions. The microprocessor may rapidly execute these instructions by utilizing a temporary floating point register to store the result flags from the FCOMI-type instructions. The temporary register is then used to provide the FCMOV-type instruction with the condition flags. Since the temporary register is local to the floating point coprocessor, in some embodiments this configuration may eliminate much of the delay associated with waiting for the result flags to be conveyed to the integer side and then reading them back again.
Depending upon the implementation, an FCMOV-type instruction "immediately" follows an FCOMI-type instruction if one of the following conditions is true:
(a) No integer instructions occur between the FCOMI-type instruction and the FCMOV-type instruction;
(b) No EFLAGS-altering integer instructions occur between the FCOMI-type instruction and the FCMOV-type instruction;
(c) No integer instructions that change the zero flag, parity flag, or carry flag occur between the FCOMI-type instruction and the FCMOV-type instruction (this is explained in greater detail below);
(d) No instructions that use the Ftemp register (explained in greater detail below) occur between the FCOMI-type instruction and the FCMOV-type instruction (this may be enforced by preventing Ftemp from being used by any instructions other than FCOMI-type instructions or FCMOV-type instructions); or
(e) Combinations of (b) and (d) or (c) and (d).
Some embodiments may only check for condition (a), while other embodiments may utilize more elaborate schemes to check for condition (d).
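Conditions (a) through (e) can be expressed as predicates over the instructions that lie between the FCOMI-type instruction and the FCMOV-type instruction. The Python sketch below is one possible software model, not a description of the hardware; the `Instr` record, its fields, and the policy names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    is_integer: bool = False
    flags_written: frozenset = frozenset()  # EFLAGS bits the instruction alters
    uses_ftemp: bool = False


COMPARE_FLAGS = frozenset({"ZF", "PF", "CF"})


def cond_a(between):
    return not any(i.is_integer for i in between)


def cond_b(between):
    return not any(i.is_integer and i.flags_written for i in between)


def cond_c(between):
    return not any(i.is_integer and i.flags_written & COMPARE_FLAGS
                   for i in between)


def cond_d(between):
    return not any(i.uses_ftemp for i in between)


POLICIES = {
    "a": cond_a,
    "b": cond_b,
    "c": cond_c,
    "d": cond_d,
    "b+d": lambda s: cond_b(s) and cond_d(s),  # condition (e), first form
    "c+d": lambda s: cond_c(s) and cond_d(s),  # condition (e), second form
}


def immediately_follows(between, policy):
    """Apply the chosen policy to the intervening instructions."""
    return POLICIES[policy](between)


if __name__ == "__main__":
    fsqrt = Instr("FSQRT")
    mov = Instr("MOV", is_integer=True)
    cld = Instr("CLD", is_integer=True, flags_written=frozenset({"DF"}))
    for between in ([fsqrt], [mov], [cld]):
        print(between[0].name,
              {p: immediately_follows(between, p) for p in POLICIES})
```

The policies differ in strictness: an intervening MOV fails only condition (a), while an intervening CLD (which alters only the direction flag) fails (a) and (b) but still satisfies (c).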
In some embodiments, an FCMOV-type instruction "immediately" follows an FCOMI-type instruction if there are no intervening integer instructions that can change the microprocessor's integer EFLAGS register. For example, in the following code sequence, the FCMOV-type instruction is said to immediately follow the FCOMI-type instruction:
FCOMI [MEM]
FSQRT
FCMOV BX
The FCMOV instruction is said to immediately follow the FCOMI [MEM] instruction because the FSQRT instruction will have no effect on the EFLAGS register. Additional conditions may also be used (e.g., no intervening integer instructions, regardless of whether they can change the EFLAGS register).
Generally speaking, in one embodiment a microprocessor configured to rapidly execute FCOMI-type instructions immediately followed by FCMOV-type instructions will include an instruction cache and a floating point unit. The instruction cache is configured to store both floating point instructions and integer instructions. The floating point unit is coupled to receive the floating point instructions from the instruction cache. The floating point unit may include a mechanism for detecting floating point conditional move (FCMOV) type instructions that immediately follow floating point compare (FCOMI) type instructions that rely on integer flags. The floating point unit may also include a temporary storage register configured to store results from the FCOMI-type instructions. The floating point unit may further include a mechanism for forcing the FCMOV-type instructions to use the temporary storage register as a source for flag information in lieu of the integer flags registers. Examples of such mechanisms include rename units (described in greater detail below), control logic, and functional units within the floating point coprocessor.
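The detection and source-forcing mechanisms described above can be pictured as a single pass over the instruction stream. The Python sketch below is a simplified software analogy that applies only the strictest rule (any intervening integer instruction breaks the pairing); the tuple layout, the instruction kinds, and the function name are hypothetical and do not describe an actual rename unit.

```python
def assign_flag_sources(program):
    """Choose a flag source for every FCMOV-type instruction.

    program is a list of (mnemonic, kind) tuples, where kind is one of
    "fcomi", "fcmov", "int", or "fp". Returns a dict mapping the index of
    each FCMOV-type instruction to "Ftemp" or "EFLAGS".
    """
    sources = {}
    ftemp_valid = False  # True while Ftemp holds the latest compare result
    for index, (mnemonic, kind) in enumerate(program):
        if kind == "fcomi":
            ftemp_valid = True
        elif kind == "int":
            ftemp_valid = False  # assumption: any integer instruction breaks the pairing
        elif kind == "fcmov":
            sources[index] = "Ftemp" if ftemp_valid else "EFLAGS"
    return sources


if __name__ == "__main__":
    program = [
        ("FCOMI", "fcomi"),
        ("FSQRT", "fp"),
        ("FCMOVB", "fcmov"),
        ("ADD", "int"),
        ("FCMOVE", "fcmov"),
    ]
    print(assign_flag_sources(program))
```

In this example the first conditional move is given the temporary register as its source, while the second falls back to the integer flags because an integer instruction intervenes.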
In some embodiments, the FCMOV-type instructions are configured to not use the temporary storage registers as a source for flag information if one or more integer instructions occur between the FCMOV-type instruction and the FCOMI-type instruction. As noted above, other possible considerations include: (i) whether any intervening instructions also use the temporary register, and (ii) whether any intervening integer instructions capable of altering the integer flags have occurred.
A method for rapidly executing FCOMI-type instructions immediately followed by FCMOV-type instructions in a microprocessor is also contemplated. Generally speaking, in one embodiment the method includes storing the results of FCOMI-type instructions to a temporary destination register, and then assigning the FCMOV-type instructions the temporary floating point register as a source. The results from the FCOMI-type instructions may be stored in parallel to both an integer flag register and the temporary floating point register. Depending upon the implementation, this may advantageously reduce the time traditionally needed for the write to and read-back from the integer flag register. As previously noted, a particular FCMOV-type instruction may be said to immediately follow a particular FCOMI-type instruction if: (i) there are no integer instructions between the particular FCMOV-type instruction and the particular FCOMI-type instruction, (ii) there are no instructions between the particular FCMOV-type instruction and the particular FCOMI-type instruction that have an ability to change the integer flag register, and (iii) there are no instructions between the particular FCMOV-type instruction and the particular FCOMI-type instruction that have an ability to change the temporary register. In other embodiments one or two of the above criteria may be selected in lieu of using all three.
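The method can be sketched as a small state model in which the compare result is written to both destinations at once and the conditional move reads whichever copy is valid. The Python class below is illustrative only; the class and method names are hypothetical, and invalidating the temporary copy on every integer instruction is an assumption corresponding to criterion (i).

```python
class ToyMachine:
    """Toy model of the integer flags and a coprocessor-local copy."""

    def __init__(self):
        self.eflags = {"ZF": 0, "PF": 0, "CF": 0}  # integer side
        self.ftemp = None                          # floating point side

    def fcomi(self, result_flags):
        # Store the compare result to both destinations in parallel.
        self.eflags = dict(result_flags)
        self.ftemp = dict(result_flags)

    def integer_instruction(self, new_flags=None):
        # Assumption: any integer instruction invalidates the local copy.
        if new_flags:
            self.eflags.update(new_flags)
        self.ftemp = None

    def fcmov(self, flag, st0, sti):
        """Move sti into ST(0) if the named flag is set; report the source."""
        if self.ftemp is not None:
            source, flags = "Ftemp", self.ftemp
        else:
            source, flags = "EFLAGS", self.eflags
        result = sti if flags[flag] == 1 else st0
        return source, result


if __name__ == "__main__":
    machine = ToyMachine()
    machine.fcomi({"ZF": 1, "PF": 0, "CF": 0})
    print(machine.fcmov("ZF", 1.0, 2.0))
    machine.integer_instruction({"ZF": 0})
    print(machine.fcmov("ZF", 1.0, 2.0))
```

The first conditional move is served from the local copy and never waits on the integer side; after the intervening integer instruction, the second must read the integer flags instead.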
A computer system configured to rapidly execute FCOMI-type instructions immediately followed by FCMOV-type instructions is also contemplated. In one embodiment, the computer system may comprise a system memory, a communications device for transmitting and receiving data across a network, and one or more microprocessors coupled to the memory and the communications device. The microprocessors may advantageously be configured as described above.