1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to floating point units.
2. Description of the Related Art
Most microprocessors must support multiple data types. For example, x86-compatible microprocessors must execute two types of instructions: one set defined to operate on integer data types, and a second set defined to operate on floating point data types. In contrast with integers, floating point numbers have fractional components and are typically represented in exponent-significand format. For example, the values 2.15×10^3 and −10.5 are floating point numbers while the numbers −1, 0, and 7 are integers. The term “floating point” is derived from the fact that there is no fixed number of digits before or after the decimal point, i.e., the decimal point can float. Using the same number of bits, the floating point format can represent numbers within a much larger range than integer format. For example, a 32-bit signed integer can represent the integers between −2^31 and 2^31−1 (using two's complement format). In contrast, a 32-bit (“single precision”) floating point number as defined by the Institute of Electrical and Electronics Engineers (IEEE) Standard 754 has a range (in normalized format) from 2^−126 to 2^127×(2−2^−23) in both positive and negative numbers.
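By way of illustration only, the ranges quoted above may be checked with a short Python sketch. The constant names and the helper `float32_from_bits` are hypothetical and are not part of the described invention; the two bit patterns used are the smallest and largest normalized single precision encodings.

```python
import struct

# Range of a 32-bit two's complement integer.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

# Range of a normalized IEEE 754 single precision value (magnitude only).
FLOAT32_MIN_NORMAL = 2.0 ** -126
FLOAT32_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single precision value (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Smallest normalized value: biased exponent 1, fraction all zeros.
smallest = float32_from_bits(0x00800000)
# Largest finite value: biased exponent 254, fraction all ones.
largest = float32_from_bits(0x7F7FFFFF)
```

Under these assumptions, `smallest` and `largest` equal the two closed-form bounds, and `largest` exceeds `INT32_MAX` by many orders of magnitude even though both formats occupy 32 bits.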
FIG. 1 illustrates an exemplary format for an 8-bit integer 100. As the figure illustrates, negative integers are represented using the two's complement format 106. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant 104 of one is then added to the least significant bit (LSB).
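The negation procedure described above (invert all bits, then add one) may be sketched as follows for an 8-bit value. The function names are hypothetical and chosen only for this example.

```python
def negate_8bit(pattern: int) -> int:
    """Negate an 8-bit two's complement pattern: invert all bits, then add one."""
    ones_complement = ~pattern & 0xFF      # invert every bit, keep 8 bits
    return (ones_complement + 1) & 0xFF    # add one at the LSB, discard any carry out


def to_signed_8bit(pattern: int) -> int:
    """Interpret an 8-bit pattern as a signed two's complement integer."""
    return pattern - 256 if pattern & 0x80 else pattern
```

For instance, negating the pattern for 5 (0x05) yields 0xFB, which `to_signed_8bit` interprets as −5. Zero negates to itself, as does the most negative value 0x80, which has no positive counterpart in 8 bits.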
FIG. 2 shows an exemplary format for a floating point value. Value 110 is a 32-bit (single precision) floating point number. Value 110 is represented by a significand 112 (23 bits), a biased exponent 114 (8 bits), and a sign bit 116. The base for the floating point number (2 in this case) is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is most common. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. A number in this form is said to be “normalized”. In order to save space, in some formats the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number.
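As a minimal sketch of the single precision layout described above, the following Python code splits a value into its sign bit, biased exponent, and stored significand bits, and then rebuilds a normalized value from those fields. The function names are hypothetical; the exponent bias of 127 comes from IEEE Standard 754 and is not stated in the text above.

```python
import struct

EXPONENT_BIAS = 127  # single precision bias per IEEE 754 (not stated in the text above)


def decode_single(value: float):
    """Return (sign, biased_exponent, fraction) for a single precision value."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF             # 23 stored bits; integer bit is implied
    return sign, biased_exponent, fraction


def reconstruct_normalized(sign: int, biased_exponent: int, fraction: int) -> float:
    """Rebuild a normalized value, restoring the implied integer bit."""
    significand = 1 + fraction / 2 ** 23
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - EXPONENT_BIAS)
```

For example, −10.5 equals −1.3125×2^3, so it decodes to a sign of 1, a biased exponent of 130, and a stored fraction of 0x280000.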
Floating point values may also be represented in 64-bit (double precision) or 80-bit (extended precision) format. As with the single precision format, a double precision format value is represented by a significand (52 bits), a biased exponent (11 bits), and a sign bit. An extended precision format value is represented by a significand (64 bits), a biased exponent (15 bits), and a sign bit. However, unlike the other formats, the significand in extended precision includes an explicit integer bit. Additional information regarding floating point number formats may be obtained in IEEE Standard 754.
The recent increased demand for graphics-intensive applications (e.g., 3D games and virtual reality programs) has placed greater emphasis on a microprocessor's floating point performance. Given the vast amount of software available for x86 microprocessors, there is particularly high demand for x86-compatible microprocessors having high performance floating point units. Thus, microprocessor designers are continually seeking new ways to improve the floating point performance of x86-compatible microprocessors.
One technique used by microprocessor designers to improve the performance of all floating point instructions is pipelining. In a pipelined microprocessor, the microprocessor begins executing a second instruction before the first has been completed. Thus, several instructions are in the pipeline simultaneously, each at a different processing stage. The pipeline is divided into a number of pipeline stages, and each stage can execute its operation concurrently with the other stages. When a stage completes an operation, it passes the result to the next stage in the pipeline and fetches the next operation from the preceding stage. The final results of each instruction emerge at the end of the pipeline in rapid succession.
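The benefit of pipelining may be illustrated with a simplified cycle count. The sketch below assumes an idealized pipeline in which every stage takes exactly one clock cycle and no stalls occur; these assumptions and the function names are introduced only for illustration.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int) -> int:
    """The first result takes `stages` cycles; each later result follows one cycle apart."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


def stage_occupancy(instructions: int, stages: int):
    """table[cycle][stage] is the index of the instruction in that stage, or None."""
    table = []
    for cycle in range(pipelined_cycles(instructions, stages)):
        row = []
        for stage in range(stages):
            index = cycle - stage  # instruction i enters stage s at cycle i + s
            row.append(index if 0 <= index < instructions else None)
        table.append(row)
    return table
```

Under these assumptions, four instructions in a six-stage pipeline complete in 9 cycles rather than 24, and the occupancy table shows several instructions in flight at once, each in a different stage.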
Typical pipeline stages in a modern microprocessor include fetching, decoding, address generation, scheduling, execution, and retiring. Fetching entails loading the instruction from the instruction cache. Decoding involves examining the fetched instruction to determine how large it is, whether or not it requires an access to memory to read data for execution, etc. Address generation involves calculating memory addresses for instructions that access memory. Scheduling involves the task of determining which instructions are available to be executed and then conveying those instructions and their associated data to the appropriate execution units. The execution stage actually executes the instructions based on information provided by the earlier stages. After the instruction is executed, the results produced are written back either to an internal register or the system memory during the retire stage.
Yet another technique used to improve performance is out-of-order execution. Out-of-order execution involves reordering the instructions being executed (to the extent allowed by dependencies) so as to keep as many of the microprocessor's floating point execution units as busy as possible. As used herein, a microprocessor may have a number of execution units or pipelines (also called functional units/pipelines), each optimized to perform a particular task or set of tasks. For example, one execution unit may be optimized to perform integer addition, while another execution unit may be configured to perform floating point addition.
Another popular technique used to improve floating point performance is parallel execution. Parallel execution allows more than one instruction to be executed per clock cycle. This is accomplished by having multiple execution pipelines. For example, an addition instruction may be executed in an addition execution pipeline at the same time that a multiply instruction is executed in a multiply execution pipeline. Microprocessors and floating point units that support parallel execution and pipelining are often referred to as “superscalar” because they are able to execute more than one instruction per clock cycle.
One potential source of performance problems for superscalar floating point units that execute instructions out of order is the x86 instruction FLDCW (load floating point control word). FLDCW instructions load new settings into the floating point unit's control word. These settings are then used to determine how instructions following the FLDCW instruction are executed (e.g., which rounding mode to use and what precision the results will be in).
FIG. 3 shows a diagram of an x86 compatible floating point control word (FPCW) 344. Control bits 120-130 dictate whether certain exceptions are masked or not. When a particular type of exception is masked, the floating point unit will respond using automatic masked exception handling routines that are built into the floating point unit. These automatic handling routines typically generate the most reasonable result for each condition and are used in the majority of cases. If, however, the automatic handling routine is inadequate, the user may unmask the particular exception that is of interest and thereby cause the floating point unit to trap to a user-written exception handling routine.
For example, bit 120 is an invalid operation mask bit (IM) that controls whether invalid operation exceptions are masked. If the floating point unit detects an invalid operation (e.g., an instruction causes a floating point register stack overflow) and the IM bit is set, the exception is handled by the floating point unit, which stores a predetermined NaN (not-a-number) constant into the significand of the stack register that is overwritten as a result of the stack overflow (the register's tag is also set to indicate that an infinite value is stored therein).
Bit 122 is a denormalized operand mask bit (DM) that controls whether denormal operand exceptions are masked. Bit 124 is a divide by zero mask bit (ZM) that controls whether divide by zero exceptions are masked. Bit 126 is an overflow mask bit (OM) that controls whether overflow exceptions are masked. Bit 128 is an underflow mask bit (UM) that controls whether underflow exceptions are masked. Bit 130 is a precision mask bit (PM) that controls whether precision exceptions are masked.
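For illustration, a control word of this kind may be decoded as sketched below. The numerals 120-130 above are figure reference numerals, not bit positions. The positions used in the sketch (mask bits in bits 0 through 5, precision control in bits 8 and 9, rounding control in bits 10 and 11) follow the conventional x87 control word layout and are an assumption not stated in the text; the function and table names are hypothetical.

```python
# Assumed x87 layout: exception mask bits occupy bits 0-5 of the control word.
MASK_BITS = {"IM": 0, "DM": 1, "ZM": 2, "OM": 3, "UM": 4, "PM": 5}

# Assumed encodings of the precision control (bits 8-9) and rounding control (bits 10-11).
PRECISION = {0b00: "single", 0b01: "reserved", 0b10: "double", 0b11: "extended"}
ROUNDING = {0b00: "nearest", 0b01: "down", 0b10: "up", 0b11: "toward zero"}


def decode_fpcw(word: int) -> dict:
    """Split a 16-bit control word into mask flags, precision, and rounding mode."""
    masks = {name: bool((word >> bit) & 1) for name, bit in MASK_BITS.items()}
    return {
        "masks": masks,
        "precision": PRECISION[(word >> 8) & 0b11],
        "rounding": ROUNDING[(word >> 10) & 0b11],
    }
```

With this layout, the value 0x037F decodes to all six exceptions masked, extended precision, and round to nearest, whereas 0x0C7F selects single precision with rounding toward zero.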
The problem raised by FLDCW instructions in the context of an out-of-order floating point unit is that instructions occurring before the FLDCW in program order must execute using the previous or old values of the FPCW. Similarly, instructions occurring after the FLDCW instruction in program order must execute using the new value of the FPCW (as changed by the FLDCW). In non-pipelined in-order floating point units the FLDCW instruction does not present designers any difficulties (i.e., because the FLDCW instruction is executed before any instructions that occur after the FLDCW instruction in program order). However, in a pipelined out-of-order floating point unit, instructions occurring after the FLDCW may potentially be executed before the FLDCW and thereby incorrectly rely upon an old (incorrect) version of the FPCW.
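The hazard described above may be reproduced in a toy model. In the sketch below, the control word is reduced to a rounding mode, and `ROUND` is a hypothetical instruction that rounds its operand to an integer using the current mode; neither is an actual x86 instruction or encoding.

```python
import math

# Toy rounding modes standing in for the rounding control field of the FPCW.
ROUNDERS = {"nearest": round, "down": math.floor, "up": math.ceil}


def run(program, order):
    """Execute program entries in the given order; results are keyed by program index."""
    control_word = "nearest"
    results = {}
    for index in order:
        opcode, operand = program[index]
        if opcode == "FLDCW":
            control_word = operand  # load a new control word
        else:  # hypothetical "ROUND" instruction
            results[index] = ROUNDERS[control_word](operand)
    return results


program = [("ROUND", 2.7), ("FLDCW", "down"), ("ROUND", 2.7)]

in_order = run(program, [0, 1, 2])   # program order
reordered = run(program, [0, 2, 1])  # younger ROUND executes before the FLDCW
```

In program order the second `ROUND` sees the new mode and produces 2. When it is executed ahead of the FLDCW, it uses the old mode and produces 3, which is the incorrect result the text describes.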
One prior art solution to this problem has been to simply cause an abort (i.e., similar to a branch misprediction) whenever an FLDCW instruction is detected. In this situation, all speculatively generated results are discarded and the floating point unit rebuilds itself from the last known non-speculatively executed instruction. This solution seemed adequate to designers because FLDCW instructions were perceived as occurring relatively infrequently in modern code.
However, in some cases new compilers are using FLDCW instructions more frequently than previously expected. As a result, a more efficient method for dealing with FLDCW instructions in an out-of-order executing floating point unit is desired.
The problems outlined above may at least in part be solved by a microprocessor having a floating point unit (FPU) configured to schedule FLDCW-type instructions “in order” while still allowing other instructions to execute “out of order”. As used herein, FLDCW-type instructions include all floating point instructions that load specified values into a floating point unit's control word. Both x86 and non-x86 instructions may be included. Furthermore, as used herein the term “in order” refers to executing instructions in original program order, while “out of order” refers to executing instructions in a different order relative to their original program order.
Generally speaking, in one embodiment a floating point unit is contemplated that is configured to schedule instructions older than FLDCW-type instructions before any FLDCW-type instructions are scheduled. The FLDCW-type instructions may act as “barriers” to prevent later occurring instructions from executing before the FLDCW-type instructions. Indicator bits may be used to simplify instruction scheduling in accordance with this scheme.
In some embodiments, copies of the FPU's floating point control word may also be stored for later use by instructions that have long execution cycles. For example, if an instruction immediately preceding an FLDCW-type instruction requires eight clock cycles to execute, the FLDCW-type instruction may complete execution before the eight clock cycles have elapsed. Once the FLDCW-type instruction has completed, the eight clock cycle instruction would then incorrectly rely upon the newly updated control word. One solution is to delay the execution of the FLDCW-type instruction until the maximum possible instruction latency has elapsed. However, this may not provide the desired performance. Thus, an alternative solution is to store a copy of the old control word before the FLDCW-type instruction completes execution. This copy may be provided to any execution units executing long-latency instructions that began execution before the FLDCW-type instruction was executed.
A method and computer system configured to rapidly execute FLDCW-type instructions in an “out of order” context are also contemplated. In some embodiments, the method includes receiving a plurality of instructions, wherein at least one of the instructions is an FLDCW-type instruction. Instructions that are older than a first FLDCW-type instruction are selected for scheduling in an out-of-order fashion. The first FLDCW-type instruction itself is only scheduled once it is the oldest remaining instruction ready for execution. Finally, instructions occurring after the first FLDCW-type instruction in program order are scheduled (also in an out-of-order fashion) after the first FLDCW-type instruction has been scheduled.
In some embodiments, indicator bits may be associated with each instruction following an FLDCW-type instruction. Instructions with asserted indicator bits may be ignored during the scheduling process. Once the preceding FLDCW-type instruction is scheduled, the indicator bits may be cleared (until another FLDCW-type instruction is reached). The instructions with cleared indicator bits may then be considered during the scheduling determination. In some implementations, the method may include waiting one or more clock cycles before scheduling any instructions after the first FLDCW-type instruction has been scheduled. This may allow the FLDCW-type instruction to execute and update the floating point unit's speculative floating point control word (FPCW) before other instructions needing the updated FPCW are executed.
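The indicator bit scheme described above may be sketched in software as follows. This is a simplified model under stated assumptions, not the claimed hardware: all operands are assumed ready, one instruction is scheduled per step, the optional wait after an FLDCW-type instruction is omitted, and the scheduler deliberately picks the youngest eligible instruction to show that reordering remains possible. The names `Entry`, `fill_scheduler`, `clear_until_next_fldcw`, and `schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    is_fldcw: bool = False
    blocked: bool = False  # indicator bit: asserted while an older FLDCW-type is pending


def fill_scheduler(program):
    """Build entries in program order, asserting the indicator bit on every
    instruction that follows an FLDCW-type instruction."""
    entries, seen_fldcw = [], False
    for name in program:
        is_fldcw = name.startswith("FLDCW")
        entries.append(Entry(name, is_fldcw, blocked=seen_fldcw))
        seen_fldcw = seen_fldcw or is_fldcw
    return entries


def clear_until_next_fldcw(pending):
    """Clear indicator bits up to and including the next FLDCW-type entry."""
    for entry in pending:
        entry.blocked = False
        if entry.is_fldcw:
            break


def schedule(program):
    """Return instruction names in the order they are scheduled."""
    pending = fill_scheduler(program)  # oldest entry first
    issued = []
    while pending:
        if pending[0].is_fldcw:
            # An FLDCW-type instruction is scheduled only when it is the oldest remaining.
            issued.append(pending.pop(0).name)
            clear_until_next_fldcw(pending)
            continue
        # Entries with asserted indicator bits are ignored; FLDCW-type entries wait.
        eligible = [i for i, e in enumerate(pending) if not e.blocked and not e.is_fldcw]
        issued.append(pending.pop(eligible[-1]).name)  # youngest eligible entry
    return issued


order = schedule(["FADD", "FMUL", "FLDCW_1", "FSUB", "FDIV", "FLDCW_2", "FSQRT"])
```

In this model the instructions between two FLDCW-type instructions may be reordered among themselves, but none crosses an FLDCW-type instruction in either direction, so each FLDCW-type instruction acts as a barrier.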
As previously noted, a temporary copy of the current FPCW may also be stored for long latency instructions. For example, square root instructions are typically performed using a number of iterations. Thus, square root instructions may require a large number of clock cycles to complete execution. If an FLDCW-type instruction closely follows a square root instruction, the square root instruction may incorrectly perform its final iterations using the newly updated FPCW if a copy of the old FPCW is not retained.
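A minimal software sketch of retaining the old control word for long-latency instructions is given below. It assumes that the copy is captured at the moment an FLDCW-type instruction executes and is attached to every instruction still in flight; the class `ToyFpu` and its methods are hypothetical and do not describe any particular hardware implementation.

```python
class ToyFpu:
    """Toy model: in-flight instructions keep the control word that was
    current before any later FLDCW-type instruction executed."""

    def __init__(self, control_word: int):
        self.control_word = control_word
        self.in_flight = []

    def issue(self, name: str) -> dict:
        """Begin executing an instruction; no copy is needed yet."""
        op = {"name": name, "fpcw_copy": None}
        self.in_flight.append(op)
        return op

    def execute_fldcw(self, new_word: int) -> None:
        """Save the old control word for in-flight instructions, then update it."""
        for op in self.in_flight:
            if op["fpcw_copy"] is None:  # keep the earliest copy if one already exists
                op["fpcw_copy"] = self.control_word
        self.control_word = new_word

    def control_word_for(self, op: dict) -> int:
        """Control word an instruction should use for its remaining iterations."""
        return self.control_word if op["fpcw_copy"] is None else op["fpcw_copy"]


fpu = ToyFpu(0x037F)
sqrt_op = fpu.issue("FSQRT")  # long-latency instruction begins
fpu.execute_fldcw(0x0C7F)     # FLDCW-type instruction completes first
add_op = fpu.issue("FADD")    # issued after the control word update
```

Here the square root instruction continues to see 0x037F for its final iterations, while the later addition sees the updated value 0x0C7F.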
A microprocessor configured to rapidly execute FLDCW-type instructions is also contemplated. In some embodiments, the microprocessor may be configured with an instruction cache configured to receive and store a plurality of instructions. A subset of the instructions may be floating point and FLDCW-type instructions. The instruction cache may be coupled to a floating point unit configured to receive the floating point instructions from said instruction cache. The floating point unit may include a scheduler configured to receive, store, and schedule floating point instructions for execution. The scheduler may be configured to select instructions older than a pending FLDCW-type instruction for scheduling (in an out-of-order fashion). The scheduler may wait to schedule the FLDCW-type instruction until it is the oldest remaining instruction in the scheduler that is ready for execution. Once the FLDCW-type instruction is scheduled, the scheduler may then begin scheduling instructions occurring after the FLDCW-type instruction (also in an out-of-order fashion). As previously noted, the scheduler may utilize indicator bits to track which instructions may be considered for scheduling.
A computer system configured to rapidly execute FLDCW-type instructions in an out-of-order context is also contemplated. In one embodiment, the computer system may comprise a system memory, a communications device for transmitting and receiving data across a network, and one or more microprocessors coupled to the memory and the communications device. The microprocessors may advantageously be configured as described above.