1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to floating point units within microprocessors.
2. Description of the Related Art
Most microprocessors must support multiple data types. For example, x86-compatible microprocessors must execute two types of instructions: one set defined to operate on integer data types and another set defined to operate on floating point data types. In contrast with integers, floating point numbers have fractional components and are typically represented in exponent-significand format. For example, the values 2.15 and −10.5 are floating point numbers while the numbers −1, 0, and 7 are integers. The term "floating point" is derived from the fact that there is no fixed number of digits before and after the decimal point, i.e., the decimal point can float. Using the same number of bits, the floating point format can represent numbers within a much larger range than integer format. For example, a 32-bit signed integer can represent the integers between $-2^{31}$ and $2^{31}-1$ (using two's complement format). In contrast, a 32-bit ("single precision") floating point number as defined by the Institute of Electrical and Electronics Engineers (IEEE) Standard 754 has a range (in normalized format) from $2^{-126}$ to $2^{127}\times(2-2^{-23})$ for both positive and negative numbers.
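The range figures above can be checked numerically. The following Python sketch is offered only as an illustration: it computes the two's complement range of a 32-bit integer and reinterprets the bit patterns of the smallest and largest normalized single precision values using the standard `struct` module. The helper names `int_range` and `single_from_bits` are hypothetical.

```python
import struct

def int_range(bits: int) -> tuple[int, int]:
    """Range of a two's complement signed integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def single_from_bits(pattern: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack(">f", pattern.to_bytes(4, "big"))[0]

# Smallest positive normal: biased exponent 1, fraction all zeros.
min_normal = single_from_bits(0x00800000)
# Largest normal: biased exponent 254, fraction all ones.
max_normal = single_from_bits(0x7F7FFFFF)

print(int_range(32))                                # (-2147483648, 2147483647)
print(min_normal == 2.0 ** -126)                    # True
print(max_normal == 2.0 ** 127 * (2 - 2.0 ** -23))  # True
```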
FIG. 1 illustrates an exemplary format for an 8-bit integer 100. As the figure illustrates, negative integers are represented using the two's complement format 106. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant 104 of one is then added to the least significant bit (LSB).
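The two-step negation described for FIG. 1 (invert, then add one at the LSB) can be sketched in Python as follows. This is an illustration only; the 8-bit width is taken from the figure, and the function names are hypothetical.

```python
def negate8(value: int) -> int:
    """Negate an 8-bit two's complement bit pattern: invert, then add one."""
    ones_complement = ~value & 0xFF          # invert all bits (one's complement)
    return (ones_complement + 1) & 0xFF      # add one at the LSB, discard carry out

def to_signed8(pattern: int) -> int:
    """Interpret an 8-bit pattern as a signed two's complement integer."""
    return pattern - 0x100 if pattern & 0x80 else pattern

five = 0b00000101
minus_five = negate8(five)
print(f"{minus_five:08b}", to_signed8(minus_five))  # 11111011 -5
```

Masking with `0xFF` models the fixed 8-bit width; note that negating the most negative value (0x80, i.e., −128) yields 0x80 again, as in hardware.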
FIG. 2 shows an exemplary format for a floating point value. Value 110 is a 32-bit (single precision) floating point number. Value 110 is represented by a significand 112 (23 bits), a biased exponent 114 (8 bits), and a sign bit 116. The base for the floating point number (2 in this case) is raised to the power of the exponent (after the bias is removed) and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is most common. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. A number in this form is said to be "normalized". In order to save space, in some formats the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number.
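As an illustration of the single precision layout of FIG. 2, the following Python sketch splits a value into its sign, biased exponent, and 23-bit stored fraction, then rebuilds a normalized value by restoring the implied integer bit and removing the bias of 127. The function names are hypothetical, and the rebuild applies to normalized values only.

```python
import struct

def decode_single(x: float) -> tuple[int, int, int]:
    """Split a single precision value into (sign, biased exponent, 23-bit fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exp, fraction

def value_of_normal(sign: int, biased_exp: int, fraction: int) -> float:
    """Rebuild a normalized value: restore the implied integer bit, remove the bias."""
    significand = 1 + fraction / 2**23        # 1.fraction (integer bit is implied)
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 127)

fields = decode_single(-10.5)
print(fields)                     # (1, 130, 2621440)
print(value_of_normal(*fields))   # -10.5
```

Here −10.5 is $-1.0101_2 \times 2^{3}$, so the biased exponent is $3 + 127 = 130$ and the stored fraction holds only the bits after the radix point.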
Floating point values may also be represented in 64-bit (double precision) or 80-bit (extended precision) format. As with the single precision format, a double precision format value is represented by a significand (52 bits), a biased exponent (11 bits), and a sign bit. An extended precision format value is represented by a significand (64 bits), a biased exponent (15 bits), and a sign bit. However, unlike the other formats, the significand in extended precision includes an explicit integer bit. Additional information regarding floating point number formats may be obtained in IEEE Standard 754.
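The field widths of the three formats can be collected in a small table. The Python sketch below is illustrative only; it derives each format's total width, its exponent bias using the IEEE 754 rule $2^{k-1}-1$ for a $k$-bit exponent, and its precision in significant bits, which counts the integer bit whether it is implied (single, double) or explicit (extended).

```python
# Field widths (bits) for the three x86 floating point memory formats.
FORMATS = {
    # name: (stored significand bits, exponent bits, explicit integer bit?)
    "single":   (23, 8, False),
    "double":   (52, 11, False),
    "extended": (64, 15, True),
}

def describe(name: str) -> dict:
    sig_bits, exp_bits, explicit = FORMATS[name]
    return {
        "total_bits": 1 + exp_bits + sig_bits,            # sign + exponent + significand
        "bias": (1 << (exp_bits - 1)) - 1,                # IEEE 754 exponent bias
        "precision": sig_bits + (0 if explicit else 1),   # significant bits incl. integer bit
    }

for name in FORMATS:
    print(name, describe(name))
```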
When a numeric value approaches zero, normalized floating-point format may not be able to express the value accurately. To accommodate these instances, x86-compatible microprocessors support a "denormal" format in which the significand contains one or more leading zeros. Denormal values have biased exponents fixed at their smallest possible value (i.e., zero). The leading zeros of denormals permit smaller numbers to be represented.
FIG. 3 shows a denormal value 130 in single precision format. As the figure illustrates, denormal values have a biased exponent 134 equal to zero and a non-zero significand 132. Denormals may be positive or negative (as indicated by sign bit 136).
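A brief Python illustration of how a single precision denormal such as value 130 is interpreted: there is no implied integer bit, and a biased exponent of zero is read as an unbiased exponent of −126 (the same as a biased exponent of one), so the value is $(-1)^s \times 0.f \times 2^{-126}$. The helper names are hypothetical.

```python
import struct

def single_from_bits(pattern: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack(">f", pattern.to_bytes(4, "big"))[0]

def value_of_denormal(bits: int) -> float:
    """Value of a single precision denormal: no implied integer bit, exponent fixed at -126."""
    if (bits >> 23) & 0xFF != 0:
        raise ValueError("denormals have a biased exponent of zero")
    sign = bits >> 31
    fraction = bits & 0x7FFFFF               # the leading zeros live here
    return (-1) ** sign * (fraction / 2**23) * 2.0 ** -126

smallest = 0x00000001                        # only the least significant fraction bit set
print(value_of_denormal(smallest) == 2.0 ** -149)                 # True
print(value_of_denormal(smallest) == single_from_bits(smallest))  # True
```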
Microprocessors that are x86 compatible and support floating point instructions must be able to load, store, and operate on denormalized real numbers. This presents several problems for microprocessor designers. One problem in particular relates to loading and manipulating the denormal value in the floating point unit. To improve performance, microprocessors are typically designed with a number of "execution units" that are each optimized to perform a particular set of functions or instructions on a particular data type. For example, one or more execution units within a microprocessor may be optimized to perform arithmetic functions on integer values, while a second set of execution units may be optimized to perform arithmetic functions on floating point values. These floating point execution units (combined with their supporting control logic) may be collectively referred to as the microprocessor's "floating point unit".
Most floating point units translate floating point numbers into a processor-specific internal format before the numbers are operated upon. Using one format for all internal floating point calculations advantageously reduces the complexity of the floating point unit's execution units.
FIG. 4 shows one possible internal floating point format 170 comprising a 68-bit significand 172, an 18-bit biased exponent 174, and a sign bit 176. The use of a single internal floating point format tends to simplify the hardware used to implement the floating point unit. For example, instead of having to process three different formats (i.e., single-precision, double-precision, and extended precision), the floating point processor may translate all floating point values into extended precision format or an internal format. Once the desired operations have been performed, the results are then translated back to the desired format.
The problem denormal values pose to designers relates to translating denormals into this internal format. Normal values may be translated by simply shifting in constant zeros and adjusting the exponent. This conversion process may be performed in a single clock cycle. With denormals, however, the conversion process takes longer because the number must be normalized after the constants are shifted in. For example, in some microprocessors at least two clock cycles are needed to convert the denormal to a normalized internal format.
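The difference between the two conversion paths can be sketched in Python. This is a minimal illustration, not the circuit described here: it assumes the 68-bit significand of FIG. 4 holds an explicit integer bit in its most significant position, and it assumes a hypothetical bias of 2^17 − 1 for the 18-bit exponent, which the text does not specify. Zero, infinity, and NaN are out of scope, and the returned cycle counts are simply the illustrative one-cycle and two-cycle figures from the text.

```python
SIG_BITS = 68                    # internal significand width; integer bit at bit 67 (assumed)
INTERNAL_BIAS = (1 << 17) - 1    # hypothetical bias for the 18-bit internal exponent
SHIFT = (SIG_BITS - 1) - 23      # constant zeros shifted in below a 23-bit fraction

def to_internal(bits: int) -> tuple[int, int, int, int]:
    """Convert a finite, nonzero single precision pattern to (sign, exp, significand, cycles)."""
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if biased == 0xFF or (biased == 0 and fraction == 0):
        raise ValueError("zero, infinity and NaN are not handled in this sketch")
    if biased != 0:
        # Normal: restore the integer bit, shift in constant zeros, rebias the exponent.
        significand = ((1 << 23) | fraction) << SHIFT
        return sign, biased - 127 + INTERNAL_BIAS, significand, 1
    # Denormal: same fixed shift, but the integer bit is zero, so an extra
    # normalization step (leading-zero count plus variable shift) is required.
    significand = fraction << SHIFT
    leading_zeros = SIG_BITS - significand.bit_length()
    return sign, -126 - leading_zeros + INTERNAL_BIAS, significand << leading_zeros, 2

def internal_value(sign: int, exp: int, significand: int, _cycles: int) -> float:
    """Numeric value of an internal-format tuple (for checking the conversion)."""
    return (-1) ** sign * (significand / 2 ** (SIG_BITS - 1)) * 2.0 ** (exp - INTERNAL_BIAS)

print(to_internal(0x3F800000)[3], to_internal(0x00000001)[3])   # 1 2
print(internal_value(*to_internal(0x00000001)) == 2.0 ** -149)  # True
```

The normal path uses only a fixed shift, while the denormal path depends on a data-dependent leading-zero count before its shift, which is why it tends to need the additional cycle.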
Since the number of clock cycles needed to process normals and denormals varies, designers are left with a quandary. The designers can make all loads take two clock cycles, but this is undesirable because normal loads are more common than denormal loads. Thus, overall microprocessor performance may suffer due to the unnecessary additional latency incorporated into normal loads.
Another alternative that has been used by designers is to detect the denormal, stall the pipeline, and then trap to microcode to convert the denormal. Yet another alternative is to tag the denormal and then convert it later when it reaches an execution unit. However, these solutions are slow (i.e., the original instruction may need to be re-executed after the denormal is converted) and may reduce the throughput of floating point operations when even a few denormal loads occur. Thus, an efficient method for rapidly handling denormal loads is desired.
The problems outlined above may at least in part be solved by a microprocessor configured to dynamically switch its floating point load pipeline length from one stage in length to more than one stage in length. In one embodiment, the microprocessor may accomplish this by performing normal loads and detecting denormal loads in a single clock cycle. The microprocessor may temporarily store each floating point instruction in a reissue buffer for at least one clock cycle in anticipation of a denormal load. When a denormal load is detected, the microprocessor is configured to add one or more stages to the floating point load pipeline (e.g., adding a normalization stage to the conversion stage) to allow the denormal value to complete the conversion to internal format. The longer pipeline will then be used for all loads that follow the denormal load until there is a clock cycle without a load (e.g., an idle clock cycle or a clock cycle in which an abort occurs). At that point, the pipeline reverts to its original single-stage format. In addition, the microprocessor may be configured to cancel any recently scheduled instructions (e.g., those that were scheduled assuming the denormal load would take only one clock cycle to complete). The canceled instructions are then "replayed" (i.e., rescheduled) during a later clock cycle from the reissue buffer. Advantageously, this configuration allows the common case of a normal load to be performed using a short pipeline, while still providing proper handling of the less frequent case of a denormal load.
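The pipeline-length policy summarized above behaves like a two-state machine. The Python sketch below is an illustration under simplifying assumptions: each clock cycle carries one hypothetical event label ("normal", "denormal", "idle", or "abort"), the short and long pipelines are one and two stages, and the reissue buffer and cancellation are not modeled.

```python
from enum import Enum
from typing import List, Optional

class Mode(Enum):
    SHORT = 1   # single-stage load/convert pipeline
    LONG = 2    # conversion stage plus a normalization stage

def load_latencies(events: List[str]) -> List[Optional[int]]:
    """Per-cycle load latency under the dynamic pipeline-length policy.

    events: one entry per clock cycle, each "normal", "denormal", "idle" or "abort".
    Returns the latency (in cycles) of the load performed that cycle, or None if none.
    """
    mode = Mode.SHORT
    latencies: List[Optional[int]] = []
    for event in events:
        if event in ("idle", "abort"):
            mode = Mode.SHORT              # a cycle without a load restores the short pipeline
            latencies.append(None)
        elif event == "denormal":
            mode = Mode.LONG               # add the normalization stage...
            latencies.append(mode.value)
        elif event == "normal":
            latencies.append(mode.value)   # ...and keep it for every load that follows
        else:
            raise ValueError(f"unknown event {event!r}")
    return latencies

trace = ["normal", "denormal", "normal", "normal", "idle", "normal"]
print(load_latencies(trace))  # [1, 2, 2, 2, None, 1]
```

The normal loads immediately after the denormal keep the longer latency even though they would fit in one stage; staying in one mode until an idle or abort cycle avoids switching the pipeline length while loads are still in flight.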
In one embodiment, the microprocessor may be configured with a floating point classification unit, a floating point conversion unit, and a reissue buffer. The classification unit is configured to receive floating point data from floating point load operations and then determine the floating point data's type. For example, the classification unit may determine whether the floating point data is normal or denormal. The classification unit may also be configured to assert a denormal control signal if the floating point data is denormal.
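A classification unit of this kind can be approximated in software by inspecting the exponent and fraction fields. The Python sketch below is illustrative only; it handles the single and double precision formats (which have an implied integer bit) and returns both the data type and a boolean standing in for the denormal control signal. Extended precision, with its explicit integer bit, would need additional cases and is not covered.

```python
def classify(bits: int, exp_bits: int, frac_bits: int) -> tuple[str, bool]:
    """Classify an IEEE 754 single/double pattern; also return the denormal signal."""
    biased = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    if biased == 0:
        kind = "zero" if fraction == 0 else "denormal"
    elif biased == max_exp:
        kind = "infinity" if fraction == 0 else "nan"
    else:
        kind = "normal"
    return kind, kind == "denormal"   # second element models the denormal control signal

SINGLE = (8, 23)
DOUBLE = (11, 52)
print(classify(0x00000001, *SINGLE))          # ('denormal', True)
print(classify(0x3F800000, *SINGLE))          # ('normal', False)
print(classify(0x000FFFFFFFFFFFFF, *DOUBLE))  # ('denormal', True)
```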
The conversion unit is configured to receive floating point data that was read from memory as the result of floating point load instructions. The conversion unit is then configured to convert the floating point data from the format in which it was originally stored in memory into a predetermined internal format. For example, the predetermined format may be extended precision or a processor-internal format having additional bits allocated for the significand and the exponent. Advantageously, the internal format may allow the representation of denormal values in a normalized form. The conversion unit is configured to convert the floating point data to the predetermined format in a first number of clock cycles (e.g., one clock cycle) if the floating point data is normal. If the floating point data is denormal, however, the conversion unit is configured to use a second, larger number of clock cycles (e.g., two clock cycles) to allow for normalization of the denormal value.
The reissue buffer is configured to store floating point instructions as they are scheduled for execution. For example, if three instructions are scheduled for execution in a particular cycle, those three instructions are stored in the reissue buffer. The instructions are stored for at least one clock cycle. Upon receiving the asserted denormal control signal, the reissue buffer may "replay" or reschedule the stored instructions during a subsequent clock cycle. The asserted denormal control signal may also serve as a cancel signal to prevent the originally scheduled instructions from completing or from storing their results.
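The reissue buffer's behavior can be sketched as a short queue keyed by scheduling cycle. This Python model is an illustration only; the `depth` parameter (how many cycles entries are held), the instruction labels, and the method names are all hypothetical.

```python
from collections import deque

class ReissueBuffer:
    """Holds scheduled instructions for `depth` cycles so they can be replayed."""

    def __init__(self, depth: int = 1):
        self.depth = depth
        self.entries = deque()  # (cycle, [instructions]) in scheduling order

    def record(self, cycle: int, instructions: list) -> None:
        """Store this cycle's scheduled instructions; age out entries older than depth."""
        self.entries.append((cycle, list(instructions)))
        while self.entries and self.entries[0][0] < cycle - self.depth:
            self.entries.popleft()

    def replay(self) -> list:
        """Denormal signal asserted: hand back buffered instructions for rescheduling."""
        replayed = [insn for _, group in self.entries for insn in group]
        self.entries.clear()
        return replayed

buf = ReissueBuffer(depth=1)
buf.record(0, ["I1", "I2", "I3"])   # three instructions scheduled in cycle 0
buf.record(1, ["I4"])               # cycle 0 entries are still held
buf.record(2, ["I5"])               # cycle 0 entries age out
print(buf.replay())                 # ['I4', 'I5']
```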
In another embodiment, the microprocessor may further comprise a scheduling unit capable of scheduling instructions in an out-of-order fashion. The scheduling unit may be configured to schedule floating point instructions for execution (once the instructions' operands are ready) assuming the conversion unit will perform conversions in the first (smaller) number of clock cycles. In some embodiments, the scheduling unit may be configured to cancel one or more recently scheduled instructions upon receiving an asserted denormal control signal. Canceling the instructions is desirable because they were scheduled assuming that the conversion of data (upon which they may depend) would be completed in the first number of clock cycles. The scheduling unit may then replay the canceled instructions (using the information stored in the reissue buffer) during a subsequent clock cycle when their corresponding floating point data is actually available in the normalized internal format.
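The cancel-and-replay timing for a single dependent instruction can be illustrated with a toy timeline. The Python sketch below assumes, for illustration only, that the dependent instruction is speculatively issued one cycle after its load (the first number of cycles), that the denormal signal cancels it in that same cycle, and that it is replayed once the two-cycle conversion completes; the names are hypothetical and real pipeline timing may differ.

```python
SHORT_LATENCY = 1   # conversion cycles assumed when scheduling (first number of cycles)
LONG_LATENCY = 2    # conversion cycles actually needed for denormal data

def dependent_issue(load_cycle: int, denormal: bool) -> list[tuple[int, str]]:
    """Timeline of events for one instruction that consumes a floating point load."""
    # The scheduler optimistically assumes the short conversion latency.
    timeline = [(load_cycle + SHORT_LATENCY, "issue")]
    if denormal:
        # Denormal control signal asserted: the speculative issue is canceled and
        # the instruction is replayed from the reissue buffer when data is ready.
        timeline.append((load_cycle + SHORT_LATENCY, "cancel"))
        timeline.append((load_cycle + LONG_LATENCY, "replay"))
    return timeline

print(dependent_issue(10, denormal=False))  # [(11, 'issue')]
print(dependent_issue(10, denormal=True))   # [(11, 'issue'), (11, 'cancel'), (12, 'replay')]
```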
In the conversion unit, the number of clock cycles used may correspond to the number of pipeline stages used to perform the conversion process. For example, the conversion unit may be configured to employ a first number of pipeline stages to convert normal values and a second greater number of pipeline stages to convert denormal values. However, once a denormal has been converted, the conversion unit may be configured to continue to use the second larger number of clock cycles (or pipeline stages) until the conversion unit has an idle clock cycle or receives an abort signal. The conversion unit may then reset itself to use the first number of clock cycles (or pipeline stages). An idle clock cycle may occur when no load data is received by the conversion unit. An abort signal may be received if the microprocessor detects a branch misprediction.
A method for loading denormal floating point values into a microprocessor is also contemplated. In one embodiment, the method comprises reading floating point data from a data bus and then classifying the floating point data as denormal or normal (or another data type, e.g., MMX). If the floating point data is normal, then it is converted to a predetermined normalized internal format in a first number of clock cycles. If, on the other hand, the floating point data is denormal, then the data is converted to the predetermined normalized internal format in a second, larger number of clock cycles to allow extra time for normalization. Once a denormal is converted to internal format, however, all subsequent floating point data conversions are then performed using the second, larger number of clock cycles. Thus, if a normal floating point value immediately follows a denormal floating point value, then for scheduling purposes the normal floating point value will be available after the second number of clock cycles (even though the conversion unit may only need the first number of clock cycles to convert the value). In this way the conversion unit switches from a short (e.g., one stage) pipeline to a longer (e.g., two stage) pipeline upon detecting a denormal value. The conversion unit then continues to use the longer pipeline until an idle cycle or an abort is received (at which time it resets itself to the shorter pipeline).
The method may further comprise scheduling floating point instructions for execution assuming that all floating point loads will be converted into internal format in the first (smaller) number of clock cycles. Once a denormal format value is detected, however, scheduling is performed assuming that all floating point loads will be converted into internal format in the second (larger) number of clock cycles. After an idle clock cycle or abort, scheduling once again resumes the assumption that floating point loads will be converted into internal format in the first (smaller) number of clock cycles.
The method may further comprise: (a) temporarily storing instructions as they are scheduled for execution; (b) canceling one or more instructions scheduled for execution once a denormal load is detected or classified; and (c) replaying the canceled instructions one or more clock cycles after the instructions were canceled.
A computer system configured to efficiently perform denormal loads is also contemplated. In one embodiment the computer system may comprise a system memory, a communications device for transmitting and receiving data across a network, and one or more microprocessors coupled to the memory and the communications device. The microprocessors may advantageously be configured as described above.