Numbers may be represented within a computer system in a variety of ways. In an extended integer format, for example, a 64-bit register may store numbers ranging from 0 to 2^64 − 1 (the same size register may also store signed numbers by giving up one order of magnitude in range). This format is limiting, however, because it is incapable of representing numbers that are not integers (i.e., numbers that do not have a binary point to the right of the least significant bit in the register).
To accommodate non-integer numbers, a fixed point representation may be used. In this form of representation, the binary point is considered to be somewhere other than to the right of the least significant bit. For example, a 64-bit register may be used to store values from 0 (inclusive) to 2 (exclusive) by processing register values as though the binary point is located to the right of the most significant register bit. Such a representation allows, in the example, 63 register bits to represent fractional values. In other implementations, one bit may be used as a sign bit so that a register can store values between −2 and +2.
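The fixed point interpretation described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular implementation: the names FRACTION_BITS, fixed_to_value, and value_to_fixed are hypothetical, and truncation is assumed when a value does not fit exactly in 63 fraction bits.

```python
from fractions import Fraction

# Assumption: unsigned 64-bit register, binary point just right of the
# most significant bit, leaving 63 bits for the fraction.
FRACTION_BITS = 63


def fixed_to_value(raw: int) -> Fraction:
    """Interpret a 64-bit register value as a number in [0, 2)."""
    if not 0 <= raw < (1 << 64):
        raise ValueError("raw must fit in 64 bits")
    return Fraction(raw, 1 << FRACTION_BITS)


def value_to_fixed(value: Fraction) -> int:
    """Encode a value in [0, 2), truncating bits below the 63rd fraction bit."""
    if not 0 <= value < 2:
        raise ValueError("value must lie in [0, 2)")
    return int(value * (1 << FRACTION_BITS))
```

Under these assumptions, the register value with only the most significant bit set represents exactly 1, and the all-ones register value represents the largest value below 2.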
Because the binary point is fixed within a register or storage location during fixed point arithmetic operations, numbers with differing orders of magnitude may not be represented with equal precision without scaling. For example, it is not possible to represent both 1101b (13 in decimal) and 0.1101b (0.8125 in decimal) using the same fixed point representation. While fixed point representation schemes are still quite useful, many applications require a larger dynamic range (i.e., the ratio of the largest number representation to the smallest, non-zero, number representation in a given format).
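The limitation can be made concrete with a small check. The sketch below assumes, purely for illustration, a 4-bit unsigned register (the text does not state a width); the helper names fits and fixed_point_dynamic_range are hypothetical.

```python
from fractions import Fraction


def fits(value: Fraction, total_bits: int, fraction_bits: int) -> bool:
    """True if value is exactly representable as unsigned fixed point."""
    scaled = value * (1 << fraction_bits)
    return scaled.denominator == 1 and 0 <= scaled < (1 << total_bits)


def fixed_point_dynamic_range(total_bits: int) -> int:
    """Ratio of the largest to the smallest non-zero representable value.

    Largest is (2**n - 1) * lsb and smallest non-zero is lsb, so the
    weight of the least significant bit cancels out of the ratio.
    """
    return (1 << total_bits) - 1


thirteen = Fraction(0b1101)            # 1101b
fraction = Fraction(0b1101, 1 << 4)    # 0.1101b = 0.8125

# No single placement of the binary point in a 4-bit register holds both.
shared_formats = [
    f for f in range(5) if fits(thirteen, 4, f) and fits(fraction, 4, f)
]
```

With these assumed widths, shared_formats is empty. Widening the register to 8 bits with 4 fraction bits would hold both values, which is the scaling cost the text refers to.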
In order to solve this problem of limited dynamic range, floating point representation and arithmetic are widely used. Generally speaking, floating point representations include three parts: a sign bit, an unsigned fractional number (i.e., the mantissa), and an exponent value. One example of a commonly used floating point representation format, IEEE standard 754, is depicted in FIG. 1.
Referring to FIG. 1, a floating point representation format 2 is shown. Format 2 includes a sign bit (denoted as S) 4, an exponent portion (denoted as E) 6, and a mantissa portion (denoted as F) 8. Floating point values represented in this format have a value V, where V = (−1)^S × 2^(E − bias) × (1.F). Sign bit S represents the sign of the entire number, while mantissa portion F is a 52-bit number with an implied leading 1 bit (values with a leading one bit are said to be “normalized”). In other implementations, the leading one bit may be explicit. Exponent portion E is an 11-bit value that represents the true exponent of the number V offset by a predetermined bias. A bias is used so that both positive and negative true exponents may be easily compared. The number 1023 is used as the bias for an 11-bit exponent in IEEE standard 754 (127 is the bias of the 8-bit exponent used in the 32-bit single precision format). Format 2 may thus accommodate numbers having exponents from −1023 to +1024. Floating point format 2 advantageously allows 53 bits of representation within each of these orders of magnitude.
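As an illustration of the field layout, the following sketch unpacks a Python float into S, E, and F and rebuilds the value from them. It assumes the 64-bit IEEE 754 double format (1 sign bit, 11 exponent bits, 52 mantissa bits), for which the standard's bias is 1023; the function names are hypothetical, and only normalized values are handled.

```python
import struct
from fractions import Fraction

BIAS = 1023          # IEEE 754 bias for an 11-bit exponent field
MANTISSA_BITS = 52


def decode_double(x: float):
    """Split a double into its (S, E, F) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63
    e = (bits >> MANTISSA_BITS) & 0x7FF
    f = bits & ((1 << MANTISSA_BITS) - 1)
    return s, e, f


def value_from_fields(s: int, e: int, f: int) -> Fraction:
    """Evaluate (-1)^S * 2^(E - bias) * (1.F) exactly, normalized values only."""
    if e == 0 or e == 0x7FF:
        raise ValueError("zero, denormal, infinity, and NaN are not handled")
    significand = Fraction((1 << MANTISSA_BITS) | f, 1 << MANTISSA_BITS)
    return (-1) ** s * significand * Fraction(2) ** (e - BIAS)
```

For example, −6.5 equals −1.625 × 2^2, so it decodes to S = 1, E = 1025, and an F whose top bits are 101.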
Floating point addition is an extremely common operation in numerically intensive applications (floating point subtraction is accomplished by complementing one of the inputs and performing addition). Although floating point addition is related to fixed point addition, two differences cause complications. First, an exponent value of the result must be determined from the input operands. Second, rounding must be performed. The IEEE standard specifies that the result of an operation should be the same as if the result were computed exactly and then rounded (to a predetermined number of digits) using the current rounding mode. IEEE standard 754 specifies four rounding modes: round to nearest, round to zero, round to +∞, and round to −∞. The default mode, round to nearest, chooses the even number in the event of a tie.
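The four rounding modes can be illustrated on exact rational values. The sketch below is a simplification: it rounds to an integer rather than to a fixed number of mantissa bits, and the mode names and the function round_exact are hypothetical labels chosen for this example.

```python
import math
from fractions import Fraction


def round_exact(x: Fraction, mode: str) -> int:
    """Round an exactly computed value to an integer under one of four modes."""
    lower = math.floor(x)            # nearest integer at or below x
    if x == lower:
        return lower                 # already representable, no rounding needed
    if mode == "to_pos_inf":
        return lower + 1
    if mode == "to_neg_inf":
        return lower
    if mode == "to_zero":
        return lower if x > 0 else lower + 1
    if mode == "nearest_even":
        distance = x - lower
        if distance > Fraction(1, 2):
            return lower + 1
        if distance < Fraction(1, 2):
            return lower
        # Tie: choose whichever neighbor is even.
        return lower if lower % 2 == 0 else lower + 1
    raise ValueError(f"unknown rounding mode: {mode}")
```

Under round to nearest, the ties 2.5 and 3.5 round to 2 and 4 respectively, since the even neighbor is chosen in each case.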
Referring to FIG. 2, a flow process of a typical floating point unit 10 is shown. Not all stages in the floating point unit 10 are performed for every addition (i.e., some steps are optional depending on the inputs and user commands). The stages of the floating point unit 10 are described below with reference to input values A and B. Input value A has a sign bit AS, an exponent value AE, and a mantissa value AF. Input value B has a sign bit BS, an exponent value BE, and a mantissa value BF.
The floating point unit 10 first includes a stage 12, in which an exponent difference Ediff is calculated between AE and BE. In some implementations, if Ediff is calculated to be negative, operands A and B are switched such that A becomes the larger operand. In the implementation of FIG. 2, the operands are switched such that Ediff is never negative.
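The exponent comparison and operand swap of stage 12 might be sketched as follows. The Operand type and the function order_by_exponent are hypothetical names introduced for illustration, and the sketch compares exponents only, not mantissas.

```python
from typing import NamedTuple, Tuple


class Operand(NamedTuple):
    sign: int       # 0 for positive, 1 for negative
    exponent: int   # biased exponent field
    mantissa: int   # mantissa bits as an unsigned integer


def order_by_exponent(a: Operand, b: Operand) -> Tuple[Operand, Operand, int]:
    """Return (larger, smaller, ediff) so that ediff is never negative."""
    ediff = a.exponent - b.exponent
    if ediff < 0:
        a, b = b, a          # swap so the first operand has the larger exponent
        ediff = -ediff
    return a, b, ediff
```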
In stage 14, the operands are aligned. This is accomplished by shifting operand B Ediff bits to the right. In this manner, the mantissa portions of both operands are scaled to the same order of magnitude. If AE=BE, no shifting is performed. If Ediff>0, however, information must be maintained with respect to the bits that are shifted rightward (and are thus no longer representable within the predetermined number of bits). In order to perform IEEE rounding, information is maintained relative to 3 bits: the guard bit (G), the round bit (R), and the sticky bit (S). The guard bit is one bit less significant than the least significant bit (L) of the shifted value, while the round bit is one bit less significant than the guard bit. The sticky bit is the logical-OR of all bits less significant than the round bit. In the final stage of rounding after normalization (described below), the bit next to the least significant bit of the result becomes the round bit (R) and the bit after this is the sticky bit (S).
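The alignment shift and the three bookkeeping bits can be sketched on plain integers. The function name align is hypothetical, and the mantissa is treated as an unsigned integer of arbitrary width for simplicity.

```python
def align(mantissa: int, ediff: int):
    """Right-shift mantissa by ediff bits and return (kept, G, R, S).

    G is the first bit shifted out, R the second, and S the logical OR
    of every bit below R.
    """
    if ediff < 0:
        raise ValueError("ediff must be non-negative")
    kept = mantissa >> ediff
    shifted_out = mantissa & ((1 << ediff) - 1)
    guard = (shifted_out >> (ediff - 1)) & 1 if ediff >= 1 else 0
    round_bit = (shifted_out >> (ediff - 2)) & 1 if ediff >= 2 else 0
    sticky_mask = (1 << (ediff - 2)) - 1 if ediff > 2 else 0
    sticky = 1 if shifted_out & sticky_mask else 0
    return kept, guard, round_bit, sticky
```

For instance, shifting 10110110b right by four bits keeps 1011b; the bits shifted out are 0110b, giving G = 0, R = 1, and S = 1.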
In stage 16, the shifted version of operand B is complemented, if needed, to perform subtraction. In some implementations, the signs of the input operands and the desired operation are examined in order to determine whether effective addition or effective subtraction is occurring. In one implementation, effective addition, EA, is given by EA=AS⊕BS⊕op, where op is 0 for addition and 1 for subtraction. For example, the operation A minus B, where B is negative, is equivalent to A plus B (ignoring the sign bit of B). Therefore, effective addition is performed. The complementation in stage 16 may be either of the one's complement or two's complement variety.
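The sign test and the complement step can be sketched together. Note one interpretive assumption: evaluating the XOR expression as written for the example in the text (A positive, B negative, op = 1) yields 0, so this sketch treats a result of 0 as effective addition and 1 as effective subtraction. The 8-bit WIDTH and both function names are hypothetical.

```python
WIDTH = 8                       # assumed datapath width for illustration
MASK = (1 << WIDTH) - 1


def is_effective_subtraction(a_sign: int, b_sign: int, op: int) -> bool:
    """XOR of both sign bits and the operation bit (op: 0 = add, 1 = subtract)."""
    return (a_sign ^ b_sign ^ op) == 1


def subtract_via_twos_complement(a: int, b: int) -> int:
    """Compute (a - b) modulo 2**WIDTH by adding the two's complement of b."""
    ones_complement = ~b & MASK         # invert every bit of b
    return (a + ones_complement + 1) & MASK
```

With these definitions, A minus a negative B is not an effective subtraction, matching the example in the text, while 13 minus 5 on the 8-bit datapath yields 8.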
In stage 18, the addition of operand A and operand B is performed. As described above, operand B may be shifted and complemented as needed. Subsequently, in stage 22, the result of stage 18 is normalized. This includes left-shifting the result of stage 18 until the most significant bit is a 1. The bits that are shifted in are calculated according to the values of the result, G, and R. In stage 24, the normalized value is rounded according to the round to nearest mode. If S includes the R bit OR'ed in, round to nearest, RTN, is given by RTN=G(L+S), where + denotes logical OR and juxtaposition denotes logical AND. If the rounding performed in stage 24 produces an overflow, the result is post-normalized (right-shifted) in stage 26.
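Normalization, the RTN condition, and post-normalization can be sketched on integer mantissas. This is a simplified illustration with hypothetical function names: the normalize step shifts in zeros rather than deriving the shifted-in bits from G and R, and the sticky input is assumed to already include the R bit.

```python
def normalize(value: int, width: int):
    """Left-shift value until bit (width - 1) is set; return (value, shift)."""
    if value == 0:
        return 0, 0
    shift = width - value.bit_length()
    if shift < 0:
        raise ValueError("value does not fit in width bits")
    return value << shift, shift


def round_to_nearest(kept: int, guard: int, sticky: int) -> int:
    """Apply RTN = G AND (L OR S), where L is the least significant kept bit."""
    lsb = kept & 1
    return kept + (guard & (lsb | sticky))


def round_and_postnormalize(kept: int, guard: int, sticky: int,
                            exponent: int, width: int):
    """Round, then right-shift once if the increment carried out of width bits."""
    rounded = round_to_nearest(kept, guard, sticky)
    if rounded >> width:
        rounded >>= 1
        exponent += 1
    return rounded, exponent
```

For a 4-bit mantissa 1111b with G = 1, rounding up carries out to 10000b, so the result is right-shifted to 1000b and the exponent is incremented by one.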
As can be seen from the description of the floating point unit 10, floating point operations are quite complicated. Nevertheless, because floating point formats provide a larger dynamic range than integer and fixed point formats, conversion among the various formats is often necessary. For example, most architectural schemes include operations for converting an integer number to a floating point number. Accordingly, the conversion from integer format to floating point format is typically done in a floating point unit. However, because large integer formats require large zero/one detect logic and large shifting logic, some of the conversion takes place in software after the floating point unit issues an unfinished trap signal. In some implementations, such a potential performance problem may be solved using a separate convert datapath that can accommodate larger bit representation formats and logic. However, implementation of such a convert datapath requires substantial amounts of extra hardware and increases cycle time.
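To show why integer conversion needs leading-one detection and wide shifting, the sketch below converts an unsigned 64-bit integer to the exponent and mantissa fields of a 64-bit IEEE 754 double. It is a software illustration under assumed conventions (bias 1023, round to nearest even), not the datapath of any particular floating point unit; the function names are hypothetical, and reference_fields simply reports what Python's own int-to-float conversion produces for comparison.

```python
import struct

BIAS = 1023
MANTISSA_BITS = 52
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1


def int_to_double_fields(n: int):
    """Return (E, F) for an integer 0 < n < 2**64, rounding to nearest even."""
    if not 0 < n < (1 << 64):
        raise ValueError("n must be a non-zero unsigned 64-bit integer")
    msb = n.bit_length() - 1              # leading-one detection
    exponent = msb
    if msb <= MANTISSA_BITS:
        # Small integers fit exactly: shift the leading one up to bit 52.
        significand = n << (MANTISSA_BITS - msb)
    else:
        shift = msb - MANTISSA_BITS       # low bits that must be rounded away
        significand = n >> shift
        remainder = n & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        guard = 1 if remainder & half else 0
        sticky = 1 if remainder & (half - 1) else 0
        significand += guard & ((significand & 1) | sticky)
        if significand >> (MANTISSA_BITS + 1):    # rounding carried out
            significand >>= 1
            exponent += 1
    return exponent + BIAS, significand & MANTISSA_MASK


def reference_fields(n: int):
    """Fields of float(n) as produced by Python's built-in conversion."""
    bits = struct.unpack(">Q", struct.pack(">d", float(n)))[0]
    return (bits >> MANTISSA_BITS) & 0x7FF, bits & MANTISSA_MASK
```

Integers wider than 53 significant bits are the costly case: they need both the wide shift and the rounding step, which is the work the text describes as being deferred to software or to a dedicated convert datapath.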