Floating point number representations are commonly used to represent real numbers in digital computing applications. A floating point number has an associated base value, and is described by three integers: a sign value, a significand, and an exponent. The sign value, the significand and the exponent are encoded using binary representations and stored in memory in a defined format, such as the formats defined in the well-known IEEE Standard for Floating-Point Arithmetic 754-2008. In various references, the significand is sometimes referred to as the “mantissa,” the “fraction,” or the “payload.”
Given a number represented in a floating point format, the value of the real number R that it represents is obtained using the following equation:
$$R = (-1)^{S} \times M \times B^{E} \qquad (1)$$
where B is the base (typically 2 or 10), S is the sign bit and has a value of zero for positive numbers or one for negative numbers, E is the exponent and M is the significand. For example, if the base is B=10, the sign is S=1 (indicating negative), the significand is M=12345, and the exponent is E=−3, then the value of the resulting real number is R=−12.345.
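As an illustration, Eq. (1) can be evaluated exactly with rational arithmetic. The following is a minimal sketch; the function name `fp_value` and its argument order are hypothetical and are not taken from any standard or library.

```python
from fractions import Fraction

def fp_value(sign, significand, exponent, base):
    """Evaluate Eq. (1): R = (-1)^S * M * B^E, as an exact rational number."""
    # Fraction(base) ** exponent stays exact even for negative exponents.
    return (-1) ** sign * significand * Fraction(base) ** exponent

# The worked example from the text: B=10, S=1, M=12345, E=-3.
r = fp_value(sign=1, significand=12345, exponent=-3, base=10)
print(float(r))  # -12.345
```

Using `Fraction` rather than a Python float avoids introducing a second, binary rounding step into an example that is meant to show the decimal case.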
For many years, most digital computing systems encoded floating point numbers using a binary floating point format having a base of B=2 (as defined in IEEE 754-1985). This format is still in predominant use in most desktop computers. The 2008 revision of this standard (IEEE 754-2008) introduces decimal floating point formats that use a base of B=10.
Tables 1 and 2 give the number of significant figures in the significand, together with the range of supported exponent values (Emin≦E≦Emax) for the binary and decimal floating point formats, respectively, defined in IEEE 754-2008.
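TABLE 1
Standard binary floating point formats (B = 2).

| | binary16 | binary32 | binary64 | binary128 |
|---|---|---|---|---|
| significant digits | 11 | 24 | 53 | 113 |
| Emax | +15 | +127 | +1023 | +16383 |
| Emin | −14 | −126 | −1022 | −16382 |

TABLE 2
Standard decimal floating point formats (B = 10).

| | decimal32 | decimal64 | decimal128 |
|---|---|---|---|
| significant digits | 7 | 16 | 34 |
| Emax | +96 | +384 | +6144 |
| Emin | −95 | −383 | −6143 |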
As discussed in the article “Decimal Floating-Point: Algorism for Computers” (Proc. 16th IEEE Symposium on Computer Arithmetic, 2003) by Cowlishaw, decimal floating point formats have the advantage that a calculation done by hand gives the same result as the same calculation done by computer. However, defining a new floating point format causes difficult compatibility issues with older floating point formats.
To convert from a first floating point format to a second floating point format, it is necessary to solve for a new significand and a new exponent that will give the equivalent real number. Mathematically, this corresponds to:
$$M_1 \times B_1^{E_1} = M_2 \times B_2^{E_2} \qquad (2)$$
where the subscript “1” corresponds to the first floating point format having a first base B1, and the subscript “2” corresponds to the second floating point format having a second base B2. Accordingly, E1 is a first exponent and M1 is a first significand for the first floating point format, and E2 is a second exponent and M2 is a second significand for the second floating point format.
One way to solve Eq. (2) for the second exponent E2 and the second significand M2 would be to let M2=M1 and solve the equation for E2:
$$(E_2)_{\mathrm{real}} = E_1 \times \log_{B_2} B_1 = E_1 \times \frac{\log B_1}{\log B_2} \qquad (3)$$
where the logarithms in the log B1/log B2 term have an arbitrary base. However, both E1 and E2 must be stored as integers. Therefore in practice, E2 is set to the integer portion of this quantity:
$$E_2 = \mathrm{Int}\left[E_1 \times \log_{B_2} B_1\right] = \mathrm{Int}\left[E_1 \times \frac{\log B_1}{\log B_2}\right] \qquad (4)$$
where the operator Int[A] gives the integer portion of a real number A. Therefore, there will be a remainder portion that must be incorporated into the value of M2. It can be shown that the new value of M2 will be:
$$M_2 = M_1 \times B_2^{\mathrm{Rem}[E_1 \times \log B_1/\log B_2,\,1]} = M_1 \times \mathrm{bias} \qquad (5)$$
where Rem[A,B] is the remainder of (A/B), and
$$\mathrm{bias} = B_2^{\mathrm{Rem}[E_1 \times \log B_1/\log B_2,\,1]} \qquad (6)$$
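Eqs. (3) through (6) can be traced numerically with a short sketch. It assumes that Int[.] means truncation toward zero and that Rem[., 1] is the fraction left over after that truncation; the text does not state which rounding convention is intended. The function name `convert_components` is hypothetical, and ordinary double-precision arithmetic is used, so the result is only approximate.

```python
import math

def convert_components(m1, e1, b1, b2):
    """Return (m2, e2) such that m1 * b1**e1 is approximately m2 * b2**e2."""
    ratio = math.log(b1) / math.log(b2)  # log B1 / log B2, constant per format pair
    e2_real = e1 * ratio                 # Eq. (3): real-valued exponent
    e2 = math.trunc(e2_real)             # Eq. (4): Int[.] taken as truncation (assumption)
    bias = b2 ** (e2_real - e2)          # Eq. (6): B2 raised to the leftover fraction
    m2 = m1 * bias                       # Eq. (5)
    return m2, e2

# 3 * 2**10 = 3072, so in base 10 the result should be close to 3.072 * 10**3.
m2, e2 = convert_components(m1=3, e1=10, b1=2, b2=10)
print(m2, e2)
```

The exponentiation on the `bias` line is the step the text identifies as expensive; everything else is a multiplication by a constant.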
The value of E2 determined using Eq. (4) can be calculated quickly in a digital computer using simple fixed point multiplication. Note that since B1 and B2 are constants, the value of log B1/log B2 can be stored as a predefined constant. The difficulty comes with the computation of M2 using Eq. (5). In particular, the exponentiation operation of raising the base B2 to a power is not conducive to simple fixed point arithmetic.
One way to compute the value of M2 is to use a Taylor series expansion of the equation. However, this involves many calculations and has accuracy problems. Most practical implementations pre-compute the value of the bias in Eq. (6) for every possible value of E1 and store the results in a look-up table (LUT). However, this approach has the disadvantage that it requires a significant amount of memory. For example, if the first floating point format is the “binary64” format described in IEEE 754-2008, the LUT needs to store 2,046 different values (one for each exponent from Emin to Emax), each of which requires 53 bits of storage memory, for a total of about 13.2 Kbytes of storage memory. Similarly, if the first floating point format is the “binary128” format described in IEEE 754-2008, the LUT needs to store 32,766 different values, each of which requires 113 bits of storage memory, for a total of about 452 Kbytes of storage memory. The appropriate LUT memory needs to be set aside for each pair of formats for which it is necessary to convert. The memory requirements become particularly significant when implementing this conversion in a hardware processor such as a floating-point unit (FPU).
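The storage figures quoted above can be reproduced from the Table 1 parameters. This sketch assumes one LUT entry per exponent value from Emin to Emax, one bit of storage per significand bit, and a Kbyte of 1,024 bytes; the helper name `bias_lut_size` is hypothetical.

```python
# (significand bits, Emin, Emax), taken from Table 1.
BINARY_FORMATS = {
    "binary64": (53, -1022, 1023),
    "binary128": (113, -16382, 16383),
}

def bias_lut_size(sig_bits, e_min, e_max):
    """Return (number of entries, size in Kbytes) for a full bias LUT."""
    entries = e_max - e_min + 1               # one entry per exponent value
    kbytes = entries * sig_bits / 8 / 1024    # bits -> bytes -> Kbytes
    return entries, kbytes

for name, params in BINARY_FORMATS.items():
    entries, kbytes = bias_lut_size(*params)
    print(f"{name}: {entries} entries, about {kbytes:.1f} Kbytes")
# binary64: 2046 entries, about 13.2 Kbytes
# binary128: 32766 entries, about 452.0 Kbytes
```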
FIG. 1 shows a flowchart of a LUT-based method for converting from a binary floating point number 10 having an input base B1=2, to a decimal floating point number 85 having an output base B2=10. This basic approach is used in the publicly available Decimal Floating-Point Math Library available from Intel Corporation of Santa Clara, Calif.
A decode floating point format step 15 is used to decode the binary floating point number 10 to extract a corresponding input sign value 20 (S1), an input exponent 25 (E1), and an input significand 30 (M1). An output sign value 35 (S2) is simply set to be equal to the input sign value 20 (S1). According to Eq. (4), the input exponent 25 (E1) is multiplied by a predetermined constant 45 (log 2/log 10 = log₁₀ 2) using a multiplier 40 to compute an output exponent 50 (E2). The multiplier 40 includes the application of an Int[.] operator so that the resulting output exponent 50 (E2) is an integer. An apply bias LUT step 60 is used to determine a bias value 65 by addressing a bias LUT 55 with the input exponent 25 (E1). The bias LUT 55 stores pre-computed bias values 65 for every possible value of the input exponent 25 (E1) according to Eq. (6). (As mentioned above, if the binary floating point number 10 is in the “binary128” format described in IEEE 754-2008, the bias LUT 55 needs to store 32,766 different entries.) The input significand 30 (M1) is multiplied by the bias value 65 using a multiplier 70 to compute the output significand 75 (M2). The combination of the operations associated with the apply bias LUT step 60 and the multiplier 70 implements the computation given in Eq. (5).
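The data flow of FIG. 1, up to but not including normalization, might be sketched as follows for a binary64 input. This is an illustrative sketch only, not the Intel library's implementation. All function names are hypothetical, Int[.] is assumed to truncate toward zero, the significand is assumed to be decoded as a 53-bit integer, inputs are assumed to be finite and not subnormal, and the bias values are held as ordinary doubles rather than as full-precision table entries.

```python
import math

LOG10_2 = math.log10(2)  # predetermined constant 45

def build_bias_lut(e_min, e_max):
    """Pre-compute Eq. (6) for every input exponent in [e_min, e_max]."""
    lut = {}
    for e1 in range(e_min, e_max + 1):
        e2_real = e1 * LOG10_2
        lut[e1] = 10.0 ** (e2_real - math.trunc(e2_real))
    return lut

def decode_binary64(x):
    """Return (s1, e1, m1) with x = (-1)**s1 * m1 * 2**e1 and m1 an integer."""
    s1 = 1 if math.copysign(1.0, x) < 0 else 0
    frac, exp = math.frexp(abs(x))   # abs(x) = frac * 2**exp, 0.5 <= frac < 1
    m1 = int(frac * 2 ** 53)         # scaling by a power of two is exact
    return s1, exp - 53, m1

def convert_with_lut(x, lut):
    """Binary-to-decimal component conversion following FIG. 1."""
    s1, e1, m1 = decode_binary64(x)  # decode step 15
    s2 = s1                          # output sign equals input sign
    e2 = math.trunc(e1 * LOG10_2)    # multiplier 40 with Int[.], Eq. (4)
    m2 = m1 * lut[e1]                # bias LUT 55 and multiplier 70, Eq. (5)
    return s2, e2, m2

# 2,046 entries: the integer-significand exponents of normal binary64 numbers.
BIAS_LUT = build_bias_lut(-1074, 971)
s2, e2, m2 = convert_with_lut(-12.345, BIAS_LUT)
print(s2, e2, m2)
```

At this point `m2` is still a real number with fractional digits, which is why a separate normalization step is needed before encoding.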
A normalize floating point number step 80 is used to normalize the components of the output floating point number according to the requirements of the specific output floating point format. A floating point format specification, such as the aforementioned IEEE 754-2008 standard, requires that the significand satisfy certain conditions before it is encoded. (For example, if the output floating point number is a decimal floating point number, the encoding specification requires that the significand must be an integer.) The normalize floating point number step 80 modifies the output significand 75 (M2) so that it can be correctly encoded. This is done by multiplying or dividing the output significand 75 (M2) by powers of the output base B2 until it satisfies the required conditions. In the case where the output floating point number is a decimal floating point number, the computed significand must be multiplied by powers of ten until all fractional digits are zero (or insignificant). The output exponent 50 (E2) must be decremented or incremented by a corresponding value so that the real number value of the floating point number remains unchanged. The normalize floating point number step 80 may also include a rounding operation to round off any insignificant digits.
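For a decimal output format, the normalization described above can be sketched with Python's `decimal` module. The function name `normalize_decimal` is hypothetical. The sketch assumes a non-negative, finite significand (the sign is carried separately), round-half-even rounding, and a default precision of 16 digits corresponding to decimal64 in Table 2. Note that `Decimal.normalize` here is the library's trailing-zero stripping, which is used only as a building block.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

def normalize_decimal(m2, e2, digits=16):
    """Return (integer significand, exponent) with the same value as m2 * 10**e2,
    rounded to at most `digits` significant digits."""
    ctx = Context(prec=digits, rounding=ROUND_HALF_EVEN)
    rounded = ctx.plus(m2)             # round off insignificant digits
    reduced = rounded.normalize(ctx)   # strip trailing zeros
    _sign, digit_tuple, shift = reduced.as_tuple()
    m_int = int("".join(str(d) for d in digit_tuple))
    # Reading the digits as an integer multiplies the significand by 10**(-shift),
    # so the exponent is adjusted by +shift to keep the value unchanged.
    return m_int, e2 + shift

print(normalize_decimal(Decimal("3.072"), 3, digits=7))  # (3072, 0)
```

Returning the shortest integer significand is one choice among several; a decimal format permits multiple significand and exponent pairs for the same value, and an implementation could equally pad the significand to full precision.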
An encode floating point format step 85 encodes the output sign value 35 (S2), the output exponent 50 (E2) and the output significand 75 (M2) according to the specification for the desired decimal floating point format (e.g., according to the IEEE 754-2008 standard) to produce the decimal floating point number 85.
The method shown in FIG. 1 can easily be adapted to convert from a decimal floating point number to a corresponding binary floating point number by making appropriate adjustments to the constant 45 and the values stored in the bias LUT 55. In this case, the value of the constant 45 will be log 10/log 2 = log₂ 10, and the bias LUT 55 stores pre-computed bias values 65 for every possible value of the input exponent 25 (E1) according to Eq. (6) using an input base of B1=10 and an output base of B2=2.
There remains a need for a method to convert between different floating point formats that is simultaneously accurate, computationally efficient and requires a minimal amount of memory.