A fused multiply-add (FMA) operator takes three signed floating-point operands in the same precision format: two multiplicands A and B, and an addend C. It computes the term AB+C and produces a result R in the same precision format as the operands. A distinctive feature of such an operator is that it performs a single rounding, at the end of the calculation, so that all the bits of the intermediate calculations are preserved, which improves accuracy. This operator is particularly useful when the product AB and the operand C have opposite signs but very close absolute values. In that case the leading bits of the sum cancel, and significant bits that would otherwise fall outside the precision of the output of the circuit can be recovered from the results of the intermediate calculations by adjusting the exponent.
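The benefit of a single rounding can be illustrated with a minimal software sketch. It is not a model of the circuit: the hypothetical helper `fma_single_rounding` uses exact rational arithmetic to form AB+C and then rounds once to binary64, assuming finite operands and round-to-nearest-even. The operand values are chosen for illustration so that AB and C nearly cancel.

```python
from fractions import Fraction

def fma_single_rounding(a: float, b: float, c: float) -> float:
    """Hypothetical helper: exact a*b + c followed by one rounding to binary64.

    Assumes finite operands; Fraction(x) is exact for any finite float.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two roundings: the product 1 - 2**-60 is first rounded to 1.0, then added to c.
unfused = a * b + c
# One rounding: the small difference survives.
fused = fma_single_rounding(a, b, c)

print(unfused, fused)
```

In this example the separately rounded product loses the term 2^−60 before the addition, whereas the single-rounding model retains it.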
The operands A, B and C are in a floating-point format according to the IEEE-754 standard, comprising a sign bit S, an exponent E and a mantissa m. This standard defines several precision formats, including “binary32” with one sign bit, 8 exponent bits and 23+1 mantissa bits (one implicit bit being encoded in the exponent); “binary64” with an exponent of 11 bits and a mantissa of 52+1 bits; and “binary128” with a 15-bit exponent and a mantissa of 112+1 bits.
When possible, a floating-point number is represented as “normal”, i.e. with a sign bit, a “1” before the decimal point, p−1 bits after the decimal point (where p is the size of the mantissa), and an exponent indicating the position of the decimal point with respect to the unit. The “1” before the decimal point is implicit and is not conveyed. It follows that p−1 bits define a p-bit mantissa for a normal number, the p-th bit being the implicit bit.
The exponent E is encoded with an unsigned number (0 to 255 for the binary32 format). For normal numbers, the real exponent e is obtained by subtracting a bias b: e=E−b. This bias is defined by the standard, e.g. b=127 for the binary32 format.
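The normal numbers are identified by E>0 (the maximum value of E, 255 for the binary32 format, being reserved by the standard for infinities and NaNs). Thus the smallest normal number representable in binary32 format is 1.00...0·2^−126.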
When E=0, the number is called “subnormal”. The real exponent is then e=1−b (which is the value of the smallest normal exponent), but the implicit bit of the mantissa is 0 instead of 1. For example, in the binary32 format, subnormal numbers are of the form ±0.XX...X·2^−126. The value 0 is also represented in this manner.
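The encoding rules above can be sketched in software for the binary32 format. The function name `decode_binary32` and its return convention are illustrative assumptions. The case E=255, which the standard reserves for infinities and NaNs, is rejected rather than handled.

```python
import struct

BIAS = 127  # bias b for binary32
P = 24      # mantissa size p, implicit bit included

def decode_binary32(x: float):
    """Return (sign, e, m) such that x = (-1)**sign * m * 2**(e - (P - 1)).

    m is the p-bit mantissa as an integer, implicit bit included.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    E = (bits >> 23) & 0xFF          # biased exponent, 0 to 255
    frac = bits & 0x7FFFFF           # the p-1 stored mantissa bits
    if E == 0xFF:
        raise ValueError("infinity or NaN: not covered by this sketch")
    if E == 0:
        return sign, 1 - BIAS, frac  # subnormal (or zero): implicit bit is 0
    return sign, E - BIAS, (1 << (P - 1)) | frac  # normal: implicit bit is 1

print(decode_binary32(1.0))
print(decode_binary32(2.0**-149))  # smallest subnormal
```

Normal and subnormal numbers thus share the same interpretation of the pair (e, m); only the value of the implicit bit differs.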
FIG. 1 schematically shows a conventional FMA operator, as described in [“Handbook of Floating Point Arithmetic”, Jean Michel Muller et al., Chaps. 8.5, 9.5]. The mantissas ma and mb of operands A and B, having p bits (including the bit before the decimal point), are provided to a multiplication tree MULT. This tree produces two 2p-bit outputs provided to two inputs of a carry-save adder CSA. A third input of the CSA adder receives 2p bits from a right-shift circuit R-SHIFT. The CSA adder produces the result as a partial sum s and a carry sequence c, on 2p and 2p+1 bits respectively. The outputs s and c of the adder are left-padded, respectively, by the p+5 most significant bits of the output of the R-SHIFT circuit, and p+4 zero bits, thereby forming two 3p+5-bit numbers applied to the two inputs of a full adder 10.
The R-SHIFT circuit receives as most significant bits the mantissa mc of the addend, right-padded with 3p+4 zero bits. The size of the shift circuit is thus 4p+4 bits. The p−1 least significant bits are not used further in the computations; a logical OR of all these bits forms a so-called “sticky bit” that condenses their useful information, namely whether they are all zero or not. The remaining 3p+5 bits can all be used in computations: the 2p least significant bits are supplied to the CSA adder, and the remaining p+5 most significant bits are provided to the full adder 10.
The signs s and the exponents e of the operands are provided to a processing circuit 12 that produces the shift amount for the R-SHIFT circuit, here equal to ea+eb−ec+p+4. According to the sign information, the circuit 12 negates the mantissa mc through a circuit NEG placed upstream of the R-SHIFT circuit.
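The alignment of the addend described above can be sketched as integer operations. The function `align_addend` is a hypothetical illustration, not the circuit itself: it ignores the NEG circuit (mc is treated as unsigned), clamps a negative shift amount to zero, and folds any bits shifted past the right end of the field into the sticky bit. These three choices are assumptions that the description does not spell out.

```python
def align_addend(mc: int, ea: int, eb: int, ec: int, p: int):
    """Return (adder_part, csa_part, sticky) for a p-bit mantissa mc."""
    d = ea + eb - ec + p + 4             # shift amount given in the text
    d = max(d, 0)                        # assumption: clamp negative amounts
    field = mc << (3 * p + 4)            # mc on top, right-padded with 3p+4 zeros
    shifted = field >> d                 # 4p+4-bit shifter output
    lost = field & ((1 << d) - 1)        # bits pushed out of the field
    low = shifted & ((1 << (p - 1)) - 1) # the p-1 least significant bits
    sticky = low != 0 or lost != 0       # assumption: lost bits also count
    kept = shifted >> (p - 1)            # the remaining 3p+5 bits
    csa_part = kept & ((1 << (2 * p)) - 1)  # 2p bits for the CSA adder
    adder_part = kept >> (2 * p)            # p+5 bits for the full adder 10
    return adder_part, csa_part, sticky

# p = 5 and ea+eb-ec = 0, i.e. a shift amount of 9
print(align_addend(0b10000, 0, 0, 0, 5))
```

With p=5 and equal exponents, the shift amount is 9, which places the most significant bit of mc just above the 2p bits supplied to the CSA adder.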
The output of the adder 10, also on 3p+5 bits, is provided to a normalization circuit NORM. Normalization essentially consists in performing a left-shift of the result, to bring the first significant “1” to the front of the number (as required by the standard), and readjusting the exponent accordingly. The exponent and left-shift amount are determined by the terms ea+eb or ec transmitted by the circuit 12, and depend on the alignment position of the mantissa mc with respect to the product mamb. A leading zero anticipation circuit LZA provides additional information for the operation. The LZA circuit estimates, from the 2p+3 least significant bits of each of the inputs of the adder 10, the number of leading zeros at the output of the adder. This number of leading zeros corresponds to a further left-shift to be applied, and to a compensation to apply to the exponent.
The estimate made by the LZA circuit is accurate to within one bit, depending on whether or not the addition propagates a carry into the most significant bit. To remove this uncertainty, a further normalization step is performed; it is very fast, because it operates on only one bit. This procedure is used because the number of leading zeros can thus be estimated in parallel with the calculation of the adder 10, without introducing additional latency in the production of the final result.
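The two-step normalization can be modeled in software as follows. This is only a behavioral sketch: a real LZA circuit works on the adder inputs, whereas here the estimate is simply passed in as a parameter. The function names are hypothetical, and the sketch assumes that the estimate is either exact or one too small.

```python
def exact_leading_zeros(total: int, width: int) -> int:
    """Number of leading zeros of total seen as a width-bit number."""
    return width - total.bit_length()

def normalize(total: int, width: int, exponent: int, lza_estimate: int):
    """Left-shift by the estimate, then correct by at most one bit.

    Assumes lza_estimate is the exact count or one less (an assumption
    about the direction of the LZA error).
    """
    shifted = total << lza_estimate
    exponent -= lza_estimate
    top_bit = (shifted >> (width - 1)) & 1
    if total != 0 and not top_bit:   # one-bit correction step
        shifted <<= 1
        exponent -= 1
    return shifted, exponent

width = 20                           # 3p+5 bits for p = 5
total = 0b1011
lz = exact_leading_zeros(total, width)
print(normalize(total, width, 0, lz), normalize(total, width, 0, lz - 1))
```

Both calls yield the same normalized value and exponent, which shows why a one-bit correction suffices once the estimate is known to be within one bit.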
The circuit NORM provides a result on p+1 bits to a rounding circuit RND that produces the final result R rounded to p bits. The non-zero bits remaining behind the p+1 bits are condensed with the sticky bit from the R-SHIFT circuit. This sticky bit participates in the calculation of the rounding with the least significant bit of the output of the normalization circuit.
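The rounding of a p+1-bit normalized value to p bits can be sketched as below. The description does not name a rounding mode; this sketch assumes round-to-nearest-even, the default mode of the IEEE-754 standard, and the function name `round_to_p_bits` is hypothetical. The least significant bit of the p+1-bit input acts as the round bit.

```python
def round_to_p_bits(norm: int, sticky: bool, p: int):
    """Round a (p+1)-bit value to p bits, ties to even (assumed mode).

    Returns (mantissa, exponent_increment).
    """
    kept = norm >> 1                 # the p most significant bits
    round_bit = norm & 1             # least significant bit of the NORM output
    if round_bit and (sticky or (kept & 1)):
        kept += 1                    # above the midpoint, or tie with odd mantissa
    if kept >> p:                    # rounding overflowed to 2**p
        return kept >> 1, 1          # renormalize and bump the exponent
    return kept, 0

p = 5
print(round_to_p_bits(0b101001, False, p))  # exact tie, even mantissa kept
print(round_to_p_bits(0b101001, True, p))   # sticky bit breaks the tie upward
```

The second call shows the role of the sticky bit: without it, the value would be indistinguishable from an exact tie.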
The circuit RND also determines whether the sign of the result should be reversed, according to information produced by the processing circuit 12 and the sign of the result of the adder (in cases where the sign is not predictable prima facie, for example in AB+C where AB is positive and C negative). In the shown configuration, in calculating for example AB−C (where A, B and C are all positive), the circuit NEG reverses the sign of the mantissa mc, and the sign of the result is not changed. On the other hand, in calculating C−AB, the sign of mantissa mc is again reversed, and the sign of the result R is also reversed (i.e. −(AB−C) is calculated).
FIG. 2 illustrates the operation of the circuit of FIG. 1 at various stages of the flow of the processed numbers. The shaded areas represent the case where AB and C are close (ea+eb−ec=0). The dimensions correspond to p=5. This figure was adapted from the manual mentioned above.
In the upper part, the mantissa mc is introduced into the R-SHIFT circuit. Its shifted version (mc)s is dynamically positioned in a range of 4p+4 bits according to the exponents. The shaded areas indicate the position corresponding to ea+eb−ec=0. The most significant bit of the shifted mantissa (mc)s is aligned with the most significant bit of the shifted product (mamb)s. The product (mamb)s is actually statically shifted by hard-wiring the corresponding input of the adder 10, so that the mantissa (mc)s may, through the R-SHIFT circuit, be shifted at the other input of the adder 10 between a position fully to the left of the product (mamb)s and a position fully to the right. A number of additional bits are provided in the shift range, here four, including three to the left and one to the right of the product, in particular to take account of carry propagations, rounding bits, and uncertainty about the position of the decimal point (e.g. the product of the mantissas of two normal numbers has two bits before the decimal point, where the most significant bit may be zero).
A fifth row shows the output of the adder 10. The sum (mamb)s+(mc)s, shown in gray, occupies one more bit to the left because it can produce a carry-out. Leading zero anticipation (LZA) is performed on the set of potential bits forming the sum up to the right-most bit of the input of the adder 10.
The lower part of the figure corresponds to the normalization operation. The output R before rounding is taken on the p+1 most significant bits of the sum (mamb)s+(mc)s.
All operands of an FMA operator use the same format. If a high precision of the result is needed, the FMA operator will be chosen to use the binary64 format, or even the binary128 format, rather than the binary32 format.
In many applications, it is desirable to have an intermediate precision for the calculations, greater than the precision of the inputs. This overcomes issues with the accumulation of rounding errors. For example, it is common to have to calculate the sum of a series of products, which is achieved by multiply-accumulate iterations. In this case, it is desirable to calculate the accumulation in a higher precision format than the input format. For example, for binary32 inputs, the accumulation would be calculated in the binary64 format.
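The effect of the accumulation format can be illustrated with a small software model. Python floats are binary64; the hypothetical helper `to_binary32` rounds a value to binary32 through the standard `struct` module. The data are chosen for illustration so that every product and every binary64 sum is exact, which makes the narrowed accumulation equivalent to a binary32 accumulation.

```python
import struct

def to_binary32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def multiply_accumulate(pairs, start: float, narrow: bool) -> float:
    """Sum of products, keeping the accumulator in binary32 if narrow is set."""
    acc = start
    for a, b in pairs:
        acc = acc + a * b            # computed in binary64
        if narrow:
            acc = to_binary32(acc)   # accumulator stored in binary32
    return acc

# Sixteen products, each equal to 2**-24, added to an accumulator of 1.0
pairs = [(2.0**-12, 2.0**-12)] * 16
acc32 = multiply_accumulate(pairs, 1.0, narrow=True)
acc64 = multiply_accumulate(pairs, 1.0, narrow=False)
print(acc32, acc64)
```

In this example each product is half a unit in the last place of the binary32 accumulator, so every addition is rounded away in binary32, whereas the binary64 accumulator retains the full sum.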
In such an application, the FMA operator is sized to handle operands having the maximum desired precision, binary64 or even binary128. The binary32 operands are then converted to the binary64 format, for instance, before they are applied to the FMA operator. Program instructions exist for such a conversion in software. The conversion is not trivial: it consists in mapping a 24-bit mantissa on a 53-bit mantissa, mapping an 8-bit exponent on an 11-bit exponent using a different bias, and converting subnormal numbers of binary32 format to normal numbers of binary64 format, since all subnormal numbers of binary32 format can be represented as normal in binary64 format.
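The conversion steps listed above can be sketched at the bit level. The function `binary32_to_binary64_bits` is a hypothetical illustration of the mapping, not the conversion instruction of any particular processor. It assumes finite inputs and rejects the exponent code 255, which the standard reserves for infinities and NaNs.

```python
import struct

def binary32_to_binary64_bits(bits32: int) -> int:
    """Map a binary32 bit pattern to the binary64 pattern of the same value."""
    sign = bits32 >> 31
    E = (bits32 >> 23) & 0xFF
    frac = bits32 & 0x7FFFFF
    if E == 0xFF:
        raise ValueError("infinity or NaN: not covered by this sketch")
    if E == 0:
        if frac == 0:
            return sign << 63                # signed zero
        shift = 24 - frac.bit_length()       # bring the leading 1 to bit 23
        frac = (frac << shift) & 0x7FFFFF    # drop it: it becomes implicit
        e = -126 - shift                     # subnormal becomes normal
    else:
        e = E - 127                          # remove the binary32 bias
    E64 = e + 1023                           # apply the binary64 bias
    return (sign << 63) | (E64 << 52) | (frac << 29)  # 23 bits mapped onto 52

def f32_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f64_from_bits(bits: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

x = 3 * 2.0**-149                            # a binary32 subnormal
print(f64_from_bits(binary32_to_binary64_bits(f32_bits(x))) == x)
```

The subnormal branch is the part that makes the conversion non-trivial: it requires a leading-one detection and a variable shift, in addition to the change of bias.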
As a result, the FMA operator is underutilized for the multiplication operands, and its use is made more complex by the need for format conversion.