In the design of modern computers, fused floating-point multiply-accumulate (FMA) calculations have been an area of significant commercial interest and academic research since at least about 1990. A fused FMA calculation is an arithmetic operation of a form ±A*B±C, wherein A, B and C are floating-point input operands (a multiplicand, a multiplier, and an accumulator, respectively), and wherein no rounding occurs before C is accumulated to a product of A and B. The notation ±A*B±C includes but is not limited to the following cases: (a) A*B+C; (b) A*B−C; (c) −A*B+C; (d) −A*B−C; (e) A*B (i.e., C is set to 0); and (f) A+C (i.e., where B is set to 1.0).
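By way of illustration only, the forms (a) through (f) may be modeled in software as follows. This is a minimal sketch, not a description of any hardware: the function name fma_form and its two sign flags are hypothetical, and exact rational arithmetic is assumed as a stand-in for the unrounded intermediate value, which is then rounded once to double precision. Signed zeros, infinities and NaNs are not modeled.

```python
from fractions import Fraction


def fma_form(a, b, c, negate_product=False, negate_addend=False):
    """Model (+/-)A*B(+/-)C with a single rounding (hypothetical helper).

    Finite inputs only. Fraction holds A*B and the sum exactly, so the
    only rounding is the final conversion back to a double.
    """
    product = Fraction(a) * Fraction(b)   # exact, never rounded
    addend = Fraction(c)
    if negate_product:
        product = -product
    if negate_addend:
        addend = -addend
    return float(product + addend)        # the one and only rounding


A, B, C = 3.0, 4.0, 5.0
print(fma_form(A, B, C))                                           # (a) A*B+C
print(fma_form(A, B, C, negate_addend=True))                       # (b) A*B-C
print(fma_form(A, B, C, negate_product=True))                      # (c) -A*B+C
print(fma_form(A, B, C, negate_product=True, negate_addend=True))  # (d) -A*B-C
print(fma_form(A, B, 0.0))                                         # (e) A*B
print(fma_form(A, 1.0, C))                                         # (f) A+C
```

Cases (e) and (f) reuse the same operation with C set to 0 and B set to 1.0, respectively, as described above.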
IBM's RISC System/6000 ca. 1990 provided an early commercial implementation of this arithmetic capability as an atomic, or inseparable, calculation. Subsequent designs optimized the FMA calculation.
In their 2004 article “Floating-Point Multiply-Add-Fused with Reduced Latency,” authors Tomas Lang and Javier D. Bruguera (“Lang et al.”) taught several important aspects related to optimized FMA design, including: precalculation of an exponent difference and accumulator shift/align amount, alignment of accumulator in parallel with a multiply array, use of 2's complement accumulator when necessary, conditional inversion of Sum & Carry vectors, normalization of Sum & Carry vectors before a final add/round module, overlapping operation of LZA/LOA with a normalization shift, separate calculation of carry, round, guard, & sticky bits, and the use of a dual sum adder having a 1m width (where m is the width of a mantissa of one of the operands) in a unified add/round module.
In their 2005 article “Floating-Point Fused Multiply-Add: Reduced Latency for Floating-Point Addition,” authors Tomas Lang and Javier D. Bruguera (“Lang et al. II”) taught the use of a split (or double) data path separating alignment from normalization cases, wherein a “close” data path was used for effective subtractions with exponent difference among {2,1,0,−1} (a concept further developed and significantly improved upon in the detailed description), and a “far” data path was used for all remaining cases. Lang et al. II also taught use of dual alignment shifters in the far data path for the carry-save output of the multiply array, and a very limited alignment shift in the close data path.
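For illustration, the close/far selection rule summarized above may be sketched in software as follows. This is a sketch under stated assumptions rather than a reproduction of the cited design: the function names are hypothetical, the exponent difference is assumed to be the sum of the unbiased exponents of A and B minus the unbiased exponent of C, and the operation is assumed to be of the form A*B+C with finite, nonzero operands. The cited article may define the difference with a different sign or offset.

```python
import math

# Exponent differences routed to the "close" path, per the summary above.
CLOSE_DIFFERENCES = {2, 1, 0, -1}


def unbiased_exponent(x):
    """Exponent e such that |x| = 1.f * 2**e (hypothetical helper)."""
    # math.frexp returns a mantissa in [0.5, 1), so its exponent is e + 1.
    return math.frexp(x)[1] - 1


def select_path(a, b, c):
    """Return 'close' or 'far' for A*B+C (assumed convention, see lead-in)."""
    if 0.0 in (a, b, c) or not all(math.isfinite(v) for v in (a, b, c)):
        raise ValueError("sketch handles finite, nonzero operands only")
    product_sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    addend_sign = math.copysign(1.0, c)
    # Effective subtraction: the product and the accumulator differ in sign.
    effective_subtraction = product_sign != addend_sign
    # ASSUMPTION: difference = exp(A) + exp(B) - exp(C).
    difference = (unbiased_exponent(a) + unbiased_exponent(b)
                  - unbiased_exponent(c))
    if effective_subtraction and difference in CLOSE_DIFFERENCES:
        return "close"
    return "far"


print(select_path(1.5, 1.0, -1.25))   # effective subtraction, difference 0
print(select_path(1.5, 1.0, 1.25))    # effective addition
print(select_path(8.0, 1.0, -1.0))    # effective subtraction, difference 3
```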
In the 2004 article “Multiple Path IEEE Floating-Point Fused Multiply-Add,” author Peter-Michael Seidel (“Seidel”) taught that other enhancements to FMA design may be realized by considering multiple parallel computation paths. Seidel also taught deactivation of gates on paths that are not used; determination of multiple computation paths from exponent difference and effective operation; use of two distinct computation paths, one for small exponent differences wherein mass cancellation may occur, and another for all other cases; and the insertion of the accumulator value into the significand product calculation for cases corresponding to small exponent differences with effective subtraction.
The present-day ubiquity of personal, portable computing devices that provide extensive media delivery and internet content access requires even further efforts to design FMA logic that is cheaper to produce, consumes significantly less power and energy, and permits a higher throughput of instruction results.
The predominant approach to performing an FMA operation involves the use of unified multiply-accumulate units to perform the entire FMA operation, including rounding the result. Most academic proposals and commercial implementations generally describe a monolithic, or atomic, functional unit having the capability to multiply two numbers, add the unrounded product to a third operand, the addend or accumulator, and round the result.
An alternative approach is to use a conventional multiply unit to perform the A*B sub-operation and then a conventional add unit to accumulate C to the product of A and B. But this conventional split-unit approach sacrifices the speed and performance gains that can be obtained by accumulating C with the partial products of A and B in the same unit. The conventional split-unit approach also involves two rounding operations: the product of A and B is rounded, and then the accumulation of C to the rounded product is rounded again. Accordingly, the conventional split-unit approach sometimes produces a different and less accurate result than the unified approach. Also, because of its double-rounding operation, the conventional split-unit approach cannot perform a “fused” FMA operation and does not comply with the IEEE 754 technical standard for floating-point computations.
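The effect of the two rounding operations may be seen in a small software model. The following is a sketch only: the function names are hypothetical, the operand values are chosen for illustration, and the single-rounding result is assumed to be modeled by exact rational arithmetic followed by one conversion to double precision, not by any particular hardware unit.

```python
from fractions import Fraction


def split_multiply_add(a, b, c):
    """Model of the split-unit approach: two roundings."""
    product = a * b        # first rounding: product rounded to a double
    return product + c     # second rounding: sum rounded to a double


def fused_multiply_add(a, b, c):
    """Model of the fused approach: one rounding (finite inputs only)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)   # no rounding yet
    return float(exact)                               # single rounding


# Illustrative operands: A*B equals 1 - 2**-54 exactly, which lies halfway
# between two adjacent doubles and rounds (ties-to-even) up to 1.0.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(split_multiply_add(a, b, c))   # the low-order term is lost
print(fused_multiply_add(a, b, c))   # the low-order term survives
```

With these operands the split-unit model returns zero because rounding the product discards the 2**-54 term before C is accumulated, whereas the single-rounding model returns −2**-54, which is the exact value of A*B+C.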
Because FMA hardware may serve multiple computing purposes and enable compliance with IEEE 754, computer designers frequently seek to entirely replace prior separate multiply and add functional units with atomic FMA execution units in modern products. However, there are multiple detriments to this approach.
First, the implementation cost of FMA hardware is generally higher, and the implementation more complex, than that of separate multiply and add functional units. Second, when performing a simple addition or multiplication, the latency through FMA hardware is greater than that through a separate add or multiply functional unit, and the FMA hardware generally consumes more power. Third, the combination of multiply and add capabilities into one functional unit, in a superscalar computer processor design, reduces the number of available ports to which an arithmetic instruction may be dispatched, reducing the computer's ability to exploit parallelism in source code level, or machine level, software.
This third detriment may be addressed by adding more functional units, such as a stand-alone adder functional unit, which further increases implementation cost. Essentially, an additional adder (for example) becomes the price of maintaining acceptable instruction level parallelism (ILP) while providing atomic FMA capability. This then contributes to increased overall implementation size and increased parasitic capacitance and resistance. As semiconductor manufacturing technology trends toward smaller feature sizes, this parasitic capacitance and resistance contributes more significantly to the timing delay, or latency, of an arithmetic calculation. This timing delay is sometimes modelled as a delay due to “long wires.” Thus, the addition of separate functional units to compensate for diminished ILP with atomic FMA implementations provides diminishing returns relative to die space required, power consumption, and latency of arithmetic calculation.
As a result, the best proposals and implementations generally (but not always) provide the correct arithmetic result (with respect to IEEE rounding and other specifications) and sometimes offer higher instruction throughput, but they increase cost of implementation by requiring significantly more hardware circuits, and they increase power consumption by performing simple multiply or add calculations on more complex FMA hardware.
The combined goals of modern FMA design remain incompletely served.