The present invention relates in general to data processing systems, in particular, to a unit, method, system and computer program product for performing fused-multiply-add floating-point operations on 128 bit wide operands.
The IEEE-754-2008 Standard for Binary Floating Point Arithmetic, published in 2008, specifies a floating point data architecture that is commonly implemented in computer hardware, such as floating point processors having multipliers. The format consists of a sign, an unsigned biased exponent, and a significand. The sign is a single bit and is represented by an “S”. The unsigned biased exponent, represented by an “e”, is, for example, 8 bits long for single precision, 11 bits long for double precision and 15 bits long for quadruple precision. The significand is, for instance, 24 bits long for single precision, 53 bits long for double precision and 113 bits long for quadruple precision. As defined by the IEEE-754-2008 standard, the most significant bit of the significand, i.e. the so-called implicit bit, is not stored but is decoded out of the exponent bits.
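By way of illustration only, the field layout described above can be sketched in a few lines of Python. The names `FORMATS`, `decode` and `double_to_bits` are hypothetical helpers introduced for this sketch. The sketch assumes that the implicit bit is 0 when the biased exponent is all zeros and 1 otherwise, and it does not treat infinities or NaNs (all-ones exponent) specially.

```python
import struct

# (exponent bits, significand bits including the implicit bit), as listed above.
FORMATS = {"single": (8, 24), "double": (11, 53), "quad": (15, 113)}

def decode(bits, fmt):
    """Split a raw bit pattern into (sign, biased exponent, significand)."""
    exponent_bits, significand_bits = FORMATS[fmt]
    fraction_bits = significand_bits - 1  # the implicit bit is not stored
    sign = (bits >> (exponent_bits + fraction_bits)) & 1
    biased_exponent = (bits >> fraction_bits) & ((1 << exponent_bits) - 1)
    fraction = bits & ((1 << fraction_bits) - 1)
    # Implicit bit decoded out of the exponent bits: 0 for an all-zero
    # exponent (zero and subnormal numbers), 1 otherwise.
    implicit_bit = 0 if biased_exponent == 0 else 1
    significand = (implicit_bit << fraction_bits) | fraction
    return sign, biased_exponent, significand

def double_to_bits(x):
    """Return the 64-bit pattern of a Python float (a binary64 value)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

sign, exponent, significand = decode(double_to_bits(-1.5), "double")
print(sign, exponent, significand.bit_length())
```

For the double precision value -1.5, the decoded significand is 53 bits long, although only 52 fraction bits are stored in the operand.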
To improve floating-point arithmetic processing most modern processors use a process called the fused-multiply-add (in the following abbreviated as FMA) process to combine a floating-point multiplication operation, e.g., A×B, and a floating point addition operation, e.g., +C, for execution as a single instruction, e.g., A×B+C, where A, B, C are operands of the multiplication product A×B and the sum of C and the product. By performing two operations in a single instruction, the FMA process reduces overall execution time. The FMA process also provides improved precision because rounding need only be performed after both the multiplication and addition operations are performed at full precision. For instance, there is only one rounding error instead of two.
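The difference between one rounding and two can be made concrete with a small Python sketch. It is a model only, not a description of any hardware: the hypothetical helper `fused_multiply_add` forms A×B+C exactly with rational arithmetic and rounds once when converting back to double precision, whereas `unfused_multiply_add` rounds after the multiplication and again after the addition. The operand values are chosen for illustration.

```python
from fractions import Fraction

def unfused_multiply_add(a, b, c):
    """Two roundings: once after the multiplication, once after the addition."""
    return a * b + c

def fused_multiply_add(a, b, c):
    """One rounding: A*B+C is formed exactly, then rounded to a double.

    Model only; assumes finite operands and a result within double range.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0 ** -52
b = 1.0 - 2.0 ** -52
c = -1.0

# The exact product is 1 - 2**-104, which rounds to 1.0 in double precision,
# so the unfused sequence loses the small term entirely.
print(unfused_multiply_add(a, b, c))
print(fused_multiply_add(a, b, c))
```

With these operands the unfused sequence returns zero, while the single-rounding model retains the term -2^-104.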
Analytics applications, especially when running on large amounts of data, are very compute intensive. Their main data types are binary floating-point. This includes commercially available analytics software like ILOG, SPSS, Cognos, Algo, and many specialized analytics packages for the insurance and banking sectors.
Many mobile applications require location detection routines, which are also floating-point intensive calculations. Performance of these routines is key in emerging sectors like telematics, which combines mobile input with database queries and insurance analytics codes and has real-time requirements.
With growing problem size, numerical sensitivities of the algorithms are magnified. That degrades the stability of the algorithms and reduces the speed of convergence. This is a well-known effect in the high-performance computing arena. The easiest way to address this issue is to switch the mathematically critical routines from double precision to quad precision floating-point (128 bit).
With Big Data Analytics, this numerical stability issue is also hitting the commercial space. For example, convergence issues are noticed for very large ILOG® installations and for clients' risk assessment codes running on large data sets. ILOG is a registered trademark of International Business Machines Corporation, Armonk, N.Y., USA. For such large ILOG® installations, 15-30% faster convergence is noticed when switching to 128 bit floating-point calculations.
By way of example, US 2016/0048374 A1 discloses techniques for emulating fused-multiply-add (FMA) operations via the use of assist instructions. According to the techniques of this disclosure, FMA operations are emulated via assist instructions such that existing hardware for performing unfused-multiply-add operations may be used to emulate fused-multiply-add operations without requiring other specialized hardware.
Emulating a fused-multiply-add operation for a first operand, a second operand, and a third operand includes determining, by at least one processor, an intermediate value based at least in part on multiplying the first operand with the second operand. The method further includes determining, by the at least one processor, at least one of an upper intermediate value or a lower intermediate value, wherein determining the upper intermediate value includes rounding, towards zero, the intermediate value by a specified number of bits, and wherein determining the lower intermediate value includes subtracting the upper intermediate value from the intermediate value. The method further includes determining, by the at least one processor, an upper value and a lower value based at least in part on adding the third operand to one of the upper intermediate value or the lower intermediate value. The method further includes determining, by the at least one processor, an emulated fused-multiply-add result for the first operand, the second operand, and the third operand by adding the upper value and the lower value.
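The data flow of such an emulation can be loosely sketched in Python. This is not the implementation of the cited publication: the helper names `truncate_to_bits` and `emulated_fma` are hypothetical, exact rational arithmetic stands in for the wider intermediate value, the number of retained bits (53) is an assumption, and the third operand is assumed to be added to the upper intermediate value. The sketch is not guaranteed to be correctly rounded for all operands.

```python
from fractions import Fraction

def truncate_to_bits(value, bits=53):
    """Round an exact value towards zero, keeping `bits` significant bits."""
    if value == 0:
        return Fraction(0)
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    # Find e such that 2**e <= magnitude < 2**(e + 1).
    e = magnitude.numerator.bit_length() - magnitude.denominator.bit_length()
    if Fraction(2) ** e > magnitude:
        e -= 1
    quantum = Fraction(2) ** (e - bits + 1)  # weight of the last kept bit
    return sign * (magnitude // quantum) * quantum

def emulated_fma(a, b, c):
    """Split the product into upper and lower parts, add c, then recombine."""
    intermediate = Fraction(a) * Fraction(b)    # exact product (assumption)
    upper = truncate_to_bits(intermediate)      # upper intermediate value
    lower = intermediate - upper                # lower intermediate value
    upper_value = float(upper) + c              # third operand added here
    lower_value = float(lower)
    return upper_value + lower_value            # emulated result

a = 1.0 + 2.0 ** -52
b = 1.0 - 2.0 ** -52
c = -1.0
print(a * b + c)              # plain multiply followed by add
print(emulated_fma(a, b, c))  # split-and-recombine emulation
```

For these illustrative operands the plain multiply-then-add sequence yields zero, whereas the split keeps the low-order part of the product and yields -2^-104.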
U.S. Pat. No. 9,104,474 B2 discloses methods and circuits for energy efficient floating-point multiply and/or add operations. The embodiments provide energy-efficient variable-precision multiply and/or add operations while keeping track of how many mantissa bits of a floating-point number may be certain and/or provide an energy efficient floating-point multiplication that includes a replay of the multiplication when a lowest portion of a multiplication result could affect the final result.
The variable precision floating-point circuit uses real-time certainty tracking to provide run-time precision selection. The certainty tracking enables low-precision calculations, whose result may be uncertain, to be redone with higher precision if necessary. Because the certainty may be dependent upon the data, it is determined along with the numerical computations. The circuits keeping track of the certainty add minimal overhead, while the majority of calculations produce correct results with lower precisions.
The floating-point multiplication steps are performed by an N-bit by N-bit multiplier (N×N-bit multiplier) circuit including a parallelogram configured to set carries of a predetermined number of least significant bits of a multiplication product to zero for a multiplication operation, and a detection circuit to induce a replay of the multiplication operation by the multiplier to generate a full multiplication result if necessary.
The variable precision floating-point circuit determines the certainty of the result of a multiply add floating-point calculation in parallel with the floating-point calculation. The variable precision floating-point circuit uses the certainty of the inputs in combination with information from the computation, such as, binary digits that cancel, normalization shifts, and rounding, to perform a calculation of the certainty of the result. A variable precision floating point circuit includes a variable precision mantissa unit that supports multiple precisions, multiple exponent data paths that support a maximum parallelism at a lowest precision, and certainty calculation units that provide certainty bounds of the outputs.
On processors according to the state of the art as described above, 128 bit floating-point operations are emulated in software. The described methods are usually one to two orders of magnitude slower than a hardware implementation, which makes them less attractive for Big Data Analytics.