Aspects of the present invention relate in general to data processing systems, and in particular, to performing binary fused multiply-add floating-point calculations.
The IEEE-754-2008 Standard for Binary Floating Point Arithmetic, published in 2008, specifies a floating point data architecture that is commonly implemented in computer hardware, such as floating point processors having multipliers. The format consists of a sign, an unsigned biased exponent, and a significand. The sign is a single bit and is represented by an “S”. The unsigned biased exponent, represented by an “e”, is, e.g., 8 bits long for single precision, 11 bits long for double precision and 15 bits long for quadruple precision. The significand, including its most significant bit, is, e.g., 24 bits long for single precision, 53 bits long for double precision and 113 bits long for quadruple precision. As defined by the IEEE-754-2008 standard, the most significant bit of the significand, i.e. the so-called implicit bit, is not stored but is decoded out of the exponent bits.
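The field layout described above can be illustrated for the double precision case with a minimal sketch. It assumes that a Python float is stored as an IEEE 754 binary64 value, which holds on common platforms; the function name decode_binary64 is hypothetical and chosen only for illustration.

```python
import struct

def decode_binary64(x):
    """Split a float (assumed IEEE 754 binary64) into sign, biased exponent, significand."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63                                    # 1 sign bit
    biased_exponent = (bits >> 52) & 0x7FF               # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)                    # 52 stored significand bits
    # The implicit bit is derived from the exponent field: 0 when the field is
    # all zeros (zero or subnormal), otherwise 1. For an all-ones field
    # (infinity or NaN) the significand value is not meaningful as a number.
    implicit_bit = 0 if biased_exponent == 0 else 1
    significand = (implicit_bit << 52) | fraction        # 53 bits in total
    return sign, biased_exponent, significand
```

For example, decode_binary64(1.0) yields a sign of 0, a biased exponent of 1023, and a significand in which only the implicit bit is set.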
To improve floating-point arithmetic processing, most modern processors use a process called fused-multiply-add (in the following abbreviated as FMA) to combine a floating point multiplication operation, e.g., A*B, and a floating point addition operation, e.g., +C, for execution as a single instruction, e.g., A*B+C, where A and B are the operands of the multiplication and C is the addend that is summed with the product A*B. By performing two operations in a single instruction, the FMA process reduces overall execution time. The FMA process also provides improved precision because rounding need only be performed once, after both the multiplication and addition operations have been performed at full precision. That is, there is only one rounding error instead of two.
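The precision benefit of a single rounding can be shown with a small sketch. The fused result is modeled here with exact rational arithmetic followed by one conversion to floating point; this is an illustration only, not a model of any hardware, and it assumes binary64 floats with round-to-nearest-even. The function names are hypothetical.

```python
from fractions import Fraction

def multiply_then_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c

def fused_multiply_add(a, b, c):
    """Model of A*B+C with one rounding: compute exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single conversion back to binary64

a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0
# Exact product is 1 - 2**-54, which is not representable in binary64
# and rounds to 1.0, so the separate operations lose the small term.
unfused = multiply_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
```

With these operands the separate multiply and add return 0.0, whereas the single-rounding model returns -2**-54, which is the exact value of A*B+C.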
In floating point processors, one central area is the multiplier array. The multiplier array is used to multiply two numbers. Usually Booth's encoding with radix 4, a commonly used fast multiplication algorithm, is employed. This reduces the number of product terms that need to be summed up to n/2+1, where n is the number of bits per operand. The summation is done using carry-save-adder circuitry, which allows processing of all bits in parallel, as opposed to normal addition, where the carry-out of the lower bit position is chained to the next higher position and which is usually performed by carry-propagate-adder circuitry. The circuitry that does this summation is known in the art as a reduction tree. At the end of the reduction tree, there remain two terms, the sum term and the carry term, which represent a summation part of information and a carry part of information, respectively. These terms are then added to the aligned addend, again using a carry-save addition. Finally, only two terms remain, again a sum term and a carry term, and these two terms are added using the carry-propagate-adder to generate one final result.
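The data flow described above can be sketched on unsigned integers as follows. This is a simplified illustration under stated assumptions, not a description of any particular multiplier array: the operand width n is even, the addend is assumed to be already aligned and smaller than 2**(2n), the reduction tree is modeled as a simple sequence of 3:2 carry-save steps, and all function names are hypothetical.

```python
def booth_radix4_digits(y, n):
    """Recode an unsigned n-bit multiplier into n//2 + 1 digits in {-2,...,2}."""
    assert n % 2 == 0 and 0 <= y < (1 << n)
    padded = y << 1  # append a zero below the least significant bit
    digits = []
    for i in range(n // 2 + 1):
        triplet = (padded >> (2 * i)) & 0b111  # bits y[2i+1], y[2i], y[2i-1]
        low = triplet & 1
        mid = (triplet >> 1) & 1
        high = (triplet >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits

def carry_save_add(a, b, c, mask):
    """3:2 compression: every bit position is processed independently."""
    sum_term = (a ^ b ^ c) & mask
    carry_term = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return sum_term, carry_term

def multiply_add(x, y, aligned_addend, n):
    """Compute x*y + aligned_addend via Booth terms and carry-save reduction."""
    width = 2 * n + 2
    mask = (1 << width) - 1
    digits = booth_radix4_digits(y, n)
    # One product term per Booth digit, in two's complement modulo 2**width.
    terms = [((d * x) << (2 * i)) & mask for i, d in enumerate(digits)]
    while len(terms) > 2:  # reduction tree, modeled sequentially
        s, c = carry_save_add(terms[0], terms[1], terms[2], mask)
        terms = terms[3:] + [s, c]
    # Add the aligned addend, still in carry-save form.
    s, c = carry_save_add(terms[0], terms[1], aligned_addend & mask, mask)
    return (s + c) & mask  # final carry-propagate addition
```

For 4-bit operands, the multiplier 11 is recoded into the three digits -1, -1, 1 (that is, -1 - 4 + 16), and multiply_add(13, 11, 7, 4) returns 150.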
Analytics applications, especially when running on large amounts of data, are very compute intensive. Their main data types are binary floating-point. This includes commercially available analytics software, like ILOG, SPSS, Cognos, Algo, and many specialized analytics packages for the insurance and banking sectors.
Many mobile applications require location detection routines, which are also floating-point intensive calculations. Performance of these routines is key in emerging sectors, like telematics, which combines mobile input with database queries and insurance analytics codes and has real-time requirements.
For both areas, the latency and throughput of the floating-point unit (FPU) matter greatly for the performance of the applications. For fraud detection, for example, the FPU performance is key in whether the detection is near real-time or not.
The IEEE 754-2008 standard makes FMA (fused-multiply-add, i.e. A*B+C) mandatory for modern floating-point units. Since the introduction of the IEEE 754-2008 standard, there has been great competition for the fastest FPU with FMA support.
US 2014/0188966 A1 and similarly U.S. Pat. No. 8,892,619 B2, each of which is hereby incorporated herein by reference in its entirety, disclose a floating-point fused multiply-add (FMA) unit embodied in an integrated circuit. The FMA unit includes a multiplier circuit with floating point inputs A and C and floating point output A*C, and an adder circuit connected to the output of the multiplier circuit. The adder circuit adds the floating point output A*C to a floating point input B, producing a result A*C+B. The adder circuit includes an exponent difference circuit implemented in parallel with the multiplier circuit, a close path circuit implemented after the exponent difference circuit, a far path circuit implemented after the exponent difference circuit, a 2:1 multiplexer (Mux) circuit connected to outputs of the close path circuit and the far path circuit, and a rounder circuit connected to an output of the 2:1 multiplexer circuit. The FMA unit also includes accumulation bypass circuits forwarding an unrounded output of the 2:1 multiplexer circuit to inputs of the close path circuit and the far path circuit, and forwarding an exponent result in a carry save format to an input of the exponent difference circuit. Also included in the FMA unit is a multiply-add bypass circuit forwarding the unrounded output of the 2:1 multiplexer circuit to the floating point inputs A and C of the multiplier circuit. According to the disclosures of US 2014/0188966 A1 and U.S. Pat. No. 8,892,619 B2, the rounding correction for the B operand (addend) is required to happen before the addend enters the alignment shifter. The accumulation bypass circuits are staggered with respect to each other such that the exponent data path starts earlier than the mantissa data path and an exponent result is calculated earlier than a mantissa result in a carry save format to be forwarded for a next dependent calculation.
The far path circuit combines incrementing with a shifting and addition data path. The unrounded output is right padded with ones in case of incrementing before being fed to a shifter or adder, and circuitry in the adder circuit uses carry propagation as if the unrounded result had been incremented before shifting and addition.
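One way to read the padding technique is through the identity ((u << k) | (2^k - 1)) + 1 = (u + 1) << k: when the unrounded value u is padded on the right with k ones and a carry of one is injected at the lowest position, the carry ripples through the padding and increments u. The following sketch illustrates only this arithmetic identity on integers. It is an interpretation, not the circuit of the cited documents, and the function names and parameters are hypothetical.

```python
def pad_with_ones(unrounded, pad_bits):
    """Shift left and fill the vacated low-order positions with ones."""
    return (unrounded << pad_bits) | ((1 << pad_bits) - 1)

def add_with_deferred_increment(unrounded, pad_bits, other, increment):
    """Add `other` to the shifted value as if `unrounded` had been incremented first."""
    if increment:
        operand = pad_with_ones(unrounded, pad_bits)
        carry_in = 1  # the carry propagates through the ones into `unrounded`
    else:
        operand = unrounded << pad_bits
        carry_in = 0
    return operand + other + carry_in
```

In both cases the result equals ((unrounded + increment) << pad_bits) + other, so no separate incrementer is needed ahead of the shift and the addition in this model.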
Thus, US 2014/0188966 A1 discloses a fused multiply-add on a cascaded multiply-add pipeline. The FMA unit requires a special floating point adder structure for forwarding an unrounded output of the 2:1 multiplexer to the floating point inputs A and C of the multiplier circuit.
U.S. Pat. No. 8,671,129 B2, hereby incorporated herein by reference in its entirety, discloses a processing unit for performing a multiply operation in a multiply-add pipeline. To reduce the pipeline latency, the unrounded result of a multiply-add operation is bypassed to the inputs of the multiply-add pipeline for use in a subsequent operation. If it is determined that rounding is required for the prior operation, then the rounding occurs during the subsequent operation. During the subsequent operation, a Booth encoder not utilized by the multiply operation outputs a rounding correction factor as a selection input to a Booth multiplexer not utilized by the multiply operation. When the Booth multiplexer receives the rounding correction factor, the Booth multiplexer outputs a rounding correction value to a carry save adder (CSA) tree, and the CSA tree generates the correct sum from the rounding correction value and the other partial products.
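The arithmetic behind such a deferred rounding can be sketched on integer significands. The sketch assumes that rounding up means adding one unit in the last place to the forwarded value, so that the missing contribution to the product is exactly one copy of the other multiplier operand. Ordinary Python multiplication and summation stand in for the Booth partial products and the CSA tree; the function name is hypothetical, and the sketch is not the circuit of the cited patent.

```python
def product_from_unrounded(unrounded, round_up, other):
    """Multiply as if `unrounded` had been rounded, using an extra correction term."""
    partial_products = [unrounded * other]  # stand-in for the regular partial products
    # Correction value: (unrounded + 1) * other - unrounded * other == other
    correction = other if round_up else 0
    partial_products.append(correction)
    return sum(partial_products)            # stand-in for CSA tree and final adder
```

For example, product_from_unrounded(11, True, 6) returns 72, which equals (11 + 1) * 6, without the incremented operand ever being formed.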
The processing unit disclosed in U.S. Pat. No. 8,671,129 B2 works on a fused multiply-add algorithm. U.S. Pat. No. 8,671,129 B2 describes performing the correction on multiplier inputs as well as the necessary modifications of the Booth structure. However, it does not describe how to apply a rounding correction to the addend.
The FMA unit design disclosed in US 2014/0188966 A1, which requires a special adder structure consisting of a far path and a close path, is also described, as applied to a cascaded fused multiply-add pipeline, in the publication by Sameh Galal and Mark Horowitz, “Latency Sensitive FMA Design”, published in 2011 in the 20th IEEE Symposium on Computer Arithmetic, which is hereby incorporated herein by reference in its entirety.
U.S. Pat. No. 8,990,282 B2, which is hereby incorporated herein by reference in its entirety, discloses a fused multiply add floating point unit that includes multiplying circuitry and adding circuitry. The multiplying circuitry multiplies operands B and C having N-bit significands to generate an unrounded product B*C. The unrounded product B*C has an M-bit significand, where M>N. The adding circuitry receives an operand A that is input at a later processing cycle than the processing cycle at which the multiplying circuitry receives operands B and C. The adding circuitry commences processing of the operand A after the unrounded product B*C is generated by the multiplying circuitry. The adding circuitry adds the operand A to the unrounded product B*C and outputs a rounded result A+B*C.
Thus, in U.S. Pat. No. 8,990,282 B2, in the case of a fused multiply add operation, the multiplier forwards an unrounded product to the adder part, which is a particular feature of a cascaded multiply and add design. Further, in a second kind of forwarding, a rounded result is forwarded back to the A input, i.e. the addend. The result provided by the disclosed multiply add unit is always fully rounded; this can be the result of a multiply, an add, a subtract, or a multiply-add operation. The operands received by the disclosed unit are also fully rounded. There is no unrounded forwarding of results from a prior operation to an operand of a subsequent operation.
U.S. Pat. No. 8,977,670 B2, which is hereby incorporated herein by reference in its entirety, discloses implementing an unfused multiply-add instruction within a fused multiply-add pipeline. The system includes an aligner having an input for receiving an addition term, a multiplier tree having two inputs for receiving a first value and a second value for multiplication, and a first carry save adder (CSA), wherein the first CSA may receive partial products from the multiplier tree and an aligned addition term from the aligner. The system includes a fused/unfused multiply add (FUMA) block which receives the first partial product, the second partial product, and the aligned addition term, wherein the first partial product and the second partial product are not truncated. The FUMA block performs an unfused multiply add operation or a fused multiply add operation using the first partial product, the second partial product, and the aligned addition term, e.g., depending on an opcode or mode bit. Forwarding of a result of an operation back into a next operation is only possible from a multiply to the input of an add operation. The forwarded product is in an intermediate, non-normalized format. For the unfused multiply-add, the unrounded and unnormalized product is passed (forwarded) internally within the FMA pipeline to the add operation, and the product will require a rounding correction. However, the forwarding is very limited: it is only from the product to the one operand of the addition which does not undergo alignment. There is no unrounded forwarding from an add or multiply-add to any input operand of a subsequent operation, and there is no unrounded forwarding from the product to the add term which undergoes the alignment shift.