A high-performance floating-point unit (FPU) is a major component of modern central processing unit (CPU) and graphics processing unit (GPU) designs. Its applications range from multimedia and 3D graphics processing to scientific and engineering applications. Most recent designs incorporate an integrated multiply-accumulate unit because multiply-accumulate operations occur so frequently, for example in dot products and matrix multiplication. These multiply-accumulate units usually implement the fused multiply-add (FMA) operation, which is part of the IEEE floating-point arithmetic standard, IEEE 754-2008. The standard defines fusedMultiplyAdd(A, C, B) as the operation that computes (A×C)+B as if with unbounded range and precision, rounding only once to the destination format. As a result, a fused multiply-add has lower latency and higher precision than a multiplication followed by a separate addition, which rounds twice.
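The effect of rounding only once can be illustrated with a small software model. The following is a minimal sketch, not a description of any hardware unit: the function names `fused_multiply_add` and `unfused_multiply_add` are hypothetical, the destination format is assumed to be IEEE binary64 (Python's `float`), and only finite operands are handled. Exact rational arithmetic stands in for the "unbounded range and precision" of the intermediate result.

```python
from fractions import Fraction


def fused_multiply_add(a: float, c: float, b: float) -> float:
    """Model of (a*c)+b with a single rounding (finite inputs only).

    The product and sum are formed exactly as rationals; the conversion
    back to float is the one and only rounding step.
    """
    exact = Fraction(a) * Fraction(c) + Fraction(b)
    return float(exact)


def unfused_multiply_add(a: float, c: float, b: float) -> float:
    """Multiply, round, then add, round: two roundings in total."""
    product = a * c      # first rounding
    return product + b   # second rounding


# Operands chosen so that the exact product, 1 - 2**-104, does not fit
# in a binary64 significand and rounds to 1.0 when computed on its own.
a = 1.0 + 2.0**-52
c = 1.0 - 2.0**-52
b = -1.0

print(fused_multiply_add(a, c, b) == -(2.0**-104))   # True
print(unfused_multiply_add(a, c, b) == 0.0)          # True
```

In this example the separate multiplication discards the low-order term of the product before the addition can use it, so the unfused result is zero, while the single-rounding model returns the exact residual.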
FMA units reduce latency by aligning the addend significand (SB) in parallel with the multiplication tree of the other two significands, SA and SC. Furthermore, the multiplier output is kept in carry-save format and added to the aligned addend, thereby saving one extra add operation. However, since the addend's exponent might be smaller or larger than the sum of the multiplicands' exponents, the addend significand can be shifted anywhere from all the way to the left of the multiplier result to all the way to the right, requiring the datapath to have wide shifters and adders. Another problem with traditional FMA designs is that they do not distinguish between the latency of accumulation and the latency of multiply-add, resulting in designs that have equal latencies for all dependencies. In view of the above, there is a need in the art for improved FMA units.
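The alignment problem described above can be sketched numerically. The following is an illustrative model under stated assumptions, not a datapath specification: the helper `align_addend` is hypothetical, significands are treated as p-bit integers (p = 53 is assumed, as in binary64), and the product of SA and SC is taken to be 2p bits wide. The function reports where the addend lands relative to the product and how many bit positions would be needed to hold both operands with no bits dropped.

```python
def align_addend(e_a: int, e_c: int, e_b: int, p: int = 53):
    """Classify where a p-bit addend lands relative to the 2p-bit product.

    With integer significands, the product occupies bit weights
    [e_a + e_c, e_a + e_c + 2p) and the addend occupies [e_b, e_b + p).
    Returns (offset, position, width).
    """
    e_prod = e_a + e_c
    offset = e_b - e_prod            # addend LSB relative to product LSB
    if offset >= 2 * p:
        position = "left"            # addend lies entirely above the product
    elif offset <= -p:
        position = "right"           # addend lies entirely below the product
    else:
        position = "overlap"         # some bit weights are shared
    # Span of bit weights covering both operands side by side.
    lo = min(e_prod, e_b)
    hi = max(e_prod + 2 * p, e_b + p)
    return offset, position, hi - lo


print(align_addend(0, 0, 0))      # (0, 'overlap', 106)
print(align_addend(0, 0, 200))    # (200, 'left', 253)
print(align_addend(0, 0, -100))   # (-100, 'right', 206)
```

Because the offset depends only on the difference between the addend exponent and the sum of the multiplicand exponents, the span grows in both directions as that difference grows, which is the source of the wide shifters and adders noted above.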