Modern processors often include instructions for operations that are computationally intensive but offer a high level of data parallelism, which an efficient implementation can exploit using various data storage devices such as single instruction multiple data (SIMD) vector registers. In some alternative processors, instructions may provide fused operations, such as multiply-add operations. In other alternative processors, both types of instructions may be provided, and/or both types may be combined in individual instructions, such as SIMD multiply-add operations.
Some processors in the past have implemented instructions to perform fused floating-point multiply-add operations. For example, in 1990, IBM implemented fused floating-point multiply-add operations on the RISC System/6000 (IBM RS/6000) processor. Some applications, for example those involving computation of dot products, could use the new instructions to improve performance. But since the width of the floating-point hardware required to support such operations may be at least twice the width of standard floating-point multipliers and adders, one floating-point multiply-adder could take up as much area as two floating-point multipliers and two floating-point adders. Therefore, the fused floating-point multiply-adder might completely replace the individual floating-point multipliers and floating-point adders, and might be used to emulate an individual floating-point multiply and/or an individual floating-point add, but at some (perhaps significant) performance cost. Legacy applications that were not recompiled, and applications that could not make use of the fused floating-point multiply-add operations, suffered (perhaps significant) performance degradation.
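As an illustration of why dot products benefit, the following minimal Python sketch accumulates a dot product with one fused multiply-add per element and compares it with the unfused form. The helper names (`emulated_fma`, `dot_fused`, `dot_unfused`) are hypothetical and do not come from any processor described here. The fused operation is emulated in software with exact rational arithmetic, so the sketch shows only the numerical behavior, not hardware performance.

```python
from fractions import Fraction


def emulated_fma(a: float, b: float, c: float) -> float:
    """Software emulation of a fused multiply-add (hypothetical helper).

    Computes a*b + c exactly as a rational number, then rounds once to
    the nearest double. Assumes finite inputs (no infinities or NaNs).
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def dot_fused(xs, ys):
    """Dot product using one fused multiply-add per element."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = emulated_fma(x, y, acc)  # single rounding per step
    return acc


def dot_unfused(xs, ys):
    """Dot product using a separate multiply and add per element."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = x * y + acc  # product is rounded, then the sum is rounded
    return acc


if __name__ == "__main__":
    a = 1.0 + 2.0**-30
    b = 1.0 - 2.0**-30
    # Exact value of (1.0 * -1.0) + (a * b) is -2**-60.
    print(dot_fused([1.0, a], [-1.0, b]))
    print(dot_unfused([1.0, a], [-1.0, b]))
```

In this example the unfused loop loses the small term because the product `a * b` is rounded to 1.0 before the addition, whereas the fused loop retains it.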
Some other processors in the past have implemented instructions that approximate fused floating-point multiply-add operations. For example, in 2001, the HAL SPARC64 implemented pseudo-fused floating-point multiply-add operations by bypassing results from the floating-point multipliers to the floating-point adders. This approach did not suffer the same performance degradation for legacy applications that were not recompiled, or for applications that could not make use of the fused floating-point multiply-add operations. However, the width of the floating-point multipliers, bypasses and floating-point adders was not sufficient to provide the same improved accuracy as a true fused floating-point multiply-add operation.
In 2008, the Institute of Electrical and Electronics Engineers (IEEE) issued IEEE Std 754-2008, a revision of the floating-point standard IEEE Std 754™-1985. The revised standard includes fused multiply-add (FMA) and fused multiply-subtract (FMS) operations. It specifies that a true IEEE fused floating-point multiply-add operation performs no rounding between the multiplication and the addition, which improves accuracy. While standardization will undoubtedly prompt new processors that offer the IEEE FMA and FMS operations, the previously mentioned issues of performance degradation and increased die area remain.
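The accuracy difference that results from omitting the intermediate rounding can be seen in a small numerical sketch. The Python code below is an illustration only, under stated assumptions: the function names are hypothetical, the fused result is emulated by evaluating a*b + c exactly with rational arithmetic and rounding once to a double, and only finite inputs are handled.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fused a*b + c (hypothetical helper, finite inputs only).

    The product and sum are formed exactly; float() then performs the
    single rounding to the nearest double.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    product = a * b      # first rounding
    return product + c   # second rounding


if __name__ == "__main__":
    a = 1.0 + 2.0**-30   # exactly representable as a double
    b = 1.0 - 2.0**-30   # exactly representable as a double
    c = -1.0
    # Exact product is 1 - 2**-60, which rounds to 1.0 as a double.
    print(unfused_multiply_add(a, b, c))  # the 2**-60 term is lost
    print(fused_multiply_add(a, b, c))    # the 2**-60 term survives
```

Here the exact value of a*b + c is -2^-60. The unfused sequence returns 0.0 because the product is rounded to 1.0 first, while the emulated fused operation returns -2^-60 exactly.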
To date, potential solutions to such performance-limiting issues, area tradeoff issues, and related power issues, as well as to the need for recompilation, have not been adequately explored.