A processing core can include multiple data processors that execute program instructions by performing various arithmetic operations. Examples of arithmetic operations that can be performed by arithmetic processing units (APUs) of such processors include addition, multiplication, division, and the like. Some APUs can also support more complex operations. One example is a multiply-and-accumulate (MAC) operation, which computes the product of two numbers and adds that product to a third number. The numerical format of the numbers used in such a computation can vary depending on the implementation. Two common numerical formats are integer format and floating-point format.
Floating Point (FP) Number Processing
Some data processor devices may include a specialized arithmetic processing unit called a floating-point (FP) processing device that can operate on operands that have a floating-point numerical format. FP arithmetic is widely used for performing tasks such as graphics processing, digital signal processing, and processing associated with scientific applications. An FP processing device generally includes devices dedicated to performing specific operations with respect to floating-point numbers, such as addition, multiplication, and division. These fundamental operations can be referred to herein as floating-point add (FADD), floating-point multiply (FMUL), and floating-point divide (FDIV), respectively.
Floating Point (FP) Multiply-and-accumulate (MAC) Operations
In addition, some APUs can be designed to support more complex FP operations such as an FP MAC operation. In an FP MAC operation, two FP operands (A and B) are multiplied and the product is added to a third FP operand (C) to generate a result. When a MAC operation is performed on floating-point numbers, it can be performed using either two rounding steps or a single rounding step. Because floating-point numbers have finite precision, the result can differ depending on whether the MAC operation is performed with two roundings or with a single rounding.
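By way of a non-limiting illustration, the difference between the two rounding approaches can be sketched in Python. The sketch assumes double-precision operands and emulates the single-rounding case in software by computing A×B+C exactly with rational arithmetic and rounding once at the end; the function names mac_two_roundings and mac_single_rounding are hypothetical and do not correspond to any particular hardware instruction.

```python
from fractions import Fraction

def mac_two_roundings(a: float, b: float, c: float) -> float:
    """Multiply, round the product to double, then add and round again."""
    product = a * b          # first rounding (product rounded to double)
    return product + c       # second rounding (sum rounded to double)

def mac_single_rounding(a: float, b: float, c: float) -> float:
    """Software emulation of a single-rounding MAC (assumes finite inputs).

    Every finite double is an exact rational number, so the product and sum
    can be formed without error and rounded to double exactly once.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # the only rounding step

# Operands chosen so that the exact product, 1 - 2**-60, is not a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

two_step = mac_two_roundings(a, b, c)     # product rounds to 1.0, so the sum is 0.0
one_step = mac_single_rounding(a, b, c)   # exact result -2**-60 is representable
print(two_step, one_step)
```

With these operands the two-rounding path loses the entire result, whereas the single-rounding path returns it exactly. For most operands the difference is far smaller, typically confined to the last bit of the result.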
Fused Multiply-and-accumulate (FMAC) Operation
When an FP MAC operation is performed with a single rounding, the operation is commonly referred to as a fused multiply-add (FMADD) or fused multiply-and-accumulate (FMAC) operation. In other words, the entire sum C+A×B is computed to its full precision before the final result is rounded to N significant bits. In comparison to a processor that requires a distinct multiply instruction followed by a distinct add instruction, a processor that includes an FMAC instruction in its instruction set may improve the speed and accuracy of many important computations that involve the accumulation of products, such as matrix multiplication, dot product calculation, or polynomial expansion. The FMAC operation may improve accuracy because the result is generated with a single rounding rather than the two roundings that must be performed when a distinct multiply instruction is followed by a distinct add instruction. In the latter case, the product of the multiply is rounded, whereas the FMAC instruction need not round the product before adding it to the third operand. Additionally, the FMAC instruction may improve speed because a single instruction can generally be executed faster than two instructions.
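As a hedged sketch of how the accumulation of products may benefit, the following Python example computes a dot product in two ways: with a separate multiply and add per element, and with an emulated fused step per element. The helper fused_step is a hypothetical software stand-in for an FMAC instruction (it assumes finite double-precision inputs and uses exact rational arithmetic), and the input vectors are chosen only to make the effect visible.

```python
from fractions import Fraction

def fused_step(a: float, b: float, c: float) -> float:
    """Hypothetical stand-in for an FMAC instruction: a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def dot_separate(xs, ys) -> float:
    """Dot product using a distinct multiply and a distinct add per element."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = acc + x * y            # product rounded, then sum rounded
    return acc

def dot_fused(xs, ys) -> float:
    """Dot product using one emulated fused multiply-and-accumulate per element."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fused_step(x, y, acc)  # a single rounding per element
    return acc

xs = [1.0, 1.0 + 2.0 ** -30]
ys = [-1.0, 1.0 - 2.0 ** -30]

# Exact reference value computed with rationals: -2**-60.
exact = sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))

print(dot_separate(xs, ys), dot_fused(xs, ys), float(exact))
```

For this input the fused accumulation reproduces the exact dot product, while the separate multiply and add return zero. A fused step still rounds once per element, so exactness is not guaranteed for arbitrary inputs.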
Operand Formats
A floating-point processing device typically supports arithmetic operations on operands that use different number formats, such as single-precision, double-precision, and extended-precision formats. In addition, some floating-point processing devices support arithmetic operations on operands having a packed single-precision number format. An operand that has a packed single-precision number format contains two individual single-precision values.
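For illustration only, a packed single-precision operand can be modeled in Python as two 32-bit single-precision encodings placed side by side. The 64-bit container width, the placement of the first value in the low half, and the names pack_ps and unpack_ps are assumptions made for this sketch rather than details taken from any particular device.

```python
import struct

def pack_ps(low: float, high: float) -> int:
    """Pack two single-precision values into one 64-bit integer.

    Assumption: 'low' occupies bits 0..31 and 'high' occupies bits 32..63.
    Each value is rounded to single precision by struct's 'f' format.
    """
    raw = struct.pack("<ff", low, high)   # two little-endian 32-bit floats
    return struct.unpack("<Q", raw)[0]    # reinterpret as one 64-bit word

def unpack_ps(word: int) -> tuple:
    """Split a 64-bit packed operand back into its two single-precision values."""
    raw = struct.pack("<Q", word)
    return struct.unpack("<ff", raw)      # (low, high)

packed = pack_ps(1.0, -2.0)
print(hex(packed))          # 0xc00000003f800000
print(unpack_ps(packed))    # (1.0, -2.0)
```

Under this model, an operation on packed single-precision operands would apply the same arithmetic independently to the low pair and the high pair of values.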
It would be desirable to provide arithmetic processing devices, and methods for implementing the same, that can accurately, efficiently, and quickly execute a fused multiply-and-accumulate instruction with respect to floating-point operands that have a packed single-precision format. It would also be desirable to speed up computation of the high part of a result during a fused multiply-and-accumulate operation so that cycle delay can be reduced.