1. Field of the Invention
The present invention relates to floating-point multiplication and accumulation. More particularly, but without limitation, the present invention relates to floating-point multiplication and accumulation units (fmacs) that perform floating-point multiplication and accumulation operations, such as (A×B)+C, with significant temporal overlap of suboperations, as explained below.
2. Description of Related Art
Typically, an fmac is part of the floating-point unit of a microprocessor. An fmac generally performs a floating-point multiplication and accumulation, such as (A×B)+C, as a single operation. In microprocessor design, fmacs play an important role in, for example, graphics applications. Fmacs reduce the instruction path length by combining two instructions into one. Graphics applications often involve coordinate transformations, which are implemented as matrix transformations. By performing (A×B)+C as a single functional operation instead of a separate multiplication and addition, an fmac can increase the speed with which such matrix transformations are performed. For instance, when running at peak condition, an fmac can perform floating-point operations twice as fast as a unit that performs the multiplication and the addition in (A×B)+C separately.
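The instruction-count argument above can be sketched in a few lines of Python. This is an illustrative model only: the names `mac`, `transform`, and `TRANSLATE` are hypothetical, and Python rounds the product before the addition, so the sketch models the number of operations issued rather than the internal behavior of a hardware fmac.

```python
def mac(a, b, c):
    """Model of one multiply-accumulate operation, (a * b) + c.

    Assumption: this counts as a single issued operation. Python rounds the
    product before adding, so only the operation count is modeled here.
    """
    return a * b + c


def transform(matrix, vector):
    """Return (matrix applied to vector, number of mac operations issued)."""
    ops = 0
    result = []
    for row in matrix:
        acc = 0.0
        for a, b in zip(row, vector):
            acc = mac(a, b, acc)  # one operation instead of a multiply and an add
            ops += 1
        result.append(acc)
    return result, ops


# Hypothetical example: translation by (2, 3, 4) in homogeneous coordinates.
TRANSLATE = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]
point = [1.0, 1.0, 1.0, 1.0]

moved, mac_ops = transform(TRANSLATE, point)
# A unit without a mac operation issues one multiply and one add per term.
separate_ops = 2 * mac_ops
```

In this sketch the 4×4 transformation issues 16 multiply-accumulate operations, where separate multiplication and addition would issue 32.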
FIG. 1 illustrates a prior art fmac design. In FIG. 1, a binary floating-point number, A 110, is multiplied by another binary floating-point number, B 114. The result of this multiplication is then added to another binary floating-point number, C 118. A floating-point number includes a sign, a signed exponent, and a mantissa. FIG. 1 illustrates the case in which A 110, B 114, and C 118 are m-bit numbers, i.e., each has an m-bit mantissa.
In this prior art design, the multiplication of A 110 and B 114 is performed using a multiplication array 122. The multiplication array 122 produces the partial products formed by multiplying, for instance, the mantissa of A 110 by each digit of the mantissa of B 114. The result of the multiplication can be up to a 2m-bit number, since A 110 and B 114 are m-bit numbers. While the multiplication of A 110 and B 114 is occurring, the mantissa of C 118 is pre-normalized 126. The pre-normalization 126 shifts the mantissa of C 118 so that the exponent of C 118 agrees with the exponent of the product of A 110 and B 114.
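As a minimal sketch of the partial-product scheme described above, the following Python models the mantissas as unsigned integers and forms one partial product per bit of the second mantissa. The function names and the choice of m = 4 are illustrative assumptions, not part of the design of FIG. 1.

```python
def partial_products(mant_a, mant_b, m):
    """One partial product per bit of mant_b.

    Each partial product is mant_a shifted left by the weight of that bit,
    or zero when the bit is clear.
    """
    return [(mant_a << i) if (mant_b >> i) & 1 else 0 for i in range(m)]


def multiply_mantissas(mant_a, mant_b, m):
    """Sum the partial products of two m-bit unsigned mantissas."""
    assert 0 <= mant_a < (1 << m) and 0 <= mant_b < (1 << m)
    product = sum(partial_products(mant_a, mant_b, m))
    # The product of two m-bit numbers always fits in 2m bits.
    assert product < (1 << (2 * m))
    return product


M = 4  # illustrative mantissa width
# Largest 4-bit mantissas: the product needs the full 2m = 8 bits.
worst_case = multiply_mantissas(0b1111, 0b1111, M)
worst_case_bits = worst_case.bit_length()
```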
Next, a 3m-bit carry propagation adder (CPA) 130 adds the product of the multiplication of A 110 and B 114 to the pre-normalized number C 118. The more bit positions an adder has to process, the slower it operates and the more hardware it requires to implement. Nevertheless, this prior art design employs a 3m-bit CPA 130, because conceivably each bit of the mantissa of C 118 is more significant than any of the bits of the 2m-bit mantissa of the product of A 110 and B 114. Therefore, an adder adding the pre-normalized C 118 to the product of A 110 and B 114 requires the capacity to handle a 3m-bit wide mantissa 134.
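The width argument can be checked with a small Python sketch. It assumes unsigned magnitudes and models pre-normalization as a left shift of the mantissa of C relative to the least significant bit of the product; the name `align_and_add` and the parameter `shift` are hypothetical, and cases in which C is less significant than the whole product are not modeled.

```python
def align_and_add(product, c_mant, shift, m):
    """Add an m-bit mantissa, shifted left by `shift`, to a 2m-bit product.

    Assumption: 0 <= shift <= 2m, where shift == 2m is the extreme case in
    which every bit of C is more significant than every bit of the product.
    """
    assert 0 <= product < (1 << (2 * m))
    assert 0 <= c_mant < (1 << m)
    assert 0 <= shift <= 2 * m
    total = product + (c_mant << shift)
    # The sum never needs more than 3m bit positions.
    assert total < (1 << (3 * m))
    return total


M = 4  # illustrative mantissa width
product = 0b11100001   # 2m-bit product
c_mant = 0b1001        # m-bit mantissa of C
total = align_and_add(product, c_mant, 2 * M, M)
total_bits = total.bit_length()
```

With the extreme shift of 2m, the mantissa of C occupies the upper m bit positions and the product occupies the lower 2m, so the sum spans all 3m = 12 positions in this example.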
In parallel with the addition performed by the 3m-bit adder 130, a leading one/zero anticipator 138 operates on the numbers being added by the adder 130 to predict the bit position of the most significant bit of the number resulting from the addition. A leading one detector would require less hardware and operate faster than the leading one/zero anticipator 138 and, unlike the anticipator 138, would be error free, as explained below. However, this prior art fmac uses the anticipator 138, because it needs to find the most significant bit of the output of the adder 130 while the adder 130 is executing, i.e., in parallel with the addition performed by the adder 130.
The leading one/zero anticipator 138 predicts the location of the most significant one bit in the result of the addition, if the result is positive, and of a leading zero if the result is negative. Predicting the leading zero can result in an error of the most significant bit being predicted one bit position too far to the right. For instance, the anticipator 138 may predict the leading zero of a negative number to appear in the fifth position from the left, but due to a carry generated when taking the twos complement of the negative number, the actual most significant bit ends up being in the fourth position from the left.
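The one-bit error can be reproduced with a small Python sketch. It is a simplification: an actual anticipator works from the adder inputs rather than from the finished sum, so the code below only compares the position of the leading zero in a negative twos complement pattern with the position of the leading one in its magnitude. The function names and the 8-bit width are illustrative assumptions.

```python
def first_bit_position(bits, width, value):
    """1-based position, counted from the left, of the first bit equal to value."""
    for pos in range(1, width + 1):
        if ((bits >> (width - pos)) & 1) == value:
            return pos
    return None


def twos_complement_magnitude(bits, width):
    """Invert the pattern and add one, keeping only `width` bits."""
    mask = (1 << width) - 1
    return ((~bits & mask) + 1) & mask


WIDTH = 8

# Case with the error: the increment carries through the inverted low bits.
negative = 0b11110000
predicted = first_bit_position(negative, WIDTH, 0)
actual = first_bit_position(twos_complement_magnitude(negative, WIDTH), WIDTH, 1)

# Case without the error: the increment does not reach the leading position.
negative_ok = 0b11110101
predicted_ok = first_bit_position(negative_ok, WIDTH, 0)
actual_ok = first_bit_position(twos_complement_magnitude(negative_ok, WIDTH), WIDTH, 1)
```

For the pattern 11110000 the leading zero is in the fifth position from the left, while the magnitude 00010000 has its leading one in the fourth position, which is the situation described above.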
Next, the 3m-bit mantissa output of the 3m-bit adder 130 is normalized by a 3m-bit normalizer 142. The normalizer 142 keeps up to m of the most significant bits and discards up to 2m of the least significant bits of the 3m-bit mantissa output of the CPA 130. Only an m-bit mantissa is kept, which may have up to m significant bits. In performing the normalization 142, this fmac uses the prediction of the leading one/zero anticipator 138 to determine the position of the most significant bit in the 3m-bit mantissa output of the 3m-bit adder 130.
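A minimal sketch of this normalization step, under stated assumptions: the 3m-bit sum is an unsigned integer, the most significant bit position is supplied as a 1-based index from the left, and discarded bits are simply truncated (rounding is not modeled). The name `normalize` and its parameters are hypothetical.

```python
def normalize(sum_bits, msb_position, m):
    """Keep m bits of a 3m-bit value, starting at msb_position from the left.

    Bits to the right of the kept field are discarded (truncated). If fewer
    than m bits remain to the right of msb_position, the field is padded
    with zeros on the right.
    """
    width = 3 * m
    assert 1 <= msb_position <= width
    mask = (1 << m) - 1
    lowest_kept = (width - msb_position) - (m - 1)  # bit index of the last kept bit
    if lowest_kept >= 0:
        return (sum_bits >> lowest_kept) & mask
    return (sum_bits << -lowest_kept) & mask


M = 4  # illustrative mantissa width, so the adder output is 12 bits wide
adder_output = 0b000010110111      # leading one in the fifth position from the left
mantissa = normalize(adder_output, 5, M)
```

Here the kept field is 1011 and the four least significant bits, 0111, are discarded.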
If the m-bit output of the normalizer 142 is negative, then the prior art design of FIG. 1 requires taking the twos complement of the output of the normalizer 142. To generate the twos complement, the result of the normalization 142 is inverted (not shown) and incremented 146 (that is, the incrementer 146 adds a one to the m-bit number that is output by the normalizer 142). Then, in those cases in which the prediction of the most significant bit position is in error, as explained above, this prior art design applies a binary shift 150 to the m-bit mantissa output by the incrementer 146. When activated, the binary shifter 150 shifts the m-bit mantissa to the right by one bit and adds a one to the exponent of the m-bit floating-point number.
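The invert, increment, and corrective shift sequence can be sketched as follows. This is a simplified model, not the circuit of FIG. 1: it assumes the shifter is activated when the increment carries out of the m-bit field, which the text above does not state, and it ignores any bits discarded by the normalizer. The name `negate_and_correct` is hypothetical.

```python
def negate_and_correct(normalized, exponent, m):
    """Invert and increment an m-bit field, then apply the one-bit correction.

    Assumption: a carry out of the m-bit field signals that the predicted
    most significant bit position was one place too far to the right.
    """
    mask = (1 << m) - 1
    assert 0 <= normalized <= mask
    inverted = ~normalized & mask      # inversion
    incremented = inverted + 1         # increment by one
    if incremented >> m:               # carry out of the m-bit field
        # Shift right by one bit and add one to the exponent.
        return incremented >> 1, exponent + 1
    return incremented, exponent


M = 4  # illustrative mantissa width

# Field 0000: inversion gives 1111, the increment carries out, correction applies.
corrected = negate_and_correct(0b0000, 0, M)

# Field 0101: inversion gives 1010, the increment gives 1011, no correction.
uncorrected = negate_and_correct(0b0101, 0, M)
```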
Following the binary shift 150, the result 154 of the operation (A×B)+C is available. The total execution time for this prior art fmac design is approximately the time needed to execute the following in sequence: (m-bit×m-bit multiplication)+(3m-bit addition)+(3m-bit normalization)+(m-bit increment)+(binary shift).