Floating point operands are commonly used by data processors. A floating point operand has a sign, an exponent and a mantissa as defined, for example, by the IEEE 754 standard. Data processors commonly perform a multiply and add, or multiply and accumulate, operation wherein a product of two operands is subsequently added to a third operand. To obtain higher performance and higher precision in performing this operation, the two mathematical operations have been merged, or fused, so that a portion of the addition of the third operand begins prior to completion of the multiplication of the first and second operands. As operating frequencies have increased and continue to increase, merged ‘multiply and accumulate’ functions require increasingly longer latencies to compute. The reason is that there have been few fundamental advances in how to implement the multiply/accumulate function. Therefore, as the clock cycle length shortens, the latency, measured as the number of clock cycles needed to implement the function, increases.
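The precision benefit of fusing the two operations can be illustrated with a short software sketch. The sketch below is not part of any cited design; it assumes IEEE 754 double precision operands and emulates the fused operation with exact rational arithmetic followed by a single rounding. The function names `unfused_multiply_add` and `fused_multiply_add` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round the product to double precision, then add and round again."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fused operation: compute a*b + c exactly, then round once.

    Fraction holds every finite double exactly, and converting the exact
    rational result back to float performs the single final rounding.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Illustrative operands (an assumption of this sketch): the exact product is
# 1 - 2**-60, which is not representable in double precision.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = unfused_multiply_add(a, b, c)   # intermediate rounding loses the low bits
merged = fused_multiply_add(a, b, c)       # low bits survive until the final rounding
```

With these operands the separately rounded product becomes exactly 1.0, so the unfused sum is zero, while the single-rounding result retains the small nonzero difference.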
A traditional fused multiply/add microarchitecture multiplies two operands while simultaneously bit aligning a third operand to be added. The latency of the shift operation is therefore hidden by the latency of the multiplication operation, and this saving made the architecture popular. The result may require normalization because massive cancellation of the operands in an effective subtract operation can leave a number of leading zeros in the mantissa of the result. Lastly, a rounding operation is required to provide the final result. It should be noted that this microarchitecture requires sequential steps associated with multiplication, addition, normalization and rounding. An example of this microarchitecture is shown by R. K. Montoye et al. in an article entitled “Design of the IBM RISC System/6000 Floating-Point Execution Unit”, IBM J. RES. DEVELOP., Vol. 34 No. 1, January 1990. This information is also disclosed in U.S. Pat. No. 4,999,802.
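The sequence of steps described above can be sketched in software as follows. This is a simplified behavioral model under stated assumptions, not the cited hardware: each operand is an integer significand paired with a power-of-two exponent (value = significand × 2^exponent), the alignment is done with exact left shifts rather than a fixed-width shifter, and rounding is round-to-nearest-even to a hypothetical `precision` of 8 bits. The names `fma_steps` and `to_value` are illustrative only.

```python
def fma_steps(sa, ea, sb, eb, sc, ec, precision=8):
    """Model a*b + c as multiply, align, add, normalize, round (in that order)."""
    # Step 1: multiply the significands and add the exponents.
    sp, ep = sa * sb, ea + eb

    # Step 2: align product and addend to a common exponent.
    # Hardware overlaps this shift with the multiplication; here it is sequential.
    e = min(ep, ec)
    sp_aligned = sp << (ep - e)
    sc_aligned = sc << (ec - e)

    # Step 3: add (an effective subtract when the signs differ).
    total = sp_aligned + sc_aligned
    if total == 0:
        return 0, 0
    sign = -1 if total < 0 else 1
    mag = abs(total)

    # Step 4: normalize so that the significand occupies `precision` bits.
    excess = mag.bit_length() - precision
    if excess <= 0:
        # Cancellation left leading zeros: shift left, nothing to round.
        return sign * (mag << -excess), e + excess

    # Step 5: round to nearest, ties to even, discarding `excess` low bits.
    kept = mag >> excess
    remainder = mag & ((1 << excess) - 1)
    half = 1 << (excess - 1)
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1
        if kept.bit_length() > precision:   # rounding carried out of the top bit
            kept >>= 1
            excess += 1
    return sign * kept, e + excess


def to_value(sig, exp):
    """Convert a (significand, exponent) pair to a Python float for inspection."""
    return sig * 2.0 ** exp


# 1.0 * 1.0 + 1.0, with 1.0 encoded as 128 * 2**-7
plain_sum = fma_steps(128, -7, 128, -7, 128, -7)
# 1.0078125 * 1.0 - 1.0: an effective subtract with cancellation
cancelled = fma_steps(129, -7, 128, -7, -128, -7)
```

In the second call the leading bits of the product and the addend cancel, so the normalization step, rather than the rounding step, determines the final significand and exponent.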
Another issue associated with pipelined multiplier/accumulators is the processing of two sequential operations wherein the second operation requires a result from the first operation. This condition is known as a data dependency. When a data dependency exists with a pipelined execution unit, the second set of operands cannot be introduced until the first operation completes, which means waiting for the entire latency of the execution unit pipeline.
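The cost of a data dependency can be estimated with a simple cycle count. The sketch below is an idealized model, assuming a pipeline that accepts one new operation per clock cycle and has no result forwarding; the function name `total_cycles` and the example latency of five cycles are hypothetical.

```python
def total_cycles(num_ops: int, pipeline_latency: int, dependent: bool) -> int:
    """Cycles needed to complete num_ops operations on one pipelined unit.

    dependent=True  : each operation needs the previous result, so it cannot
                      enter the pipeline until the previous one has completed.
    dependent=False : operations are independent and issue back to back.
    """
    if num_ops == 0:
        return 0
    if dependent:
        return num_ops * pipeline_latency
    return pipeline_latency + (num_ops - 1)


chain = total_cycles(4, 5, dependent=True)        # each op waits the full latency
independent = total_cycles(4, 5, dependent=False)  # ops overlap in the pipeline
```

Under these assumptions, four dependent operations take 20 cycles while four independent operations take 8, which is why reducing the latency seen by dependent operations matters.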
One method to reduce execution unit latencies of dependent operations is shown by R. K. Montoye et al. in an article entitled “Design of the IBM RISC System/6000 Floating-Point Execution Unit”, IBM J. RES. DEVELOP., Vol. 34 No. 1, January 1990. This method eliminates the rounding latency by forwarding a dependent operand, prior to rounding, back to the floating-point unit and performing the operand increment in the multiplier array.
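The arithmetic behind deferring the increment can be illustrated as follows. This is only a sketch of the underlying identity, (x + 1) × y = x × y + y, using integer significands; it does not reproduce the cited implementation, and the function name `multiply_with_deferred_increment` and its parameters are hypothetical.

```python
def multiply_with_deferred_increment(unrounded_sig: int, round_up: bool, other_sig: int) -> int:
    """Multiply a forwarded, not yet rounded significand by another significand.

    If rounding would have incremented the forwarded significand by one unit
    in the last place, the same product is obtained by adding one extra copy
    of the other significand, as an additional partial product would.
    """
    product = unrounded_sig * other_sig
    if round_up:
        product += other_sig   # stands in for the extra row in the multiplier array
    return product


# Forwarded significand 0b10110101 that rounding would have incremented.
deferred = multiply_with_deferred_increment(0b10110101, True, 0b11001001)
rounded_first = (0b10110101 + 1) * 0b11001001
```

Both values are equal, so the dependent multiplication need not wait for the rounding increment to be applied to the operand.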
A latency reduction technique specific to addition operations recognizes that right-shifting of a first addend and normalizing the resulting sum can be mutually exclusive, depending upon the exponent difference and the possibility of massive cancellation that leaves leading zeros in the sum. For addition operations in which the exponents of the addends differ by at most one, a condition referred to as “Near”, the sum may require normalization but the first addend does not require significant right-shifting. For addition operations in which the exponents of the addends differ by more than one, a condition referred to as “Far”, the sum does not require normalization because the possibility of large numbers of leading zeros does not exist, but the first addend may require shifting. Consequently, latency associated with the addition may be reduced by using two paths. One path is associated with the Near condition and one path is associated with the Far condition. In the Near path, normalization occurs but no significant (i.e. greater than one bit) addend shifting is performed. In the Far path, addend shifting is implemented but no normalization is performed. Latency is reduced because addend right-shifting and normalization are never both required in the same path. Note that this technique does not work for a fused multiply/add operation because conditions may exist in which addend shifting and normalization are both required.
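The Near and Far conditions can be checked with a small exhaustive experiment. The sketch below assumes normalized significands of a hypothetical width of 4 bits and considers only effective subtraction of a smaller magnitude from a larger one; the function names are illustrative and not taken from any cited reference.

```python
def classify_path(exp_a: int, exp_b: int) -> str:
    """Select the adder path from the exponent difference."""
    return "near" if abs(exp_a - exp_b) <= 1 else "far"


def leading_zeros_after_subtract(sig_big: int, sig_small: int,
                                 exp_diff: int, precision: int) -> int:
    """Leading zeros in (sig_big * 2**exp_diff - sig_small).

    The count is relative to the width of the aligned larger operand.
    """
    aligned = sig_big << exp_diff
    diff = aligned - sig_small
    width = precision + exp_diff
    return width - diff.bit_length()


def worst_case_leading_zeros(exp_diff: int, precision: int = 4) -> int:
    """Maximum leading-zero count over all pairs of normalized significands."""
    lo, hi = 1 << (precision - 1), 1 << precision
    return max(
        leading_zeros_after_subtract(a, b, exp_diff, precision)
        for a in range(lo, hi)
        for b in range(lo, hi)
        if (a << exp_diff) > b
    )


near_worst = worst_case_leading_zeros(1)   # exponents differ by one
far_worst = worst_case_leading_zeros(2)    # exponents differ by two
```

For this width, an exponent difference of one can leave four leading zeros, so a full normalization is needed, while a difference of two or more leaves at most one leading zero, which a single-bit adjustment can absorb.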
Another floating point latency reduction technique is shown by A. Beaumont-Smith et al. in “Reduced Latency IEEE Floating-Point Standard Adder Architectures”, ChiPTec, Department of Electrical and Electronic Engineering, The University of Adelaide, Adelaide, 5005, Australia. A. Beaumont-Smith et al. show the incorporation of the rounding function into an adder that sums the partial products from the multiplier array. This technique is referred to by A. Beaumont-Smith et al. as “Flagged-Prefix Addition”. Unnormalized results from the adder are forwarded as inputs to the floating point pipeline. The structure is unable to perform both multiplication and addition.
Wolrich et al. teach in U.S. Pat. No. 5,694,350 a rounding adder for a floating point processor. Rounding is performed prior to normalization rather than after by incorporating the rounding function in the adder. Latency may therefore be reduced. Another example of incorporating rounding prior to a normalization step is provided by S. Oberman et al. in “The SNAP Project: Towards Sub-Nanosecond Arithmetic”, Proceedings IEEE 13th International Symposium on Computer Arithmetic, pgs. 148-155, Asilomar, Calif., July 1997.
Skilled artisans appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve the understanding of the embodiments of the present invention. Elements that are common between the figures are given the same element number.