1. Technical Field
The present invention generally relates to processing floating-point arithmetic operations, and in particular to a method and system that eliminates the need for rounding floating-point arithmetic results. More particularly, the present invention relates to a method and system for incorporating a random carry value into a floating-point mantissa operand within computer-driven addition, subtraction, and multiplication operations.
2. Description of the Related Art
For many data processing computations, the range of numbers utilized as operands, intermediate shifted results, and final results is very large. An effective technique for expanding and expressing such large numeric ranges is to express them in floating-point notation. A floating-point number is typically represented in a computer in three parts. The first part is a sign bit indicating whether the number is a positive or a negative value. The second part contains a fraction, often referred to as the mantissa, and the third part designates the position of the radix point and is called the exponent. For example, the decimal number +5123.678 is represented in floating-point notation as +0.5123678 (fraction) and +04 (exponent). The value of the exponent in this example indicates that the actual position of the decimal point is four positions to the right of the indicated decimal point in the fraction. This representation is equivalent to the scientific notation expression +0.5123678×10^(+04). Floating-point representation is useful in many computer-aided computation applications because it increases the range of numbers that can be accommodated by limited register capacities.
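As an illustration of the fraction and exponent convention described above, the following Python sketch splits a decimal value into a sign, a fraction between 0.1 and 1, and an exponent. The function name `to_fraction_exponent` is hypothetical, and the sketch works in decimal for readability rather than in the binary format a processor would use.

```python
from decimal import Decimal


def to_fraction_exponent(text):
    """Split a decimal string into (sign, fraction, exponent), 0.1 <= fraction < 1."""
    value = Decimal(text)
    if value == 0:
        return "+", Decimal(0), 0
    sign = "-" if value.is_signed() else "+"
    magnitude = abs(value)
    # adjusted() gives the exponent for the d.ddd form; the 0.ddd form
    # used in the text places the radix point one position further left.
    exponent = magnitude.adjusted() + 1
    fraction = magnitude.scaleb(-exponent)
    return sign, fraction, exponent
```

For the example value in the text, `to_fraction_exponent("+5123.678")` yields the sign "+", the fraction 0.5123678, and the exponent 4.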
Implementation of floating-point processing systems requires specialized hardware and software capable of implementing floating-point operations in a computer system that is primarily designed to process integers. A standard for binary floating-point arithmetic as implemented within many data processing systems is promulgated by the American National Standards Institute (ANSI) as ANSI/IEEE standard 754-1985, which is incorporated herein by reference. Among the issues addressed in this standard is an approach to rounding floating-point computation results. Rounding takes a number that is otherwise regarded as infinitely precise, such as a floating-point mantissa, and if necessary, modifies it to fit the logistical limitations of the in-memory representation specified by the standard.
The need for rounding floating-point results is particularly evident in computer-aided multiplication operations performed within digital signal processors (DSPs) and floating-point units implemented within minicomputers and microcomputers. In the case of DSP signal processing, floating-point addition/subtraction and multiplication circuits are utilized to perform high-speed addition, subtraction, and multiplication operations.
A block diagram of a conventional floating-point adder in which rounding of floating-point results is utilized is illustrated in FIG. 1 as floating-point adder 100. As depicted in FIG. 1, floating-point adder 100 includes a pair of registers, 102 and 104, which contain the addend and augend mantissa values, Y and X, respectively, of the input addition operands. To properly align the mantissa bits of the operands, a sign and exponent processing unit 115 maintains the respective sign and exponent values for each of the input operands. In the depicted example, such alignment is achieved by shifting the mantissa value (either Y or X) having a smaller exponent value in an alignment shifter 106 in accordance with relative exponent information received from sign and exponent processing unit 115. The relative position of the bits for the mantissa value which has a smaller corresponding exponent is shifted to the right within alignment shifter 106 by the difference in exponent values between the two operands. In order to maintain acceptable precision, operand alignment shifter 106 must maintain sufficient bit positions to accommodate the shift without losing the original mantissa bit values.
The aligned mantissa operands are added within a carry propagate adder 108 and the resultant sum is applied to a leading zero detector 110 and a result normalize shifter 112 as well as sign and exponent processing unit 115. Leading zero detector 110 detects the number of leading zeros within the sum value within carry propagate adder 108 and delivers a corresponding exponent adjustment value to sign and exponent processing unit 115 such that the bit positions of the mantissa sum value can be shifted within result normalize shifter 112 by the number of bits specified in the adjustment value as required by the binary point convention implemented by floating-point adder 100.
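The alignment, addition, and normalization sequence described for floating-point adder 100 can be sketched in Python as follows. This is an illustrative model only, not the circuit of FIG. 1. The width `MANT_BITS` and the function name `align_add_normalize` are assumptions, operands are treated as positive values of the form mantissa × 2^exponent, and effective subtraction (where many leading zeros can appear) is omitted.

```python
MANT_BITS = 8  # hypothetical mantissa width, not taken from the text


def align_add_normalize(mx, ex, my, ey):
    """Add two positive operands m * 2**e with MANT_BITS-wide mantissas.

    Returns (mantissa, exponent) holding the exact, still unrounded sum;
    the mantissa may be wider than MANT_BITS.
    """
    # Alignment: the operand with the smaller exponent moves right by the
    # exponent difference. The frame is widened by `shift` low-order
    # positions so that no original mantissa bit is lost.
    if ex < ey:
        mx, ex, my, ey = my, ey, mx, ex
    shift = ex - ey
    frame_bits = MANT_BITS + shift + 1   # +1 for a possible carry-out
    total = (mx << shift) + my           # carry propagate add at exponent ey

    # Leading zero detection within the frame, then normalization: shift
    # left until the top frame bit is set and lower the exponent to match.
    leading_zeros = frame_bits - total.bit_length()
    normalized = total << leading_zeros
    return normalized, ey - leading_zeros
```

The returned mantissa occupies the full widened frame, which is why a later step must reduce it to the width of the result register.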
The intermediate shifting function of operand alignment shifter 106 may need to shift the operand having a smaller exponent by up to the difference between the maximum representable exponent and the minimum representable exponent. Implementations of IEEE compliant adders extend the width of the shifted operand by two places and logically OR the mantissa bits which have been shifted beyond this extended register. The result of the ORing of the bits shifted beyond the two additional positions is referred to as a “sticky bit.”
Thus, in order to accurately compute the final mantissa value, the adder must be three bits wider than the desired result width and additional register space is required to accommodate the additional bits in the shifted operand mantissa. In addition, the sticky bit must be computed and maintained during the add. Substantial hardware and processing overhead are required to accommodate this approach to maintaining floating-point accuracy. Referring back to FIG. 1, a rounding circuit 114 is required to reduce the relatively large number of bits in the floating-point mantissa result within result normalize shifter 112 to conform to a predetermined mantissa convention within a result register 116.
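The two extra positions and the sticky bit described above can be modeled with a few integer operations. The Python sketch below is a hedged illustration; the function name `shift_with_sticky` is hypothetical and the mantissa is assumed to be an unsigned integer.

```python
def shift_with_sticky(mantissa, shift):
    """Right-shift `mantissa` by `shift`, keeping two extra bits and a sticky bit.

    Returns (extended, sticky). `extended` is the shifted mantissa followed
    by the two extra low-order positions; `sticky` is the logical OR of every
    bit shifted beyond those two positions.
    """
    widened = mantissa << 2                  # append the two extra positions
    extended = widened >> shift
    lost = widened & ((1 << shift) - 1)      # bits pushed past the extension
    sticky = 1 if lost else 0
    return extended, sticky
```

For example, shifting the mantissa 0b10110101 right by four positions gives the extended value 0b101101 (the bits 1011 followed by the extra bits 01) with a sticky bit of 1, because a nonzero bit was shifted beyond the extended register.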
With reference to FIG. 2, there is depicted a conventional floating-point multiplier 200 that, like floating-point adder 100, requires rounding of the mantissa result. In the example depicted in FIG. 2, an n-bit multiplicand mantissa, Y, is multiplied by an n-bit multiplier mantissa, X, with a sign and exponent processing unit 225 maintaining the respective sign and exponent values for each of the input operands. A pair of registers, 202 and 204, store multiplicand and multiplier mantissas Y and X, respectively, until floating-point multiplier 200 receives a “multiply” instruction, whereupon registers 202 and 204 deliver the mantissa operands to the multiplication circuitry within floating-point multiplier 200.
The functionality of floating-point multiplication circuit 200 may be divided into two stages. The first stage includes a partial product generator 206 and a partial product reduction circuit 208, wherein partial product generation and carry-save addition (reduction) are performed. The logic required to implement the functionality of partial product generator 206 is well-known in the art and is therefore not explicitly depicted in FIG. 2. The second stage includes a carry propagate adder 214 that adds the redundant-form product to generate a binary product.
Partial product generator 206 processes multiplicand mantissa Y and multiplier mantissa X to produce a partial product matrix (PPM) such as PPM 302 illustrated in FIG. 3. The individual partial product terms (represented as dots within PPM 302) are generated by partial product generator 206 in accordance with an algorithm selected from among many suitable algorithms known to those skilled in the art. In the depicted example, PPM 302 is an n-by-n array reflecting n-bit input mantissa operands.
Referring to FIG. 2 in conjunction with FIG. 3, PPM 302 is sequentially compressed utilizing a series of offset additions into a first reduced array 304 and a second reduced array 306 by a series of counter/compressors 210a-210n within partial product reduction unit 208. Assuming n-bit operands, counter/compressors 210a-210n must include at least 2n-bit wide registers to process PPM 302. Second reduced array 306 comprises the redundant (carry-save) form of a sum row 308 and a carry row 307, which are added within carry propagate adder 214 to form a 2n-bit product result 310. Similar to floating-point adder 100, floating-point multiplier 200 includes a result normalize shifter 216 that shifts result mantissa 310 in accordance with leading zero information from a leading zero detector 212.
The exponent value for result mantissa 310 is adjusted accordingly by sign and exponent processing unit 225. Similar to the sum mantissa result within result normalize shifter 112 of floating-point adder 100, product mantissa result 310 within normalize shifter 216 may require rounding to conform to a predetermined mantissa convention within a result register 220.
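The two multiplier stages can be illustrated with the following Python sketch. It assumes the simplest partial product scheme (one shifted copy of the multiplicand per multiplier bit) and a serial chain of 3:2 carry-save steps rather than the specific reduction arrangement of FIG. 3; the function names are hypothetical.

```python
def partial_products(y, x, n):
    """One row per multiplier bit: the multiplicand shifted into place, or zero."""
    return [(y << i) if (x >> i) & 1 else 0 for i in range(n)]


def carry_save_add(a, b, c):
    """3:2 compression applied bitwise: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def multiply_mantissas(y, x, n):
    """Multiply two n-bit mantissas, returning the product of up to 2n bits."""
    rows = partial_products(y, x, n)
    # First stage: reduce the rows until only the redundant
    # (carry-save) pair of a sum row and a carry row remains.
    while len(rows) > 2:
        sum_row, carry_row = carry_save_add(rows[0], rows[1], rows[2])
        rows = rows[3:] + [sum_row, carry_row]
    # Second stage: a single carry propagate addition of the final rows.
    return sum(rows)
```

Each carry-save step preserves the arithmetic total of the rows while removing one row, so carries are propagated only once, in the final addition.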
As applicable to the adder and multiplier implementations depicted in FIGS. 1 and 2, one approach to rounding floating-point results is to simply truncate the result at a designated bit position. This approach is generally unacceptable for most DSP and advanced floating-point operations because it introduces a bias toward zero over multiple operations. An alternate approach to floating-point rounding, known as “round to nearest,” is set forth in ANSI/IEEE standard 754-1985. In accordance with the round to nearest rounding technique, the representable value nearest to the infinitely precise result is delivered as the result. Furthermore, if the two nearest representable results are equally near, the one whose least significant bit is zero is delivered as the result.
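The difference between truncation and round to nearest can be shown on unsigned integer mantissas. The Python sketch below is illustrative only; the function names and the parameter `drop` (the number of low-order bits to discard) are assumptions, and the sketch does not handle the renormalization needed when rounding up carries out of the mantissa width.

```python
def truncate(value, drop):
    """Discard the `drop` low-order bits; the magnitude never increases."""
    return value >> drop


def round_to_nearest_even(value, drop):
    """Round to the nearest representable value; break ties toward an even LSB."""
    if drop == 0:
        return value
    kept = value >> drop
    remainder = value & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    # Round up when above the halfway point, or exactly halfway with an odd LSB.
    if remainder > half or (remainder == half and kept & 1):
        kept += 1
    return kept
```

Discarding two bits from 0b10110 is a tie: truncation gives 0b101, while round to nearest gives 0b110, the neighbor whose least significant bit is zero.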
The round to nearest rounding technique is accurate and avoids any systematic reduction or inflation of the absolute value of a floating-point mantissa result. However, implementing it incurs a considerable cost in additional computation circuitry and latency. The additional hardware overhead and latency are particularly problematic when imposed on already expensive multiplier designs, which are computation-intensive and hardware-intensive by nature.
From the foregoing, it can be appreciated that a need exists for an improved method and system for producing floating-point results in which acceptable precision is maintained within a limited number of bit positions. The present invention addresses such a need.