1. Field of the Invention
The present invention relates to a floating-point adder-subtractor with inputs of three terms, and more particularly, to a three-term input floating-point adder-subtractor capable of obtaining the same result as that obtained when the operation is carried out with infinite precision and the result is then rounded.
2. Description of the Related Art
Floating-point operations are now widely used for scientific and technological calculations, graphics processing and the like, and many microprocessors therefore incorporate a floating-point operation unit. Among these operations, addition and subtraction are the basic ones, and binary operation units, which operate on two inputs, are generally used.
Floating-point operations are defined by the widely used IEEE754 standard. The standard requires that addition and subtraction with inputs of two terms produce the result that would be obtained if the operation were carried out with infinite precision and then rounded in accordance with the standard. A general method for achieving this with an operation unit having only a finite bit width is to append three bits, namely a guard bit, a round bit and a sticky bit, to the lower digits of the mantissa; the method is described, for example, in Milos D. Ercegovac “Digital Arithmetic”.
FIG. 12 illustrates states of mantissas in a binary operation process defined by the standard IEEE754.
Suppose that the mantissas of a first operand and a second operand before an operation are as shown in FIG. 12(a). In all drawings explained hereinafter, “x” denotes an arbitrary value of 0 or 1. In a binary operation, the difference between the exponents of the first operand and the second operand is calculated, and the larger exponent is taken as the exponent of the intermediate value. The mantissa of the operand having the smaller exponent is right-shifted by this difference in exponents. As is conventionally well known, the right shift is made after appending three bits, namely a guard bit, a round bit and a sticky bit, to the lower digits of the mantissa so that rounding can be done precisely. Here, the sticky bit is the logical OR of all the bits discarded by the shift. The mantissas of the two terms are then as shown in FIG. 12(b).
Next, an operation is carried out using a binary adder-subtractor on the right-shifted bit sequence and the bit sequence having the larger exponent. The result of this addition or subtraction is normalized by left-shifting it until the most significant bit (MSB) becomes 1, and the exponent of the intermediate value is updated at the same time. Rounding is then performed in a rounding mode compliant with the IEEE standard. The normalized mantissa and the mantissa resulting from rounding are as shown in FIG. 12(c).
In the case of a binary operation, carrying out addition and subtraction as described above yields the same result as carrying out the operation with infinite precision and then rounding.
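The two-term procedure above can be illustrated by a minimal Python sketch. It is not taken from the specification: the 8-bit mantissa width `P`, the function names, the restriction to the addition of two positive normalized operands, and the use of round-to-nearest are all assumptions made for illustration. Because the operands are added, normalization here handles a carry-out by a one-bit right shift rather than a left shift. A reference function computes the same sum with exact rationals and then rounds, so the two can be compared.

```python
from fractions import Fraction

P = 8  # mantissa width in bits (an assumption chosen for illustration)

def align_with_grs(mant, shift):
    """Append guard, round, sticky bits to a P-bit mantissa, then right-shift."""
    ext = mant << 3                          # G, R, S start as 0
    kept = ext >> shift
    discarded = ext & ((1 << shift) - 1)
    return kept | (1 if discarded else 0)    # sticky = OR of all discarded bits

def add_rn(ma, ea, mb, eb):
    """Add ma*2**ea and mb*2**eb (both positive, normalized), round to nearest."""
    if ea < eb:                              # make (ma, ea) the larger-exponent term
        ma, ea, mb, eb = mb, eb, ma, ea
    total = (ma << 3) + align_with_grs(mb, ea - eb)
    exp = ea
    if total >> (P + 3):                     # carry-out: shift right, fold into sticky
        total = (total >> 1) | (total & 1)
        exp += 1
    mant, grs = total >> 3, total & 0b111
    # Round up when above the halfway point, or exactly halfway with an odd LSB.
    if grs > 0b100 or (grs == 0b100 and mant & 1):
        mant += 1
        if mant >> P:                        # rounding overflowed the mantissa
            mant >>= 1
            exp += 1
    return mant, exp

def reference_rn(ma, ea, mb, eb):
    """Infinite-precision sum with exact rationals, then round to nearest even."""
    value = Fraction(ma) * Fraction(2) ** ea + Fraction(mb) * Fraction(2) ** eb
    exp = max(ea, eb)
    if value >= Fraction(2) ** (exp + P):
        exp += 1
    scaled = value / Fraction(2) ** exp      # lies in [2**(P-1), 2**P)
    mant = scaled.numerator // scaled.denominator
    rem = scaled - mant
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and mant & 1):
        mant += 1
        if mant >> P:
            mant >>= 1
            exp += 1
    return mant, exp
```

Under these assumptions, `add_rn` and `reference_rn` return the same `(mantissa, exponent)` pair for every pair of normalized operands, which is the property the paragraph above describes for the two-term case.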
FIG. 13 illustrates the rounding modes compliant with the IEEE standard. IEEE754 defines four rounding modes: RN (nearest value), RZ (zero direction), RP (+infinity direction) and RM (−infinity direction). FIG. 13 shows, for each rounding mode, whether the mantissa is rounded up or rounded down depending on the sign of the result of binary addition and the values of the least significant bit (LSB), guard bit, round bit and sticky bit of the mantissa. Since these are well known, further explanation thereof is omitted here.
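Since FIG. 13 is not reproduced here, the following Python sketch restates the commonly known rounding decision rather than the figure itself. The function name `rounds_up`, the string mode labels and the argument order are hypothetical; "rounding up" is taken to mean adding one LSB to the magnitude of the mantissa, and RN is assumed to break ties toward an even LSB.

```python
def rounds_up(mode, negative, lsb, g, r, s):
    """Return True if one LSB is added to the mantissa magnitude.

    mode     -- "RN", "RZ", "RP" or "RM" (labels as used in the text)
    negative -- True if the result of the addition is negative
    lsb, g, r, s -- least significant, guard, round and sticky bits (0 or 1)
    """
    inexact = bool(g or r or s)
    if mode == "RN":                 # nearest value, ties to even
        return bool(g and (r or s or lsb))
    if mode == "RZ":                 # toward zero: always truncate
        return False
    if mode == "RP":                 # toward +infinity: grow positive results only
        return inexact and not negative
    if mode == "RM":                 # toward -infinity: grow negative results only
        return inexact and negative
    raise ValueError("unknown rounding mode: " + str(mode))
```

Note that RZ, RP and RM depend only on whether any of the three extra bits is set and on the sign, whereas RN also distinguishes an exact tie (G=1, R=S=0) from values above the halfway point.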
In contrast to the above described addition and subtraction with inputs of two terms, addition and subtraction with inputs of three terms has the following advantages. (1) In processing that carries out many additions, such as computing the sum total of the elements of a vector, the speed of processing can be enhanced by adding three terms rather than two terms at a time. (2) Knuth's formula for calculating the sum of two floating-point numbers with no errors (see Donald E. Knuth “The Art of Computer Programming: Seminumerical Algorithms (Vol 2, 3rd Ed)”) requires six floating-point additions and subtractions, whereas only two floating-point additions and subtractions are enough if three-term input floating-point addition can be used; this enhances the speed of multiple-precision operation, which makes heavy use of such operations.
FIG. 14 illustrates how two-term input operation and three-term input operation are carried out to calculate the sum x+y of floating-point numbers x and y with no errors according to Knuth's formula. Operation using an operation unit with inputs of two terms requires six floating-point additions and subtractions. On the other hand, carrying out the same operation using an operation unit with inputs of three terms requires only two floating-point additions and subtractions.
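The operation counts above can be illustrated with a short Python sketch. The six-operation sequence is the well-known error-free summation attributed to Knuth; the two-operation variant is one plausible reading of FIG. 14, not a reproduction of it. The helper `add3` is hypothetical: it emulates a correctly rounded three-term adder in software by summing exact rationals and rounding once to double precision.

```python
from fractions import Fraction

def two_sum(x, y):
    """Error-free sum with two-term operations: six additions/subtractions."""
    s = x + y                 # 1: rounded sum
    bb = s - x                # 2: the part of y that reached s
    aa = s - bb               # 3: the part of x that reached s
    e = (x - aa) + (y - bb)   # 4, 5, 6: rounding error of s
    return s, e

def add3(x, y, z):
    """Hypothetical three-term adder: exact sum, then a single rounding."""
    return float(Fraction(x) + Fraction(y) + Fraction(z))

def two_sum_three_term(x, y):
    """Error-free sum using only two operations, given a three-term adder."""
    s = x + y                 # 1: rounded sum
    e = add3(x, y, -s)        # 2: x + y - s, computed with one rounding
    return s, e
```

The second version relies on the three-term operation being precise in the sense discussed in this section: `x + y - s` involves the near-cancellation of large terms, so an adder that loses low-order bits in that situation would return a wrong error term.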
However, obtaining a precise result in floating-point addition and subtraction is more difficult with a three-term input operation than with a two-term input operation. Here, a “precise result” means the same result as that obtained when the operation is carried out with infinite precision and rounding is then performed.
Next, cases will be explained in which a precise result cannot be obtained when the two-term operation method is extended to a three-term operation without modification. In the following explanations, suppose the first operand of the three-term input is a, the second operand is b and the third operand is c.
FIG. 15 illustrates an example where a precise result cannot be obtained because two out of three terms cancel out each other.
Suppose that two out of three terms cancel out each other in an operation, for example, that a=−b and |a|, |b|>|c|. Since a and b cancel out each other, the correct result is c. However, since the absolute value of c is smaller than those of a and b as shown in FIG. 15, c is right-shifted and its discarded bits are collapsed into the sticky bit, so that information necessary for the precise result is lost. A precise result therefore cannot be obtained by simply extending the calculation method of an operation with two terms to an operation with three terms.
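The loss of information in this case can be made concrete with a small Python sketch. The 8-bit mantissa width, the `(sign, mantissa, exponent)` tuple layout, the function names and the sample operands are assumptions for illustration only. The sketch aligns all three terms to the largest exponent using guard, round and sticky bits, exactly as in the two-term method, and compares the aligned sum with the exact integer sum.

```python
P = 8  # mantissa width in bits (an assumption chosen for illustration)

def align_with_grs(mant, shift):
    """Append guard, round, sticky bits to a P-bit mantissa, then right-shift."""
    ext = mant << 3
    kept = ext >> shift
    discarded = ext & ((1 << shift) - 1)
    return kept | (1 if discarded else 0)    # sticky = OR of all discarded bits

def naive_sum3(terms):
    """Aligned signed sum of three terms, in units of 2**(emax - 3)."""
    emax = max(exp for _, _, exp in terms)
    return sum(sign * align_with_grs(mant, emax - exp)
               for sign, mant, exp in terms)

def exact_sum3(terms):
    """Exact integer value of the sum (all exponents are non-negative here)."""
    return sum(sign * mant * 2 ** exp for sign, mant, exp in terms)

a = (+1, 0b10000000, 20)
b = (-1, 0b10000000, 20)     # b = -a, so a + b cancels completely
c1 = (+1, 0b10110101, 0)     # two different small third operands
c2 = (+1, 0b11111111, 3)

print(naive_sum3([a, b, c1]), exact_sum3([a, b, c1]))
print(naive_sum3([a, b, c2]), exact_sum3([a, b, c2]))
```

In both calls the aligned sum is a lone sticky bit, although the exact results are different values of c. Once c has been reduced to a single sticky bit, no subsequent normalization or rounding step can recover its mantissa.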
FIG. 16 illustrates an example where loss of significant digits in an operation with two out of three terms having large absolute values prevents a precise result from being obtained.
For example, suppose that a≈−b and |a|, |b|>|c|. Loss of significant digits occurs in the operation on a and b. As shown in FIG. 16, since the absolute value of c is smaller than those of a and b, c is right-shifted and its discarded bits are collapsed into the sticky bit, losing information necessary for the precise result. A precise result therefore cannot be obtained by simply extending the calculation method of an operation with two terms to an operation with three terms.
The example explained using FIG. 15 can be regarded as a special case of the example explained using FIG. 16. In the example shown in FIG. 15, however, a+b=0, and therefore the precise result must be c irrespective of the magnitude of the absolute value of c, so that all the bits of c are required. In the example shown in FIG. 16, on the other hand, not all the bits of c are necessarily required, depending on the magnitude of the absolute value of c. The example shown in FIG. 15 and the example shown in FIG. 16 are therefore treated as separate cases.
FIG. 17 illustrates an example where two out of three terms have bits discarded due to a shift, preventing a precise result from being obtained. The example explained here assumes that |a|>>|b|, |c|.
As shown in FIG. 17, both b and c are right-shifted so far that bits are shifted out, and the discarded bits of each term are collapsed into its sticky bit. In this case, an operation is carried out between the sticky bits, and the result may propagate to higher digits, preventing a precise result from being obtained. Even when the result does not propagate to higher digits, information on the magnitude relationship between |b| and |c| is lost, for example when |b|≈|c|, and rounding cannot be performed correctly in the rounding modes RZ (zero direction), RP (+infinity direction) and RM (−infinity direction) defined by the IEEE754 standard.
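Both effects described above can be reproduced with a short Python sketch. As before, the 8-bit mantissa width, the `(sign, mantissa, exponent)` tuples, the function names and the sample operands are assumptions for illustration, and the alignment is the unmodified two-term method applied to three terms.

```python
P = 8  # mantissa width in bits (an assumption chosen for illustration)

def align_with_grs(mant, shift):
    """Append guard, round, sticky bits to a P-bit mantissa, then right-shift."""
    ext = mant << 3
    kept = ext >> shift
    discarded = ext & ((1 << shift) - 1)
    return kept | (1 if discarded else 0)    # sticky = OR of all discarded bits

def naive_sum3(terms):
    """Aligned signed sum of three terms, in units of 2**(emax - 3)."""
    emax = max(exp for _, _, exp in terms)
    return sum(sign * align_with_grs(mant, emax - exp)
               for sign, mant, exp in terms)

def exact_residual(b, c):
    """Exact value of b + c (exponents are non-negative here)."""
    return sum(sign * mant * 2 ** exp for sign, mant, exp in (b, c))

a = (+1, 0b10000000, 40)                     # |a| >> |b|, |c|

# Opposite signs, |b| slightly larger or slightly smaller than |c|.
above = [a, (+1, 201, 0), (-1, 200, 0)]      # exact sum is just above a
below = [a, (+1, 200, 0), (-1, 201, 0)]      # exact sum is just below a
print(naive_sum3(above), naive_sum3(below))  # identical aligned sums

# Same signs: the two sticky bits add and carry into the round bit.
same = [a, (+1, 201, 0), (+1, 200, 0)]
print(bin(naive_sum3(same) & 0b111))
```

In the first two cases the sticky bits cancel and the aligned sum equals a exactly, although the exact sum lies slightly above a in one case and slightly below it in the other. In the RZ mode, for instance, the second case should yield the representable value immediately below a, which the aligned sum cannot indicate. In the third case the sum of two sticky bits sets the round bit, so a bit that merely flags discarded digits is treated as a digit with numerical weight.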
A technique described in JP-A 9-204294 or the like is known as prior art relating to a three-term input floating-point operation unit. This prior art provides a gap bit between the round bit and the sticky bit to prevent carries from propagating to higher digits due to addition between sticky bits. However, when two out of three terms cancel out each other as described with reference to FIG. 15, and when loss of significant digits occurs in an operation on two out of three terms having large absolute values as described with reference to FIG. 16, precise results cannot be obtained. Furthermore, in the example explained with reference to FIG. 17, where two out of three terms have bits discarded due to a shift, there may be no problem when the rounding mode is RN (nearest value), but in the other rounding modes the problem cannot be solved completely, and it is not possible to obtain the same result as that obtained when rounding is carried out after the operation is carried out with infinite precision.
As described above, in the case of a three-term input operation, there is a problem that applying the method of binary operation without modification makes it difficult to obtain a precise result in floating-point addition and subtraction.
It is an object of the present invention to provide a three-term input floating-point adder-subtractor capable of solving the above described problems of the prior art and of obtaining, even in cases where it is difficult to carry out three-term input floating-point addition and subtraction precisely, the same result as that obtained when rounding is performed after the operation is carried out with infinite precision.