In writing procedures to compute accurate floating-point function values, in particular correctly-rounded function values, it is desirable to carry out calculations in such a way that intermediate results are represented by more than one floating point value.
For example, in computing a value of a function by polynomial methods in which intermediate accuracy of about 150 bits is necessary, it is possible to represent a quantity with 150 bits of accuracy by three double precision values, each containing a 50-bit sequence. The value of the quantity is represented implicitly as the sum of the double precision values. The mathematical operations of addition and multiplication are used in evaluating polynomials, so it is desirable to have efficient methods of performing these operations on numbers represented in the three-double format. In taking the product of two such three-double format numbers (a, b, c) and (d, e, f) to form a product represented by (g, h, i), FIG. 1 shows the different terms needed. Each 50-bit piece is represented by a box. Each intermediate product requires two boxes. Thus, for example, the product of c and f is shown in FIG. 1 as two separate adjacent boxes labeled CF. The left-hand box contains the more significant portion and the right-hand box contains the less significant portion of the intermediate product.
A good approximation to the desired product can be obtained by dropping the low-order 150 bits and computing:

i_temp = high50(af) + high50(be) + high50(cd) + low50(ae) + low50(bd)
i = low50(i_temp)
h_temp = high50(ae) + high50(bd) + low50(ad) + high50(i_temp)
h = low50(h_temp)
g = high50(ad + high50(h_temp))
The operations "high50" and "low50" compute the first 50 bits and the second 50 bits of a product or sum. The reason for choosing sequences of 50 bits in this particular case is that there is one column where five numbers are added together. By limiting each of these numbers to a 50-bit range, the sum is exactly representable in the double precision format, without overflow. This is because 2^53 / 2^50 = 2^3 = 8 and 8 > 5.
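As an illustration only, the column scheme for the product (g, h, i) can be modeled with exact integers. The following Python sketch assumes each piece is a non-negative integer below 2^50 and models "high50" and "low50" as a shift and a mask. The helper names mul3 and to_int are hypothetical and do not come from the source. In this integer model the carry high50(h_temp) is added directly to high50(ad), which stands in for the aligned sum in the expression for g.

```python
MASK50 = (1 << 50) - 1


def high50(p):
    """Upper part of p: everything above the low 50 bits."""
    return p >> 50


def low50(p):
    """Low 50 bits of p."""
    return p & MASK50


def to_int(t):
    """Value of a three-piece number (most significant piece first)."""
    return (t[0] << 100) + (t[1] << 50) + t[2]


def mul3(x, y):
    """Approximate product of two three-piece numbers, dropping the low 150 bits."""
    a, b, c = x
    d, e, f = y
    # Column with five addends, each below 2^50.
    i_temp = high50(a * f) + high50(b * e) + high50(c * d) + low50(a * e) + low50(b * d)
    # The sum stays below 2^53, so a double could hold it exactly.
    assert i_temp < (1 << 53)
    i = low50(i_temp)
    # Next column: three partial terms plus the carry from the column above.
    h_temp = high50(a * e) + high50(b * d) + low50(a * d) + high50(i_temp)
    h = low50(h_temp)
    # Leading column: high half of a*d plus the carry out of h_temp.
    g = high50(a * d) + high50(h_temp)
    return g, h, i
```

Because the dropped partial terms sum to less than six units of the last retained bit, the result in this model falls short of the exact product, truncated by 150 bits, by at most five units in the last place of i.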
Multiple precision arithmetic is often used in interval arithmetic in which the value of a quantity is guaranteed to lie in an interval between two different numerical values. A short interval indicates a number whose value is closely bounded, while a large interval indicates a number whose value is not well known. During normal arithmetic operations on numbers represented by intervals, the sizes of the intervals tend to grow, and multiple precision operations are often needed for a final result to have a tight interval.
In other instances, such as function evaluation, it is often desirable to be able to quickly scale the exponent of a number so that it lies within a certain range. For example, in computing the natural logarithm of a number "x", it is often desirable to be able to quickly scale "x" by a power of two so that it lies between the numbers 1 and 2. This can be performed for normalized numbers "x" in IEEE 754-1985 format by modifying the exponent bits without changing the sign or the mantissa.
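The exponent scaling just described can be sketched in Python by reinterpreting a double as a 64-bit integer. This is a minimal illustration, assuming the standard double layout (1 sign bit, 11 exponent bits with bias 1023, 52 fraction bits). The function name scale_to_unit_binade is hypothetical.

```python
import struct

EXP_MASK = 0x7FF << 52  # the 11 exponent bits of a double
BIAS = 1023


def scale_to_unit_binade(x):
    """Return (s, k) with x == s * 2**k and 1 <= |s| < 2, for normalized x."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exp_field = (bits >> 52) & 0x7FF
    if exp_field == 0 or exp_field == 0x7FF:
        # Zero, denormalized numbers, infinities and NaNs are outside this sketch.
        raise ValueError("x must be a normalized number")
    k = exp_field - BIAS
    # Overwrite only the exponent field; sign and fraction bits are untouched.
    scaled_bits = (bits & ~EXP_MASK) | (BIAS << 52)
    s = struct.unpack("<d", struct.pack("<Q", scaled_bits))[0]
    return s, k
```

For a logarithm, ln(x) could then be assembled as ln(s) + k * ln(2), with s confined to the interval from 1 to 2.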
In the prior art, multiple precision arithmetic has been performed using either the normal integer operations of addition, subtraction, multiplication and shifting, or the normal floating-point operations of conversion to single precision format, addition, subtraction, multiplication, and a fused multiply-add instruction, in which the result is obtained, in effect, by rounding the exact result of multiplying two numbers and adding a third.
There are disadvantages to using integer arithmetic. While the alignment of the product or sum of two numbers is predetermined and fixed when using such arithmetic, the number of bits which are produced by either a product or sum is unfortunately also predetermined and fixed. For example, a 32-bit by 32-bit multiply produces a 64-bit result. In one architecture, a single instruction may return a 64-bit result, while in another architecture, one instruction may return the low-order bits and another instruction the high-order bits. If, as in the example described above, it would be more convenient to work with a smaller number of bits, that option is available only with the addition of extra logical operations to shift or mask undesired bits from the product representation. Compounding this difficulty, typical integer units do not pipeline integer multiply instructions, exponents must be represented by bits in other computer words and manipulated separately, and the number of bits in an integer word may be smaller than the number of bits in a floating-point word, so that more integer words than floating-point words are needed. Finally, if parts of a result are to be used as floating-point numbers, it is necessary to convert them from fixed-point format to floating-point format.
There are also disadvantages to using the normal floating-point arithmetic operations. The alignment of a floating-point result is data-dependent, since it depends on the value associated with the most-significant bit of the result. The length of a result is not data-dependent, but it cannot be varied by the programmer. Finally, the result which is returned is a rounded result, so that it is not uncommon for a number to be represented implicitly as the sum of two numbers of opposite sign. In spite of this, it is possible to implement multiple-precision arithmetic using these operations. For example, it is possible to take a number "x" which has a magnitude less than 2^50 and compute its integer and fractional parts by the sequence:

temp = x + big
integer_part = temp - big
fractional_part = x - integer_part
If the rounding mode is round-to-zero, these instructions will do the job if the number "big" represents three times 2^51 (assuming IEEE 754-1985 arithmetic). However, the second instruction depends on the result of the first, and the third depends on the result of the second. This means that a significant delay will take place in evaluating these terms in systems with pipelined Floating Point Units (FPUs).
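The three dependent steps (temp = x + big, integer_part = temp - big, fractional_part = x - integer_part) can be tried directly in Python, whose floats are IEEE doubles. Note the assumption: Python offers no portable way to select round-to-zero, so this sketch runs under the default round-to-nearest-even mode, and integer_part is therefore the nearest integer rather than the truncated one. The function name is hypothetical.

```python
BIG = 3.0 * 2.0**51  # doubles near this magnitude are spaced exactly 1 apart


def split_integer_fraction(x):
    """Split x into an integer part and a remainder using only add and subtract.

    Assumes |x| < 2**50 and the default round-to-nearest-even mode, so the
    integer part is the nearest integer (not the truncation toward zero that
    a round-to-zero mode would give).
    """
    if not abs(x) < 2.0**50:
        raise ValueError("x must have magnitude below 2**50")
    temp = x + BIG                      # fraction bits of x are rounded away
    integer_part = temp - BIG           # depends on temp
    fractional_part = x - integer_part  # depends on integer_part
    return integer_part, fractional_part
```

Each statement consumes the result of the one before it, which is the dependency chain that stalls a pipelined FPU.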
Computing the most and least significant parts of a product is much more tedious. The problem is simplified if there is a fused multiply-add instruction. If "x" and "y" are two numbers less than 2^50, the high and low parts may be computed as:

temp = x * y + really_big
high = temp - really_big
low = x * y - high
Here, really_big represents the number 3 * 2^101. Each instruction still depends on the result of the preceding one, so all of the latency problems associated with extracting a certain portion of a single number are present. In other instances, use of the fused multiply-add to separate a product into two parts which add up to the exact result may be less useful, because different terms may have different alignments (exponents), and as a result it is not easy to perform addition in a straightforward fashion on high and low parts coming from different products. One problem is that a programmer has no direct way to control the number of bits or the alignment of bits which are in one part or the other.
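The fused multiply-add sequence (temp = x * y + really_big, high = temp - really_big, low = x * y - high) can be imitated in Python as a rough sketch. Two assumptions are made here: the fused operation is emulated with exact rational arithmetic from the standard fractions module rather than a hardware instruction, and rounding is round-to-nearest-even, so "low" may come out negative. The names fma and split_product are hypothetical.

```python
from fractions import Fraction

REALLY_BIG = 3.0 * 2.0**101  # doubles near this magnitude are spaced 2**50 apart


def fma(x, y, z):
    """Emulated fused multiply-add: x*y + z computed exactly, then rounded once."""
    return float(Fraction(x) * Fraction(y) + Fraction(z))


def split_product(x, y):
    """Split x*y into a multiple of 2**50 (high) and the remainder (low).

    Assumes x and y are non-negative integers below 2**50 stored as doubles.
    """
    temp = fma(x, y, REALLY_BIG)  # low 50 bits of the product are rounded away
    high = temp - REALLY_BIG      # depends on temp
    low = fma(x, y, -high)        # depends on high; exact for integer inputs
    return high, low
```

For example, with x = 2^49 + 1 and y = 2^49 + 3 the exact product is 2^98 + 2^51 + 3, which splits into high = 2^98 + 2^51 and low = 3.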
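Without a fused multiply-add instruction, the standard method for getting parts of a product of numbers "x" and "y" is a sequence such as:

x_high = single(x)
y_high = single(y)
x_low = x - x_high
y_low = y - y_high
xy_high = x_high * y_high
xy_low = x_high * y_low + y_high * x_low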
This method uses a conversion to single precision to reduce the number of significant bits in "x" and "y", and then takes the leading terms obtained by expanding (x_high + x_low) * (y_high + y_low). As seen, it is more complicated, has more instructions which depend on earlier results, does not allow definite alignment of either the high part or the low part, and extends to higher precision only with more tedious and time-consuming operations.
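The single-precision splitting method can likewise be sketched in Python. Here "single" is modeled by a round trip through the 4-byte float format of the standard struct module, which is an assumption about how the conversion instruction behaves (round-to-nearest). The function names are hypothetical.

```python
import struct


def single(x):
    """Round a double to the nearest single-precision value, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def split_product_single(x, y):
    """Leading terms of (x_high + x_low) * (y_high + y_low).

    The x_low * y_low term is omitted, so xy_high + xy_low only approximates x*y.
    """
    x_high = single(x)
    y_high = single(y)
    x_low = x - x_high
    y_low = y - y_high
    xy_high = x_high * y_high  # 24-bit by 24-bit product, exact in a double
    xy_low = x_high * y_low + y_high * x_low  # rounded in general
    return xy_high, xy_low
```

With x = 1 + 2^-30 and y = 1 + 2^-40, the sketch yields xy_high = 1 and xy_low = 2^-30 + 2^-40, and the omitted term x_low * y_low equals 2^-70.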