1. Field of the Invention
The present invention relates to a floating-point accumulator, and particularly to an accumulator for accumulating floating-point numbers of limited word length, applied in signal processing, the simulation of physical phenomena and the like.
2. Description of the Related Art
Accumulators have been widely used in image processing and other fields. For example, in a digital filter, an accumulator is used in the operation of multiplying each of plural pixel values by a specific coefficient and accumulating the products.
Heretofore, as shown in FIG. 11, such an accumulator has been constituted by a register 111 for holding data and an adder 110 for adding the input data to be accumulated to the data held in the register 111, and data is accumulated in the order of input.
That is, as shown in FIG. 11, the data to be accumulated is input in order from an input 100 in synchronization with a clock signal (not shown), and the register 111 holds the output of the adder 110 in synchronization with this clock signal. At each clock, the data input from the input 100 is added by the adder 110 to the data held in the register 111, and the sum is stored in the register 111 again. In this way, input data is added directly to the accumulated data in order of input, and when accumulation is finished, the result is output from an output 101.
However, when floating-point data of fixed word length is accumulated as described above, the precision of the accumulated result deteriorates because of rounding errors caused by rounding down, rounding up and rounding off. In particular, if the difference between the exponent of the running sum and that of the input data is large, the number of bits needed to represent the sum exceeds the number of bits that a floating-point number can hold, and precision is lost. If many small numbers are input after a large number, these errors accumulate and the final sum is unacceptable in terms of the precision of calculation.
The above phenomenon will be described, taking as an example a case in which 257 floating-point numbers are accumulated. A floating-point number is represented in the following format. A word is divided into a sign part, a fixed-point part and a characteristic, which are stored separately. The absolute value of the number is stored in the fixed-point part, which consists of 8 bits. To simplify the description, the most significant bit (MSB) of the fixed-point part is always set to `1`, and a number whose fixed-point part has an MSB of `0` is not considered. The sign of the floating-point number is specified by the bit of the sign part. The characteristic is, for example, a signed number consisting of 5 bits, and it holds the exponent for a base of `2`. An accumulated value `sum` is obtained by calculating the following expression:
$$\mathrm{sum} = \sum_{i=1}^{257} a(i),$$
where $a(1) = 2^{-1}$ and $a(2) = a(3) = \cdots = a(257) = 2^{-9}$.
The precise result of the above calculation is 1, as follows:
$$\mathrm{sum} = 2^{-1} + 2^{-9} \times 256 = 2^{-1} + 2^{-9} \times 2^{8} = 2^{-1} + 2^{-1} = 1.$$
In contrast, when the above calculation is executed in the circuit shown in FIG. 11, the data a(i) is input in order from the input 100 and added by the adder 110. In the adder 110, the fixed-point part of the operand with the smaller characteristic is first shifted toward the least significant bit (LSB) by the difference between the two exponents so as to align the binary points of the two inputs. Bits pushed beyond the 8-bit area of the fixed-point part by this shift are truncated. Next, the two 8-bit numbers are added to obtain a 9-bit sum. The sum is then shifted so that its MSB is `1`. Therefore, when the MSB of the 9-bit sum is `1`, the sum is shifted by one bit toward the LSB, and the LSB of the 9-bit data before the shift is truncated because it falls outside the 8-bit area.
This will be described in further detail below. After the data a(1) is input from the input 100 and stored in the register 111, the data a(2) and the following data are input in order from the input 100. When the data a(2) is input, the adder 110 adds the data a(1), which is $2^{-1}$ and is held in the register 111, and the newly input data a(2), which is $2^{-9}$, according to the above addition procedure.
In the fixed-point part of $2^{-9}$, only the MSB is `1` and the remaining bits are all `0`. Because the difference between the exponent of $2^{-9}$ and that of $2^{-1}$ is 8, the fixed-point part of $2^{-9}$ is shifted by 8 bits toward the LSB before addition. However, as the fixed-point part consists of 8 bits, the `1` in the MSB is shifted out of the fixed-point part and truncated, and all bits in the fixed-point part become `0`. As described above, if the characteristics of the two operands differ greatly, the precision of the smaller operand deteriorates because of the shift operation. As a result, the first addition adds $2^{-1}$ and 0, and $2^{-1}$ is stored in the register 111.
Afterward, the data a(3) and the following data, which have the same value as the data a(2), are input in order from the input 100, but the register 111 still holds the same value $2^{-1}$ as the data a(1). The two inputs to the adder 110 are therefore the same as when the data a(1) and a(2) were added. As a result, the sum is also the same, and the value $2^{-1}$ stored in the register 111 remains unchanged.
Therefore, after the 257 pieces of data are input from the input 100 and accumulation is finished, $2^{-1}$ is output as the accumulated value from the output 101. This value is half of the precise value.
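The behaviour traced above can be reproduced with a minimal Python sketch. It assumes a toy representation that is not part of the original description: a positive number is a pair (mantissa, exponent) whose value is mantissa × 2^exponent, with an 8-bit mantissa whose MSB is `1`. Signs and the 5-bit range of the characteristic are ignored, and the names `fp_add`, `accumulate` and `to_float` are hypothetical.

```python
MANT_BITS = 8  # width of the fixed-point part in the example


def fp_add(a, b):
    """Add two positive toy floats given as (mantissa, exponent) pairs.

    Assumed encoding: value = mantissa * 2**exponent, mantissa MSB set.
    """
    (ma, ea), (mb, eb) = a, b
    if ea < eb:  # make `a` the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb >>= ea - eb          # align binary points; shifted-out bits are truncated
    total = ma + mb         # sum of two 8-bit numbers, up to 9 bits
    if total >> MANT_BITS:  # 9th bit set: shift right once, truncating the LSB
        total >>= 1
        ea += 1
    return total, ea


def to_float(x):
    """Convert a toy float to a Python float for display."""
    mantissa, exponent = x
    return mantissa * 2.0 ** exponent


def accumulate(values):
    """Accumulate in input order, as the register/adder loop of FIG. 11 does."""
    acc = values[0]
    for v in values[1:]:
        acc = fp_add(acc, v)
    return acc


HALF = (0b10000000, -8)   # 2**-1
TINY = (0b10000000, -16)  # 2**-9
data = [HALF] + [TINY] * 256

result = to_float(accumulate(data))
exact = sum(to_float(v) for v in data)
print(result, exact)  # prints "0.5 1.0"
```

Under these assumptions every addition of a `TINY` operand shifts its mantissa to zero, so the accumulated value never leaves 0.5, while the error-free sum is 1.0.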
The above phenomenon will be further examined analytically below. First, if the fixed-point part consists of p bits, a relative error er, where $|\mathit{er}| \le 2^{-p}$, occurs in one floating-point addition. If floating-point addition by the above accumulator is written as $+'$ and precise addition (addition in an ideal state in which no error occurs) is written as $+$, the addition of a and b by the accumulator is represented by the following expression:
$$a +' b = (1+\mathit{er})(a+b) = (1+\mathit{er})\,a + (1+\mathit{er})\,b.$$
That is, the result of one floating-point addition differs from the precise sum: it is the value obtained by adding a and b after each has been multiplied by the error factor (1+er).
Using the above expression, the error that arises when N numbers a1 to aN are accumulated by the accumulator shown in FIG. 11 will be analyzed below. In this case, because floating-point addition is executed (N-1) times, the following expression holds:
$$\begin{aligned}
(\cdots((a_1 +' a_2) +' a_3) +' \cdots +' a_N)
&= (1+\mathit{er})(\cdots(1+\mathit{er})((1+\mathit{er})(a_1+a_2)+a_3)+\cdots+a_N) \\
&= (1+\mathit{er})^{N-1}a_1 + (1+\mathit{er})^{N-1}a_2 + (1+\mathit{er})^{N-2}a_3 + (1+\mathit{er})^{N-3}a_4 + \cdots + (1+\mathit{er})\,a_N.
\end{aligned}$$
That is, when the above expression is viewed as a precise sum of the N numbers, a1 and a2 each carry a relative error of $(1+\mathit{er})^{N-1}-1$. As a result, the maximum relative error is $(1+2^{-p})^{N-1}-1$ and the minimum relative error is $(1-2^{-p})^{N-1}-1$.
In the above example, N = 257 and p = 8, so the maximum relative error of the first input value is $(1+2^{-8})^{256}-1 \approx 1.71$ and the minimum relative error is $(1-2^{-8})^{256}-1 \approx -0.63$. Therefore, if the values of a1 and a2 are larger than the other 255 values, the effect of the relative error upon the accumulated result is increased.
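The two bounds quoted for N = 257 and p = 8 can be checked numerically. The following short Python sketch simply evaluates the two expressions in double precision; the variable names are illustrative only.

```python
p = 8    # bits in the fixed-point part
n = 257  # number of accumulated values, i.e. n - 1 additions

eps = 2.0 ** -p  # bound on the relative error of one addition

max_err = (1 + eps) ** (n - 1) - 1  # every addition errs upward by eps
min_err = (1 - eps) ** (n - 1) - 1  # every addition errs downward by eps

print(f"{max_err:.2f} {min_err:.2f}")  # prints "1.71 -0.63"
```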
To prevent the above deterioration of precision, the techniques disclosed in Japanese Published Unexamined Patent Applications No. H1-169627, No. H4-281524 and others calculate the error of each addition with a subtracter, accumulate the errors with another adder, and add the accumulated error to the sum.
However, the above prior examples have the defect that three or four additional floating-point adders and floating-point subtracters are required, and the circuit scale is enlarged because each floating-point adder includes a barrel shifter and other components.
Also, although the prior examples in effect enhance precision by increasing the number of bits in the data, the order of addition is the same as that of the accumulation shown in FIG. 11, so precision becomes a problem, as in the case shown in FIG. 11, when the number of additions is greatly increased.
The above prior examples are equivalent to extending the bit length p of the fixed-point part to 2p in the floating-point operation. The relative error er therefore satisfies $|\mathit{er}| \le 2^{-2p}$. As a result, by the same analysis as above, the maximum relative error of the first input value is $(1+2^{-2p})^{N}-1$ and the minimum relative error is $(1-2^{-2p})^{N}-1$.
As described above, although the relative error er of each addition is reduced, the accumulated relative error becomes too large to ignore when the value of N is large.
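How quickly the bound grows with N can be illustrated with a small Python sketch. It assumes p = 8 as in the earlier example, so that the extended fixed-point part has 2p = 16 bits; the second value of N (65536) is a hypothetical figure chosen only to show the trend, and the function name `max_relative_error` is likewise illustrative.

```python
import math


def max_relative_error(bits, n):
    """Evaluate (1 + 2**-bits)**n - 1 without losing small results to rounding."""
    return math.expm1(n * math.log1p(2.0 ** -bits))


EXTENDED_BITS = 16  # 2p with the assumed p = 8

for n in (257, 65536):
    print(n, max_relative_error(EXTENDED_BITS, n))
```

With these assumed values the bound is below half a percent for N = 257 but exceeds 1.7 for N = 65536, which matches the observation that doubling the word length alone does not help once N becomes large.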
Recently, in the field of computational chemistry, a method called the ab initio molecular orbital method has been frequently used to calculate the quantum mechanical energy of a molecule. In this field, the quantity of computation is often enormous. Specifically, when computation is executed using the Hartree-Fock approximation, the energy is normally calculated using a matrix called the Fock matrix. To calculate one element of the Fock matrix, one hundred million accumulations are required in the case of a large molecule.
Therefore, although the above prior examples provide sufficient precision in image processing and similar fields, they cannot provide sufficient precision in a field in which the quantity of computation is enormous, such as the molecular orbital method.