1. Field of the Invention
The present invention relates generally to a hardware arrangement for implementing floating-point addition and/or subtraction, and more specifically to such an arrangement which enables an effective reduction of execution time of floating-point addition and/or subtraction with only a small increase in relatively simple hardware.
2. Description of the Prior Art
Floating-point arithmetic operations are more complicated than those with fixed-point numbers, take more time to execute, and require more complex hardware to implement. In most computers, arithmetic operations are implemented with normalized floating-point numbers. Therefore, all the numbers must be pre-normalized before they can be manipulated. After every intermediate computation step, a post-normalization procedure must be applied to ensure the integrity of the normalized form.
The subtraction and addition of floating-point numbers are implemented by the same arithmetic operations and accordingly, the addition will mainly be referred to throughout the instant specification.
The addition of floating-point numbers requires that the operands be "scaled" before the arithmetic operations are carried out so that they have equal exponents. Since the numbers are assumed to be initially normalized, scaling implies that the smaller number has to be shifted to the right until its exponent equals that of the larger number.
Addition is carried out by adding the two fractions, leaving the exponents untouched. When two normalized numbers are added, the result may contain an overflow digit. The overflow is corrected by shifting the sum fraction once to the right and incrementing the exponent value by 1. A similar algorithm is applicable to the subtraction of floating-point numbers.
In order to keep the precision of the floating-point number, which is determined by the fraction length, "rounding" is implemented after normalizing the sum.
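The scaling, addition and overflow-correction steps described above can be sketched as follows. This is only an illustrative model, not the hardware of any figure: it handles two positive operands, uses exact rational arithmetic from Python's `fractions` module instead of fixed-width registers, and omits rounding. The function name `add_same_sign` and its argument order are assumptions made for this sketch.

```python
from fractions import Fraction

def add_same_sign(ea, fa, eb, fb):
    """Add fa * 2**ea and fb * 2**eb, where both fractions satisfy 1 <= f < 2.

    Returns (exponent, fraction) of the normalized sum. Hypothetical helper:
    positive operands only, no rounding.
    """
    # Make (ea, fa) the operand with the larger exponent.
    if ea < eb:
        ea, fa, eb, fb = eb, fb, ea, fa
    # Scaling: shift the smaller operand right until the exponents are equal.
    fb_scaled = fb / (2 ** (ea - eb))
    # Add the fractions, leaving the exponent untouched.
    total = fa + fb_scaled
    # Overflow correction: one right shift and an exponent increment.
    if total >= 2:
        total /= 2
        ea += 1
    return ea, total

# 1.5 * 2**3 + 1.25 * 2**1 = 12 + 2.5 = 14.5 = 1.8125 * 2**3
print(add_same_sign(3, Fraction(3, 2), 1, Fraction(5, 4)))
# 1.5 + 1.5 = 3 = 1.5 * 2**1 (overflow digit corrected)
print(add_same_sign(0, Fraction(3, 2), 0, Fraction(3, 2)))
```

Exact fractions are used here so that the effect of scaling and overflow correction can be seen without any rounding error.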
According to a paper entitled "IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)" (hereinafter referred to as paper No. 1), published Aug. 12, 1985, several "rounding" modes have been proposed, which will be described later.
FIG. 1A illustrates the single precision format (32 bits) of a floating-point number, while FIG. 1B illustrates the double precision format (64 bits) of the same kind of number, both defined by the IEEE floating-point standard 754. The single- and double-precision formats use radix 2 for fractions and excess notation for exponents.
Each of the formats shown in FIGS. 1A, 1B starts with a sign bit (s) for the number as a whole, 0 being positive and 1 being negative. Next comes the exponent (E), using excess 127 for single precision and excess 1023 for double precision. A fraction (F) of 23 bits follows for the single precision format, and a fraction (F) of 52 bits for the double precision format. In FIGS. 1A, 1B, LSB is an abbreviation for Least-Significant Bit.
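As an illustration of the two formats, the following sketch splits a number into its sign bit (s), exponent (E) and fraction (F) fields using Python's standard `struct` module. The function names `unpack_double` and `unpack_single` are assumptions chosen for this example.

```python
import struct
from typing import Tuple

def unpack_double(x: float) -> Tuple[int, int, int]:
    """Return (s, E, F) of x in the 64-bit format: 1 + 11 + 52 bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # excess-1023 exponent
    fraction = bits & ((1 << 52) - 1)      # 52-bit fraction, integer bit not stored
    return sign, exponent, fraction

def unpack_single(x: float) -> Tuple[int, int, int]:
    """Return (s, E, F) of x in the 32-bit format: 1 + 8 + 23 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF         # excess-127 exponent
    fraction = bits & ((1 << 23) - 1)      # 23-bit fraction, integer bit not stored
    return sign, exponent, fraction

# -2.5 = -1.25 * 2**1, so s = 1, E = 1 + 1023 and F holds binary 0.01
print(unpack_double(-2.5))
# 0.15625 = 1.25 * 2**-3, so s = 0, E = -3 + 127
print(unpack_single(0.15625))
```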
A general arithmetic algorithm for floating-point addition will briefly be described.
It is assumed that: (a) the two floating-point numbers to be added are denoted by A and B as shown below, (b) $|A| \geq |B|$, and (c) the sum is denoted by C.
$$A = (-1)^{Sa} \times 2^{(Ea-\text{bias})} \times Fa$$
$$B = (-1)^{Sb} \times 2^{(Eb-\text{bias})} \times Fb \qquad (1)$$
where:
Sa, Sb: sign bit;
Ea, Eb: biased exponent; and
Fa, Fb: fraction ($1.0 \leq Fa, Fb < 2.0$).
Then,
$$C = A + B = (-1)^{Sa} \times 2^{(Ea-\text{bias})} \times Ft$$
where:
$$Ft = Fa + (-1)^{(Sa+Sb)} \times 2^{-(Ea-Eb)} \times Fb \qquad (2)$$
In equation (2), the term $2^{-(Ea-Eb)}$ implies that the two operands A and B are scaled before addition.
Since $0.0 \leq Ft < 4.0$, normalization is necessary for Ft.
It is understood that if Ft=0.0 then C=0.0.
Thus, defining Ft by $2^{-k} \leq Ft < 2^{-k+1}$ (k is an integer satisfying $-1 \leq k$),
$$C = (-1)^{Sa} \times 2^{(Ea-\text{bias}-k)} \times (Ft \times 2^{k}) \qquad (3)$$
Then, the sum C in equation (3) undergoes rounding.
As discussed in a paper entitled "A Proposed Standard for Binary Floating-Point Arithmetic" (copyright 1981, IEEE), reprinted with permission from COMPUTER, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720 (hereinafter referred to as paper No. 2), the rounding defined by the above-mentioned IEEE floating-point standard 754 requires three additional bits (viz., guard, round and sticky bits) which are positioned to the right of the least-significant bit of the fraction F as shown in FIG. 2.
As mentioned in paper No. 2, the 3 bits affixed to the LSB ensure accurate unbiased rounding of computed results to within half a unit in the least-significant bit. Two bits are required for perfect rounding: the guard bit is the first bit beyond rounding precision, and the sticky bit is the logical OR of all bits thereafter. To accommodate post-normalization in some operations, the round bit is kept beyond the guard bit, and the sticky bit is then the logical OR of all bits beyond the round bit.
It is assumed that:
(a) C is represented by $(-1)^{Sc} \times 2^{(Ec-\text{bias})} \times Fc$; and
(b) the fraction $(Ft \times 2^{k})$ in equation (3) is rounded to an n-bit fraction, then ##EQU2##
However, in the case where $Fd = 2.0 - 2^{-(n-1)}$ and a round-up is carried out, then Fc=2.0. Accordingly,
$$Sc = Sa, \quad Ec = Ea - k + 1, \quad Fc = 1.0$$
In the above-mentioned algorithm of floating-point addition, the absolute values of the numbers A and B are compared to determine which value is larger. However, as disclosed in a paper entitled "The IBM System/360 Model 91: Floating-Point Execution Unit" (IBM Journal, January 1967) (hereinafter referred to as paper No. 3), a comparison between the exponents Ea and Eb is generally used rather than one between $|A|$ and $|B|$.
Before turning to the present invention, a known arrangement for floating-point addition (or subtraction) will be described with reference to FIG. 3. The FIG. 3 arrangement is configured based on the IEEE floating-point standard 754.
Merely for the convenience of description, it is assumed that:
(a) the two floating-point numbers A and B, as indicated in equation (1), are applied to the FIG. 3 arrangement 10;
(b) each of the numbers A, B takes the form of the double precision format (64-bit) and hence consists of a one-bit sign, an 11-bit exponent and a 52-bit fraction; and
(c) the exponent Ea is larger than Eb.
In FIG. 3, each of the parenthesized numbers indicates a bit length. The sign bits Sa, Sb of the numbers A, B are inputted to an exclusive-OR gate 12 and a selector 14, while the exponents Ea, Eb of the numbers A, B are applied to a subtractor 16 and a selector 18. Further, the fractions Fa, Fb each have the integer bit "1" prepended and are then applied to selectors 20, 22, respectively. The subtractor 16 calculates $|Ea-Eb|$, applies the result thereof to a barrel shifter 24, and outputs a signal 26 indicating which of Ea and Eb is larger. The signal 26 is inputted to the selectors 14, 18, 20 and 22. As mentioned above, if Ea is larger than Eb, the selectors 14, 18 and 22 select Sa, Ea and Fa respectively, while the selector 20 selects Fb. (If Ea is smaller than Eb, the selectors 14, 18, 20 and 22 each select the other input.)
The barrel shifter 24 is supplied with the fraction Fb (53-bit) from the selector 20, in that the exponent Ea has been assumed to be larger than the exponent Eb, and then adds three bits to the right of the LSB of the fraction Fb applied. The three bits thus added in the barrel shifter 24 are a guard bit (G), a round bit (R) and a sticky bit (S) as referred to in connection with FIG. 2. Thus, the output of the barrel shifter 24 has a 56-bit length. The barrel shifter 24 shifts the output of the selector 20 to the right by the value derived from the subtractor 16 (viz., (Ea-Eb)) so that both operands are scaled to the same exponent. In order to conform to the bit length of the output of the barrel shifter 24, three zero bits (viz., 000 as illustrated) are added to the fraction Fa outputted from the selector 22 before it is applied to an adder/subtractor 28.
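The alignment shift with guard, round and sticky bits can be modelled in software as follows. This is a sketch under stated assumptions, not a description of the circuit of the barrel shifter 24: the function name `align` and the `width` parameter are hypothetical, and the sticky bit is formed as the logical OR of every bit shifted out beyond the round position.

```python
def align(significand: int, shift: int, width: int = 53) -> int:
    """Right-shift a `width`-bit significand by `shift` places.

    Returns a (width + 3)-bit value laid out as
    [significand | guard | round | sticky].
    """
    extended = significand << 3                # append G, R, S = 000
    lost = extended & ((1 << shift) - 1)       # bits that fall off the right end
    shifted = extended >> shift
    # Any lost 1 is folded into the sticky position (the lowest bit).
    return shifted | (1 if lost else 0)

# 4-bit example: 1011 shifted right by 2 gives 0010 with G=1, R=1, S=0
print(format(align(0b1011, 2, width=4), "07b"))
# 1011 shifted right by 5 gives 0000 with G=0, R=1, S=1
print(format(align(0b1011, 5, width=4), "07b"))
```

A small `width` is used in the example only to keep the bit patterns readable; the default of 53 corresponds to the 53-bit input described above.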
It is assumed that the two inputs to the adder/subtractor 28 are denoted Fs and Fl as shown. In the event that the output 13 of the exclusive-OR gate 12 assumes 0 (viz., in the case of Sa=Sb), the adder/subtractor 28 implements addition (viz., Fs+Fl). On the other hand, if the gate 12 issues the output 13 assuming 1 (in the case where Sa does not equal Sb), the adder/subtractor 28 produces an output representing the result of $(|Fl|-|Fs|)$. The adder/subtractor 28 adds one bit to the left of the most significant bit (MSB) thereof, and thus issues a 57-bit output. The adder/subtractor 28 further issues an output 30 to be applied to another exclusive-OR gate 32. The output 30 assumes 1 if the output 13 is 1 and simultaneously Fa<Fb. Otherwise, the output 30 assumes 0.
A barrel shifter 34 is supplied with the output of the adder/subtractor 28 (57-bit). On the other hand, a leading zero detector 36 receives the output of the adder/subtractor 28 without its LSB (56-bit) and counts the number of leading zeros. The detector 36 supplies the barrel shifter 34 with the output thereof (denoted by numeral 37) for normalization, which assumes a value ranging from -1 to 53. Thus, the barrel shifter 34 shifts the output applied from the adder/subtractor 28 to the left by the number of bits indicated by the output 37, thus implementing normalization. It should be noted that the lowest bit of the output from the adder/subtractor 28 remains at the position thereof irrespective of the bit shifting to the left at the barrel shifter 34. Further, a subtractor 38 receives the output 37 from the detector 36 and subtracts the output 37 from the exponent Ea in this particular case. Accordingly, the exponent Ea is adjusted in order to conform to the normalization carried out at the barrel shifter 34. Since the integer bit of the normalized fraction always assumes 1, the output of the barrel shifter 34 includes the fraction only.
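The normalization step can be sketched as below. Several points are assumptions made for illustration only: the adder output is taken to be laid out as [carry | integer bit | fraction | G | R | S], the function name `normalize` is hypothetical, the integer bit is kept in the result rather than dropped, and on a one-place right shift (shift count -1) the bit shifted out is ORed into the lowest bit.

```python
def normalize(total: int, exponent: int, width: int = 57):
    """Normalize a `width`-bit adder output and adjust the exponent.

    Returns (normalized value, adjusted exponent, shift count), where a
    shift count of -1 means one place to the right.
    """
    sticky = total & 1            # lowest bit keeps its position
    body = total >> 1             # the bits inspected for leading zeros
    if body == 0:
        return total, exponent, 0           # nothing to normalize
    # The integer bit sits just below the carry bit, so a body whose
    # leading 1 is already there has bit_length == width - 2.
    shift = (width - 2) - body.bit_length()
    if shift >= 0:
        body <<= shift
    else:
        sticky |= body & 1        # keep the bit lost by the right shift
        body >>= 1
    return (body << 1) | sticky, exponent - shift, shift

# 8-bit example (3 fraction bits): two leading zeros below the carry bit
print(normalize(0b00010111, 10, width=8))
# Carry bit set: shift count is -1 and the exponent is incremented
print(normalize(0b11010010, 10, width=8))
```

Subtracting the shift count from the exponent mirrors the role described for the subtractor 38.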
A rounding decoder 40 receives the least significant 4 bits of the normalized fraction from the barrel shifter 34, wherein the 4 bits are the LSB of the fraction, the guard bit (G), the round bit (R) and the sticky bit (S) as best seen from FIG. 2. The rounding decoder 40 decodes the 4 bits applied thereto according to the default rounding mode which is defined by the IEEE floating-point standard 754, and issues an output 41. In more specific terms, this decoding is implemented according to the following Table 1.
TABLE 1
______________________________________
LSB   G   OR(R,S)   Output of decoder 40
______________________________________
x     0   0         0
x     0   1         0
0     1   0         0
1     1   0         1
x     1   1         1
______________________________________
where G is a guard bit, and OR(R,S) implies the result of OR function of round and sticky bits (R, S).
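Table 1 reduces to a single Boolean expression, G AND (OR(R,S) OR LSB), which is the round-to-nearest-even rule. The sketch below checks that expression and applies it to a value carrying the three extra bits. The function names `rounding_decoder` and `round_nearest_even` are assumptions for this example, and the carry out of the fraction that an increment may cause is not handled here.

```python
def rounding_decoder(lsb: int, g: int, r: int, s: int) -> int:
    """Return 1 when the fraction must be incremented, following Table 1."""
    return g & ((r | s) | lsb)

def round_nearest_even(bits: int) -> int:
    """`bits` is a significand followed by G, R, S; return the rounded significand."""
    significand = bits >> 3
    increment = rounding_decoder(
        significand & 1, (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    )
    return significand + increment

# Tie with an odd LSB rounds up to the even neighbour: 1011.100 -> 1100
print(bin(round_nearest_even(0b1011100)))
# Tie with an even LSB stays where it is: 1010.100 -> 1010
print(bin(round_nearest_even(0b1010100)))
```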
An incrementer 42 receives the output 35 of the barrel shifter 34 (viz., Fd-1.0), the output 39 of the subtractor 38 (viz., Ec) and the output 41 of the rounding decoder 40. If the output 41 assumes 1, the incrementer 42 issues a 64-bit output 43 plus $2^{-52}$. The 64-bit output 43 is:
(output 33 (=Sc)) + (output 39 (=Ec)) + (output 11 (=Fc))
On the other hand, if the output 41 assumes 0, the output 43 is:
(output 33 (=Sc)) + (output 39 (=Ec)) + (output 11 (=Fc))
The sign bit Sc is generated from the gate 32. In most cases the sign bit Sc equals the output of the selector 14. However, in the event that Ea=Eb and Fa<Fb, the number A is treated as the larger operand because Eb is not larger than Ea, even though Fa<Fb, and the sign selected by the selector 14 must then be inverted. Accordingly, the gate 32 is provided, to which the signal 30 is applied as shown in FIG. 3.
The operation time in the FIG. 3 arrangement will be discussed in the following. A critical path for executing the above-mentioned arithmetic operations is shown in FIG. 4. The operation times required at the blocks of the critical path shown are listed below.
______________________________________
Subtractor 16               15 ns
Selectors 20, 22             5 ns
Barrel shifter 24           30 ns
Adder/subtractor 28         30 ns
Leading zero detector 36    30 ns
Barrel shifter 34           30 ns
Rounding decoder 40          5 ns
Incrementer 42              35 ns
Total                      180 ns
______________________________________
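As a simple check of the figures listed above, the delays along the critical path can be summed as follows. The dictionary keys are labels chosen for this sketch; the values are the operation times given in the list.

```python
# Operation times along the critical path, in nanoseconds.
delays_ns = {
    "subtractor 16": 15,
    "selectors 20, 22": 5,
    "barrel shifter (alignment)": 30,
    "adder/subtractor 28": 30,
    "leading zero detector 36": 30,
    "barrel shifter 34": 30,
    "rounding decoder 40": 5,
    "incrementer 42": 35,
}

total_ns = sum(delays_ns.values())
print(total_ns)
```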
The known arrangement shown in FIG. 3 has encountered the problem that the total time of the operation along the critical path (FIG. 4) is undesirably lengthy. This is because carry propagation over a long bit length is required at both the adder/subtractor 28 and the incrementer 42. Further, the incrementer 42 is unable to initiate its operation until the barrel shifter 34 completes the operation thereof. Still further, the two barrel shifting stages increase the overall execution time.