Arithmetic processing circuitry for binary numbers as known in the art typically employs floating point arithmetic in accordance with the IEEE 754 binary format, or with the Hex-Extended format standard as implemented for example in IBM S/390 floating point processing circuitry. Floating point arithmetic, used in addition, multiplication, and division, first normalizes the binary numbers to be added, multiplied, or divided by shifting the binary numbers until, for a positive number, the first non-zero digit (i.e., 1) is immediately to the left of the radix point such that the mantissa part of the binary numbers is greater than or equal to 1 and less than 2. A negative binary number will have leading ones. Thus, to normalize a negative number, the number is shifted so that the first zero is immediately to the left of the radix point.
For multiplication, the normalized binary numbers are then multiplied and their exponents are added. For division, the normalized binary numbers are divided and their exponents are subtracted. For addition and subtraction, the normalized numbers are shifted (i.e., aligned) so that their exponents are equal, then the numbers are added or subtracted, respectively.
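The normalize, multiply, and add rules above can be illustrated with a minimal sketch. It assumes a simplified binary model that is not taken from the cited architectures: positive values only, an integer mantissa of a hypothetical width FRAC_BITS, a value equal to mant * 2**exp, and truncation of any bits shifted out. The names normalize, fp_mul, fp_add, and to_value are illustrative.

```python
FRAC_BITS = 8  # hypothetical mantissa width, kept small for readability


def normalize(mant, exp):
    """Return (mant, exp) with the leading 1 of mant at bit FRAC_BITS - 1."""
    if mant == 0:
        return 0, 0
    while mant >= (1 << FRAC_BITS):        # overflow: shift right, raise exponent
        mant >>= 1
        exp += 1
    while mant < (1 << (FRAC_BITS - 1)):   # leading zeros: shift left, lower exponent
        mant <<= 1
        exp -= 1
    return mant, exp


def fp_mul(a, b):
    """Multiply the mantissas, add the exponents, then renormalize."""
    return normalize(a[0] * b[0], a[1] + b[1])


def fp_add(a, b):
    """Align the operand with the smaller exponent, add, then renormalize."""
    if a[1] > b[1]:
        a, b = b, a                        # ensure a has the smaller exponent
    aligned = a[0] >> (b[1] - a[1])        # bits shifted out are truncated
    return normalize(aligned + b[0], b[1])


def to_value(x):
    """Convert a (mant, exp) pair back to a Python float for inspection."""
    return x[0] * 2.0 ** x[1]


three = normalize(3, 0)
five = normalize(5, 0)
product = fp_mul(three, five)
total = fp_add(three, five)
```

In this model the alignment shift in fp_add is the only step that depends on the exponent difference, which is why the aligner width matters for the extended operations discussed below.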
A fused ADD/multiply (FPU) circuit of the above type and operation is disclosed in U.S. Pat. No. 5,993,051, titled “COMBINED LEADING ONE AND LEADING ZERO ANTICIPATION.” An ADD circuit that can be used for instructions operating on operands of either regular length (e.g., 64 bits) or extended length (e.g., 128 bits) is found, for example, in IBM S/390 and z/Series computer systems produced in 1999 or later, and is described in associated documentation such as “z/Architecture Principles of Operation,” International Business Machines Publication No. SA22-7832-00 (First Edition, December 2000), also in the form of fused Multiply/Add circuitry. The regular use of, e.g., a 64-bit dataflow is referred to herein as a “narrow” dataflow, as seen from the perspective of a “longer” 128-bit operand.
The MULTIPLY AND ADD instruction is described in “z/Architecture Principles of Operation,” International Business Machines Publication No. SA22-7832-00 (First Edition, December 2000), chapter 19.
The S/390 hardware processor architecture requires a so-called “extended” add (and subtract) operation, in which the operands of an instruction have a mantissa of 112 bits, while the fraction dataflow is only 56 bits wide for the input registers and for the aligner unit (being optimized for ‘long’ instructions).
Therein, a sum S is calculated from operands A and B:
S = A + B,
where the mantissa of the floating point number having the smaller exponent is aligned according to the exponent difference between the bigger and the smaller exponent of the floating-point number. Within a “narrow” fraction dataflow optimized for ‘long’ operands, an extended (or quad precision) operand is divided into two (or respectively more) parts, and a respective number of suboperations are performed, to calculate the result sum.
An example is given as follows:
Definitions:
Exp(A) ≤ Exp(B)
A = Ahigh + Alow
B = Bhigh + Blow
Aaligned-high = aligned(Ahigh)
Aaligned-low = aligned(A), i.e., Aaligned-low = aligned(Ahigh) + aligned(Alow)
Shigh = high part of raw sum
Slow = low part of raw sum
The following suboperations are performed in prior art, as given by the IBM S/390 architecture:
1. Calculate Exponent difference: the difference determines a shift amount for further alignment steps;
2. Align Ahigh (shift right) by amount of exponent difference which results in Aaligned-high, i.e. the operand A's mantissa is in the range of Bhigh;
3. Align Alow to the range of Blow and save the result as Alow-aligned-low;
4. Align Ahigh to the range of Blow and save the result as Ahigh-aligned-low;
5. Build Aaligned-low by concatenating Alow-aligned-low and Ahigh-aligned-low;
6. Add Aaligned-low and Blow to get Sraw-low, and save the carry_out;
7. Add Aaligned-high, Bhigh, and the saved carry_out to get Sraw-high;
8. Normalize: build the final normalized sum Shigh and Slow out of Sraw-high and Sraw-low (several cases must be distinguished, since Sraw can have leading zeros).
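The eight suboperations can be traced in a short sketch. It is an illustration under stated assumptions, not the S/390 implementation: operands are positive, the exponent difference is counted in bits rather than hexadecimal digits, bits shifted out during alignment are truncated, and step 8 is done on a recombined wide integer for brevity, whereas a real narrow dataflow would handle the normalization cases on 56-bit pieces. The names HALF, MASK, split, and extended_add are hypothetical.

```python
HALF = 56                  # width of the narrow fraction dataflow, per the text
MASK = (1 << HALF) - 1


def split(mant):
    """Split a 112-bit mantissa into its (high, low) 56-bit parts."""
    return mant >> HALF, mant & MASK


def extended_add(a_mant, a_exp, b_mant, b_exp):
    """Add two positive 112-bit mantissas; requires a_exp <= b_exp."""
    assert a_exp <= b_exp
    a_high, a_low = split(a_mant)
    b_high, b_low = split(b_mant)

    shift = b_exp - a_exp                                 # step 1
    a_aligned_high = a_high >> shift                      # step 2
    a_low_aligned_low = a_low >> shift                    # step 3
    # Step 4: the bits of Ahigh that land in the low half after the shift.
    a_high_aligned_low = ((a_high << HALF) >> shift) & MASK
    # Step 5: the two pieces occupy disjoint bit positions, so OR concatenates.
    a_aligned_low = a_high_aligned_low | a_low_aligned_low

    raw_low = a_aligned_low + b_low                       # step 6
    carry_out = raw_low >> HALF
    s_raw_low = raw_low & MASK
    s_raw_high = a_aligned_high + b_high + carry_out      # step 7

    # Step 8 (simplified): renormalize the raw sum to 112 bits.
    total = (s_raw_high << HALF) | s_raw_low
    exp = b_exp
    if total == 0:
        return 0, 0, exp
    while total >> (2 * HALF):                # carry out of bit 111
        total >>= 1
        exp += 1
    while not total >> (2 * HALF - 1):        # leading zeros
        total <<= 1
        exp -= 1
    s_high, s_low = split(total)
    return s_high, s_low, exp
```

Steps 2 through 5 each pass through the aligner separately, which gives a sense of why the sequence needs many cycles when the aligner is only half as wide as the operand.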
This operation typically takes 13 cycles in prior art implementations, such as specified in the above IBM S/390 architecture.
The disadvantages of prior art implementations such as the IBM S/390 architecture are the high cycle count of 13 cycles and the complexity of the control logic required to perform the above suboperations, which stems from the complexity of the respective normalization procedure.
For the foregoing reasons, therefore, there is a need in the art for an improved floating point adder unit and corresponding method for extended floating point ADD operations in a “narrow” dataflow.