This invention relates to processing systems which operate on floating point numbers. More specifically, the invention relates to an efficient mechanism for performing accurate mathematical rounding of such numbers.
In digital computing systems, various types of numbers are electronically represented using the binary numbering system. Floating point numbers, such as -1.73491*10^-13, are typically represented in binary using either a VAX or an IEEE floating point standardized format. In either standard, the floating point number is represented as a group of bits divided into three bit fields: a sign bit field, an exponent bit field and a fraction bit field. The sign bit field represents the sign (negative in the above example) of the subject floating point number. The fraction bit field represents the digits surrounding and including the decimal point (i.e., 1.73491 in the above example). Finally, the exponent bit field (e.g., -13 in the above example) represents the multiplier of ten which indicates how many places and in which direction to shift the decimal point in the fraction part of the subject floating point number if it were to be expressed in typical decimal format.
Depending upon the standard in use, there are particular required formats used to represent the fraction and exponent bit fields. In the IEEE standard for normal numbers, the decimal point in the fraction bit field is always assumed to be located just to the right of the most significant bit position. For example, if there are 23 bits in the fraction bit field having bit positions ranging from 0 (rightmost bit) to 22 (leftmost and most significant bit), the decimal point is always assumed to be located between bit positions 22 and 21. In the VAX standard, the decimal point in the fraction bit field is always assumed to be located just to the left of the most significant bit position (to the left of bit position 22 in the above example). Also, in both the VAX and IEEE standards, a normal fraction value is always stored in a normalized state. A “normalized” fraction bit field always has the most significant non-zero bit located in the most significant (leftmost) bit position.
All exponents use an excess format: the exponent value is calculated by taking the unsigned value of the exponent bit field and subtracting a bias to produce the true exponent value. A bit field value of 1 represents the most negative true exponent, a bit field value of all ones represents the most positive true exponent, and the bit field value half way between 1 and all ones represents a true exponent value of zero.
The number of bits in the fraction bit field and the number of bits in the exponent bit field determine the precision and range (i.e., the number of significant digits and the maximum and minimum floating point numerical values representable) of a particular floating point format. Both the VAX and IEEE standards provide for single and double precision floating point numbers. Double precision floating point numbers use about twice as many bits for their fraction fields as single precision floating point numbers. A typical single precision floating point number requires a total of 32 bits to store the sign, fraction and exponent fields, while a typical double precision value requires a total of 64 bits for storage.
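By way of illustration only, the three bit fields of a 32-bit single precision value can be separated in software as sketched below. The sketch assumes the widely documented IEEE 754 single precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 stored fraction bits, binary rather than decimal exponent); these specific widths and the function name `decode_ieee_single` are assumptions of the example and are not taken from the text above.

```python
import struct


def decode_ieee_single(value):
    """Split a number, stored as IEEE single precision, into its bit fields.

    Assumed layout (not stated in the text): 1 sign bit, 8 exponent bits
    with a bias of 127, and 23 stored fraction bits.
    """
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                      # 1 means negative
    exponent_field = (bits >> 23) & 0xFF   # unsigned, biased exponent
    fraction_field = bits & 0x7FFFFF       # stored fraction bits
    true_exponent = exponent_field - 127   # remove the bias (normal numbers only)
    return sign, exponent_field, fraction_field, true_exponent


print(decode_ieee_single(1.0))
print(decode_ieee_single(-2.5))
```

The subtraction of the bias in the last step corresponds to the excess format described above; the special encodings (zero, denormal values, infinities) are deliberately not handled in this sketch.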
Various steps must be performed to add two floating point numbers using prior art floating point addition circuits. Before addition can take place, the exponent of the smaller magnitude operand must be adjusted so that it is equal to the exponent of the larger magnitude operand. This is accomplished by incrementing the smaller magnitude operand's exponent while shifting that operand's fraction appropriately such that the value of the combined fraction and exponent is maintained. As an example, suppose the first and second operands are +0.1234*10^5 and +0.5678*10^7 respectively. To perform the adjustment, the floating point processor adds two to the smaller exponent, i.e., the first operand's exponent (10^5), to equate it with the exponent of the second operand (10^7). To maintain the proper value for the smaller magnitude operand, its fraction must be shifted by two decimal places. The combined fraction and exponent becomes +0.001234*10^7 for the adjusted (first) operand.
After the alignment and shift steps are complete, the fraction bit fields (i.e., the fractional values) of the two operands are added in an addition step to produce a result reflecting the sum of the fractions of the operands. In this example, after the addition is complete the resultant sum is +0.569034*10^7. In some instances, depending upon the value of the resultant sum, the sum may then need to be normalized so that its most significant digit is in the proper decimal position for the resultant format. Normalization is not needed in the above case.
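The alignment and addition steps of this decimal example can be reproduced with the short sketch below. It is a software illustration only, using Python's `decimal` module in place of hardware; the function name `align_and_add` and its argument order are hypothetical.

```python
from decimal import Decimal


def align_and_add(frac_a, exp_a, frac_b, exp_b):
    """Add frac_a*10**exp_a and frac_b*10**exp_b after aligning exponents."""
    # Make operand "a" the one with the larger exponent.
    if exp_a < exp_b:
        frac_a, exp_a, frac_b, exp_b = frac_b, exp_b, frac_a, exp_a
    shift = exp_a - exp_b
    # Raising the smaller exponent by `shift` moves that operand's decimal
    # point `shift` places to the left, so its value is unchanged.
    aligned_b = frac_b.scaleb(-shift)
    return frac_a + aligned_b, exp_a


fraction, exponent = align_and_add(Decimal("0.1234"), 5, Decimal("0.5678"), 7)
print(fraction, exponent)  # 0.569034 7
```

The returned fraction still carries six digits; reducing it to the precision of the format is the job of the rounding step.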
Furthermore, the resultant sum may also exceed the overall precision that can be represented by the floating point standard in use. For example, if the fraction bit field format only has enough bits to represent a precision of four decimal digits to the right of the decimal point, the example resultant fraction value 0.569034 exceeds the allowable precision by two digits. If the precision is exceeded, a rounding step is used to round the fraction up or down to fit within the maximum number of bits allocated for the fraction bit field.
In the VAX floating point standard, there are two rounding modes that can be used, and in the IEEE floating point standard there are four rounding modes that can be used to accomplish the rounding step.
In the IEEE standard, the first rounding mode is called “Round to Nearest Even” (RNE) and rounds values up in magnitude if they are more than half way between two representable results. Values that are exactly half way between two representable results are rounded to a final result that has a least significant fraction bit equal to zero, thus making the result even. Values that are less than halfway between two representable results are rounded down in magnitude (or truncated).
The second and third IEEE rounding modes are called “Round Toward Positive Infinity” (RTPI) and “Round Toward Negative Infinity” (RTNI). In the RTPI rounding mode, values that are between two representable results are rounded up for positive results and down in magnitude for negative results. In the RTNI rounding mode, values that are between two representable results are rounded up in magnitude for negative results and down for positive results.
The fourth IEEE rounding mode is called “Chopped” and rounds all results existing between two representable results down in magnitude by chopping off or eliminating any digits extending beyond the precision (i.e., number of decimal places) allowed.
In the VAX floating point standard, there are only two rounding modes: “Normal Rounding” and “Chopped.” In Normal Rounding, values that are more than or exactly half way between two representable results are rounded up in magnitude. Values that are less than halfway between two representable values are rounded down in magnitude. The Chopped rounding mode in the VAX standard is the same as the IEEE standard and rounds results down in magnitude by chopping off or truncating any bits below the available precision.
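The five rounding behaviors described above can be summarized as a decision on whether to round a magnitude up. The sketch below operates on an unsigned integer magnitude that carries some number of extra low order bits beyond the available precision. The function name, the mode strings (`"rne"`, `"rtpi"`, `"rtni"`, `"chopped"`, `"vax_normal"`) and the integer representation are assumptions made for illustration.

```python
def round_magnitude(mag, extra_bits, mode, negative=False):
    """Round an unsigned magnitude that has `extra_bits` (>= 1) excess low bits.

    Returns the kept bits, incremented by one when the mode rounds up in
    magnitude. Renormalization after a carry is not handled here.
    """
    kept = mag >> extra_bits                   # bits that fit the format
    rem = mag & ((1 << extra_bits) - 1)        # bits below the precision
    half = 1 << (extra_bits - 1)               # exactly half way
    if rem == 0 or mode == "chopped":
        return kept                            # exact, or simply truncated
    if mode == "rne":
        # Above half rounds up; an exact tie rounds up only if kept is odd.
        up = rem > half or (rem == half and kept & 1 == 1)
    elif mode == "rtpi":
        up = not negative                      # positive results grow
    elif mode == "rtni":
        up = negative                          # negative results grow in magnitude
    elif mode == "vax_normal":
        up = rem >= half                       # ties round up in magnitude
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return kept + 1 if up else kept


# An exact tie (remainder 100 binary) below an even kept value of 1010 binary:
print(round_magnitude(0b1010_100, 3, "rne"))         # 10, stays even
print(round_magnitude(0b1010_100, 3, "vax_normal"))  # 11, tie rounds up
```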
Except for the Chopped rounding mode, all rounding modes are accomplished by conditionally incrementing the infinitely precise normalized initial sum at an appropriate bit position, re-normalizing if necessary, and then truncating all bits below the least significant bit position. After the initial normalized sum is computed, the rounding mode in effect determines a specific bit position in the sum at which to increment the result in order to create a fraction bit pattern representing a correctly rounded fraction value. The round increment may cause a carry bit to be propagated to the more significant bit positions in the sum. If the carry due to round increment causes the fraction value to exceed the allowed fraction magnitude, then the fraction must be re-normalized by shifting down in magnitude and the exponent needs to be incremented by one. After incrementing and re-normalizing, the final result is obtained by truncating at the least significant bit position.
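The increment, re-normalize and truncate sequence may be sketched as follows. The sketch assumes the sum is already normalized to `width` bits, of which the lowest `guard_bits` lie below the available precision, and that the caller has already chosen the increment value according to the rounding mode. All names are hypothetical.

```python
def increment_round(sum_bits, width, guard_bits, increment):
    """Round by adding `increment`, re-normalizing, then truncating.

    `sum_bits` is assumed normalized (most significant 1 at bit width-1).
    Returns (fraction, exponent_adjust).
    """
    total = sum_bits + increment
    exponent_adjust = 0
    if total >> width:            # carry propagated past the top bit
        total >>= 1               # shift down in magnitude...
        exponent_adjust = 1       # ...and increment the exponent by one
    return total >> guard_bits, exponent_adjust   # truncate the low bits


# 1111.110 binary plus a half-unit increment carries out and is re-normalized.
print(increment_round(0b1111_110, 7, 3, 0b100))  # (8, 1)
# 1010.011 binary is below half way, so the increment does not change the kept bits.
print(increment_round(0b1010_011, 7, 3, 0b100))  # (10, 0)
```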
In summary, prior art floating point processors that provide mathematical operations need a final addition and rounding function which requires the following steps: 1) Add, 2) Adjust/Normalize, 3) Round, and 4) Adjust/Re-normalize. A floating point adder that performs these steps in one operation is called a rounding adder and is typically implemented in a floating point unit as circuitry within a microprocessor.
Note that the earlier steps always provide prealigned operands to the addition step (Step 1). The resultant sum produced by the add therefore always contains a leading non-zero digit (i.e., the most significant bit or MSB) that is guaranteed to be in one of two possible bit positions; either properly normalized or needing a one bit shift to be properly normalized. So, the normalization in step 2 may or may not require a single bit shift, depending on the bit position of the MSB in the sum result from step 1. If the shift is needed, every bit in the fraction bit field is shifted. However, whether a shift is required or not is problematic for combining steps 1 through 4 in one operation because the round increment in step 3 requires the shift result from step 2 which requires the sum from step 1. At the start of step 1, only the two operands and a round increment value are known. As such, upon initial receipt of these three inputs, prior art rounding adders make it difficult to determine the bit position where the round increment bit will be needed for the rounding operation that occurs in Step 3. This is due to the fact that a shift operation may or may not be needed in Step 2. In other words, the proper bit position required for the round increment bit is unknown at the start because it is not known until after the addition (Step 1) if the adjust step (Step 2) will be needed.
Prior art implementations of rounding adders handle the uncertainty of the round increment bit position by using three separate addition circuits. One circuit performs addition without any round increment bit and computes a first result. The second circuit accepts the round increment bit at a low round increment bit position and computes a second result. Finally, a third circuit accepts a round increment bit at a high round increment bit position and computes a third result. In essence, three separate addition operations are performed using separate circuits. After all three results are obtained the correct result is selected (from the second and third results) based upon the most significant bit that exists in the first result.
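A behavioral sketch of this three-addition arrangement is given below. It models arithmetic only, not circuitry, and assumes VAX style Normal Rounding (ties round up in magnitude) so that the round increment is simply a one at the half-unit position. The names `width` (bits in a normalized sum) and `guard_bits` (bits below the available precision) are assumptions of the example.

```python
def prior_art_round_add(a, b, width, guard_bits):
    """Three additions; the no-increment sum selects between the other two.

    Operands are assumed pre-aligned, with a + b below 2**(width + 1).
    Returns (fraction, exponent_adjust).
    """
    low_increment = 1 << (guard_bits - 1)   # used if the sum is already normalized
    high_increment = 1 << guard_bits        # used if the sum needs a one bit shift
    no_inc_sum = a + b                      # first result
    low_sum = a + b + low_increment         # second result
    high_sum = a + b + high_increment       # third result
    if no_inc_sum >> width:                 # MSB sits one position too high
        fraction, exponent_adjust = high_sum >> (guard_bits + 1), 1
    else:
        fraction, exponent_adjust = low_sum >> guard_bits, 0
    if fraction >> (width - guard_bits):    # rounding carried out: re-normalize
        fraction >>= 1
        exponent_adjust += 1
    return fraction, exponent_adjust


print(prior_art_round_add(0b1011000, 0b0010110, 7, 3))  # (14, 0)
print(prior_art_round_add(0b1111000, 0b1000100, 7, 3))  # (12, 1)
```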
The second and third rounding adder circuits must add two operands (i.e., bit strings) in conjunction with a round increment bit injected into the addition operation at specific high and low round increment bit positions. One problem encountered in performing these additions is that the round increment bit may need to be added to a bit position which already needs to add two operand bits plus a carry in from a lower bit position. The four bits required to be added in this one bit position cannot be represented as a simple resultant sum and carry out.
To avoid extra circuitry required to ensure that rounding bits are properly carried and propagated, a series of half adders are used to receive the bits of the operands. A single half adder accepts two bits and produces a sum bit and a carry output bit. The carry output bit is used as an input to the next more significant bit position. An example will best explain how the addition of a half adder assists the addition circuitry used in the prior art.
The example below illustrates the results of a half adder used to add two operands and a rounding increment bit inserted at the K bit position. In the first example, without a half adder present at the inputs, two operands and a round increment bit can be added as follows:
However, with a half adder which first accepts the two operands (A and B) and converts them to a Sum and Carry string, the following result is obtained:
The final result is the same in each case. However, for the addition without the half adder, notice that the K bit position has to both generate a carry bit and also propagate a carry bit from the L bit position. Thus there are two carries from the K bit position into the J bit position. When the same operands are pre-processed through the half adder stage as shown in the second example, there is only one carry bit created from the K bit position into the J bit position. The result is the same but the physical implementation of the circuit is simplified using a half adder due to the fact that multiple carry bits do not need to be generated and/or propagated which requires additional circuitry and processing time.
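The effect described above can be checked numerically. The sketch below uses hypothetical operand values chosen for illustration (they are not taken from the specification) and counts, column by column, how many one bits must be added at each bit position, including the incoming carry. A column total of four means two carries leave that position; a total of at most three means a single carry suffices. The function names are assumptions.

```python
def add_columns(rows, width):
    """Add bit rows column by column; return (sum, largest column total)."""
    carry, result, worst = 0, 0, 0
    for i in range(width):
        total = carry + sum((row >> i) & 1 for row in rows)
        worst = max(worst, total)
        result |= (total & 1) << i      # sum bit for this column
        carry = total >> 1              # may be 2 when the total is 4 or more
    return result | (carry << width), worst


def half_adder_rows(a, b):
    """One half adder per bit: a + b == sum_row + carry_row."""
    return a ^ b, (a & b) << 1


a, b, round_bit = 0b0111, 0b0011, 0b0010     # increment injected at bit 1

print(add_columns([a, b, round_bit], 5))     # (12, 4): one column must add four bits
s, c = half_adder_rows(a, b)
print(add_columns([s, c, round_bit], 5))     # (12, 3): at most three bits per column
```

Both runs produce the same sum; only the worst case column load differs, which is the simplification the half adder stage provides.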
Prior art rounding adders suffer from a number of problems. The requirement for three separate addition circuits to accommodate calculations for a high increment rounding bit, a low increment rounding bit, and no rounding bit requires extra processing time and space and also uses more power.
A total of four processing steps are required to perform the entire floating point round addition found in the prior art. The four steps used in the prior art result in a slower floating point addition circuit which in turn results in slower overall floating point mathematical calculations. As will be explained, the present invention provides a mechanism to condense the number of steps needed to perform the same overall operation to one step and eliminates the need to calculate an addend using no round increment bit.
The invention overcomes the shortcomings of prior art rounding adders. The invention uses full adders at those bit positions which must accommodate each operand as well as a rounding increment bit. Since the rounding bit is handled by the full adders, multiple carries from a single bit position continue to be avoided. The use of full adders also eliminates certain steps that are required in the prior art floating point addition operation. Specifically, special adder circuits that provide addition of three bits in certain bit positions, and the need for threshold logic are eliminated. The steps of addition, adjusting, rounding and then further adjusting can be combined in a more standard carry propagate adder.
The invention also eliminates the need for performing a no rounding increment addition calculation. This third and unnecessary computation is removed by this invention as a result of the discovery that the most significant bit of the addition result produced from adding a rounding increment bit at a low increment bit position can be used to select a correct result from either the low or a high rounding increment addition result.
More specifically, the present invention provides a method and apparatus for performing rounded floating point additions on first and second operands. The apparatus is called a rounding adder circuit.
The rounding adder circuit includes a low increment adder circuit that accepts as input the first and second operands and a low increment bit injected into a first pre-selected low order bit position. The first pre-selected low order bit position is selected based upon a function of the rounding mode in effect and upon the desired mathematical operation being performed. The low increment adder circuit adds the first and second operands and the low increment bit and accounts for any carry bits generated from the addition and produces a low increment result. A low increment sum logic circuit is included and performs sum logic functions on the low increment result based upon the desired mathematical operation to produce a final low increment result.
The rounding adder circuit also includes a high increment adder circuit accepting as input the first and second operands and a high increment bit injected into a second pre-selected low order bit position. The second pre-selected low order bit position is also selected as a function of the rounding mode in effect and the desired mathematical operation being performed. The high increment adder circuit adds the first and second operands and the high increment bit, and accounts for any carry bits generated and produces a high increment result. A high increment sum logic circuit performs sum logic functions on the high increment result based upon the desired mathematical operation to produce a final high increment result. An output selection circuit selects either the final low increment result or the final high increment result depending upon a most significant bit of the final low increment result.
Through the use of only a high and low increment addition circuit, with the final result being selected based upon the most significant bit of the low increment result, the rounding adder eliminates the prior art requirement of a no increment addition circuit. This simplifies floating point unit circuit design and reduces real estate and power requirements on a microprocessor implementation of the rounding adder circuit.
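The selection principle can be modeled in a few lines. The sketch below is an arithmetic model under stated assumptions, not the circuit itself: it assumes VAX style Normal Rounding, so the low increment is a one at the half-unit position of a normalized sum and the high increment is a one at the next higher position. A straightforward reference rounding is included for comparison. All function and parameter names are hypothetical.

```python
def round_add_two_adders(a, b, width, guard_bits):
    """Two additions; the top bit of the LOW increment result selects the output."""
    low_sum = a + b + (1 << (guard_bits - 1))
    high_sum = a + b + (1 << guard_bits)
    if low_sum >> width:                       # top bit of the low result is set
        fraction, exponent_adjust = high_sum >> (guard_bits + 1), 1
    else:
        fraction, exponent_adjust = low_sum >> guard_bits, 0
    if fraction >> (width - guard_bits):       # carry out of the high result
        fraction >>= 1
        exponent_adjust += 1
    return fraction, exponent_adjust


def reference_round(total, width, guard_bits):
    """Normalize first, then round half up, then re-normalize if needed."""
    shift = guard_bits + (1 if total >> width else 0)
    fraction = (total + (1 << (shift - 1))) >> shift
    exponent_adjust = shift - guard_bits
    if fraction >> (width - guard_bits):
        fraction >>= 1
        exponent_adjust += 1
    return fraction, exponent_adjust


# A sum just below the top of the normalized range rounds up past it.
print(round_add_two_adders(0b1111100, 0, 7, 3))   # (8, 1)
print(reference_round(0b1111100, 7, 3))           # (8, 1)
```

In this model no addition without an increment is ever formed. When the low increment itself carries into the top bit, the high increment result truncated one position higher yields the same rounded value, which is why the low result's most significant bit is a sufficient selector here.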
Another advantage of the invention is that the low increment adder circuit and the high increment adder circuit share a high order bit addition circuit. This single high order bit addition circuit includes half adders coupled in sequence, with one half adder per high order bit position of the first and second operands. Each half adder accepts as input a respectively positioned high ordered bit from each of the first and second operands. Each half adder performs an addition operation and produces a half adder result for that bit position. By using half adders where there are only two inputs, and full adders where there are three inputs, the invention circuit accommodates the rounding increment bits more efficiently than prior art rounding adders.
A series of high order propagate-generate-kill (PGK) circuits coupled in sequence is also included in the high order bit addition circuit. In particular there is one propagate-generate-kill circuit per high order bit position of the operands. Each high order propagate-generate-kill circuit accepts as input the half adder result from the half adder in its respective bit position and performs a process of either propagating, generating or killing a carry bit for its respective bit position to produce a high order PGK result.
For addition of the low order bits of the operands, which are the lowest four bit positions in the preferred embodiment, the low order bit addition circuit provides a plurality of low increment full adders coupled in sequence, one per low order bit position of the first and second operands. Each low increment full adder accepts as input a respectively positioned low ordered bit from the first operand, a respectively positioned low ordered bit from the second operand, and a low increment bit. Each low increment full adder performs an addition operation and produces a low increment full adder result for that bit position.
Also part of the low increment adder circuit and coupled to the low increment full adders are a plurality of low increment propagate-generate-kill circuits coupled in sequence, one per low order bit position in the operands. Each low increment propagate-generate-kill circuit accepts the low increment full adder result from the full adder in its respective bit position and performs a process of either propagating, generating or killing a carry bit for its respective bit position, to produce a low increment PGK result.
Existing and operating in symmetry with and in parallel to the low increment adder circuit is a high increment adder circuit. The high increment adder circuit construction is the same as the low increment adder circuit except that the high increment adder circuit accepts as input a high increment bit as the third input at each full adder, instead of a low increment bit, and the high increment adder circuit produces a high increment PGK result.
The low order bit addition circuit includes low and high increment carry logic circuits which accept as input the respective low and high increment PGK results. The low and high increment carry logic circuits operate in parallel and are symmetrical and each determines if a respective low or high increment carry bit is present in the respective low or high increment PGK result. If so, the low and high increment carry logic circuits output the respective low or high increment carry bit and a low order low or high increment result.
For the high order bits, the high order PGK result is input into a dual carry logic circuit. One part of the dual carry logic is a low increment carry chain which combines the high order PGK result with the low increment carry bit to propagate the low increment carry bit within the high order PGK result to produce a high order low increment result. A second part of the dual carry logic circuit is a high increment carry chain which combines the high order PGK result with the high increment carry bit, to propagate the high increment carry bit within the high order PGK result, to produce the high order high increment result.
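The sharing of one set of high order PGK signals between two carry chains can be illustrated with the arithmetic model below. It is a simplified sketch, not a gate level description: it uses the textbook propagate, generate and kill definitions, applies a half adder stage to the whole operands, and handles the low order bits with an ordinary integer addition rather than individual full adders. The constant `LOW_BITS` reflects the four low order bit positions mentioned in the text; every function name is an assumption, and the increments are assumed to be single bits within the low order positions.

```python
LOW_BITS = 4  # low order bit positions handled separately


def pgk_signals(x, y, width):
    """Classify each bit position as Generate, Propagate, or Kill."""
    signals = []
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        signals.append("G" if xi & yi else "P" if xi ^ yi else "K")
    return signals


def resolve_chain(signals, carry_in):
    """Run one carry chain over the PGK signals; return (sum_bits, carry_out)."""
    carry, total = carry_in, 0
    for i, sig in enumerate(signals):
        total |= (int(sig == "P") ^ carry) << i    # sum bit = propagate XOR carry in
        carry = 1 if sig == "G" else carry if sig == "P" else 0
    return total, carry


def dual_increment_add(a, b, width, low_inc, high_inc):
    """Return (a + b + low_inc, a + b + high_inc) using shared high order PGK."""
    mask = (1 << LOW_BITS) - 1
    s, c = a ^ b, (a & b) << 1                     # half adder stage: a + b == s + c
    high_width = width + 1 - LOW_BITS
    shared = pgk_signals(s >> LOW_BITS, c >> LOW_BITS, high_width)  # computed once
    results = []
    for inc in (low_inc, high_inc):
        low = (s & mask) + (c & mask) + inc        # low order bits plus increment
        high, carry_out = resolve_chain(shared, low >> LOW_BITS)
        results.append((carry_out << (width + 1)) | (high << LOW_BITS) | (low & mask))
    return tuple(results)


print(dual_increment_add(0b101101, 0b011011, 6, 0b0100, 0b1000))  # (76, 80)
```

Only the carry entering the high order bits differs between the two results, so the PGK classification is done once and each chain merely propagates its own carry through it.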
The sum logic circuits for the high and low order bits ensure that the carry bits generated from the addition operation are properly accounted for in the results of the operand additions for both the high increment result and the low increment result.
The low increment sum logic circuitry includes low order low increment sum logic circuitry and high order low increment sum logic circuitry. The high increment sum logic circuitry includes low order high increment sum logic circuitry and high order high increment sum logic circuitry. Each of these sum logic circuits performs sum logic functions on respective results to produce a final low increment result and a final high increment result.
Accordingly, after the invention has added the operands in combination with both the low and high increment rounding bits and has performed any necessary sum logic, two final results are present. The invention then uses the most significant bit of the final low increment result to select one of the final high or low increment results as being the correct final result. In addition, during the final result selection, the invention also combines the steps of shifting and adjusting after the round operation.
The invention provides the advantages of being able to inject a rounding bit or other increment bit at any bit position, by using full adders at those bit positions where two operand bits and an increment bit are to be received. The precise bit positions at which the high and low increment bits are injected into the addition operation are dependent upon the desired mathematical operation, the rounding mode in effect, and the input operands. By selecting the proper pre-determined positions, the rounding adder circuit computes a correct result with no need for the third no-increment addition operation of the prior art. In effect, the four steps of the prior art are reduced to one step by this invention.