When operating with binary numbers representing fractional values, there is often a need to perform rounding on the number. This is particularly the case if a fractional number needs to be converted to an integer.
There are several different types of rounding scheme that can be used. For example, a round-to-nearest scheme rounds the fractional number to the nearest integer. Therefore, the number 2.4 is rounded to 2, and the number 2.6 is rounded to 3. A problem with a round-to-nearest scheme is that it is biased towards rounding up: results having a fractional part of exactly ½ (i.e. numbers that can be written as x.5 in decimal, where x is any integer) are always rounded up, and therefore a larger proportion of fractional numbers are rounded up than are rounded down.
This problem with round-to-nearest can be overcome by using a round-to-nearest-even scheme. With the round-to-nearest-even scheme, if the result ends in exactly x.5, then the result is rounded to the nearest even number. For example, the number 1.5 is rounded up to 2, the number 2.5 is rounded down to 2, the number 3.5 is rounded up to 4, and the number 4.5 is rounded down to 4. Therefore, it can be seen that there is no overall bias towards rounding the number up or down.
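The difference in bias between the two schemes can be illustrated with a short sketch. The Python below is an illustration only and is not part of the arithmetic units discussed later; the names `round_half_up` and `round_half_even` are hypothetical, and exact fractions are used so that the x.5 cases are represented without error.

```python
import math
from fractions import Fraction

HALF = Fraction(1, 2)

def round_half_up(x: Fraction) -> int:
    """Round-to-nearest with ties always rounded up: floor(x + 0.5)."""
    return math.floor(x + HALF)

def round_half_even(x: Fraction) -> int:
    """Round-to-nearest-even: ties go to the nearest even integer."""
    lower = math.floor(x)
    fraction = x - lower
    if fraction != HALF:
        return lower + 1 if fraction > HALF else lower
    return lower if lower % 2 == 0 else lower + 1

# The four tie cases used in the text: 1.5, 2.5, 3.5 and 4.5.
ties = [Fraction(3, 2), Fraction(5, 2), Fraction(7, 2), Fraction(9, 2)]
half_up = [round_half_up(t) for t in ties]
half_even = [round_half_even(t) for t in ties]

# Net drift introduced by rounding, summed over the tie cases.
drift_half_up = sum(half_up) - sum(ties)
drift_half_even = sum(half_even) - sum(ties)
```

Over these four ties the always-round-up rule accumulates a net drift, whereas under round-to-nearest-even the upward and downward roundings cancel.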
Other rounding methods include: round-towards-zero, where positive numbers are rounded down and negative numbers are rounded up; round-towards-positive-infinity, where both positive and negative numbers are rounded up; and round-towards-negative-infinity, where both positive and negative numbers are rounded down.
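As a brief illustration of these three directed schemes, the sketch below maps each one onto a standard Python library function. The wrapper names are hypothetical and chosen only to mirror the terms used above.

```python
import math
from fractions import Fraction

def round_towards_zero(x: Fraction) -> int:
    """Positive numbers are rounded down, negative numbers are rounded up."""
    return math.trunc(x)

def round_towards_positive_infinity(x: Fraction) -> int:
    """Both positive and negative numbers are rounded up."""
    return math.ceil(x)

def round_towards_negative_infinity(x: Fraction) -> int:
    """Both positive and negative numbers are rounded down."""
    return math.floor(x)

positive = Fraction(5, 2)    # 2.5
negative = Fraction(-5, 2)   # -2.5
```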
Rounding often needs to be performed after an arithmetic operation, particularly but not exclusively multiplication. For example, two binary numbers may need to be multiplied and the result rounded. The instruction used to perform this operation is known as a MULFRAC instruction. FIG. 1 shows a typical arithmetic unit 100 for performing multiplication and rounding. Two 32-bit operands 102 and 104 are input to a multiplier array 106. The multiplier array 106 encodes the operands and produces partial products, which are then summed together, as is known in the art. The multiplication of two 32-bit numbers results in a 64-bit number. An example multiplier array may comprise a Booth recoder for encoding the operands and producing the partial products, and a Wallace tree for summing the partial products.
The use of a Booth recoder reduces the number of terms representing the operand. For example a 32-bit number may be reduced to 17 terms or fewer by a Booth recoder. The partial products are generated by multiplying the second operand by each of the Booth recoded terms to produce a partial product term. Therefore, if a 32-bit number is multiplied by an operand that has been Booth recoded to 17 terms, then 17 64-bit partial products are generated. These 17 partial products are then summed together. If a Wallace tree is used to sum the partial products then this produces 64 sum bits and 63 carry bits. These two sets of bits are shown at the output of the multiplier array 106.
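The reduction to 17 terms can be sketched as follows. The text does not state which form of Booth recoding is used, so this sketch assumes radix-4 (modified) Booth recoding of an unsigned operand; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the summation of partial products is modelled with ordinary integer addition rather than a Wallace tree.

```python
from typing import List

def booth_radix4_digits(value: int, width: int = 32) -> List[int]:
    """Recode an unsigned, even-width operand into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; a 32-bit operand yields 17 digits.
    """
    if width % 2 or not 0 <= value < (1 << width):
        raise ValueError("expected an unsigned operand of even width")

    def bit(i: int) -> int:
        # Bits outside the operand (below bit 0 or above the top bit) read as 0.
        return (value >> i) & 1 if 0 <= i < width else 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(width // 2 + 1)]

def booth_multiply(recoded_operand: int, other_operand: int) -> int:
    """Sum one partial product per Booth digit (digit i has weight 4**i)."""
    digits = booth_radix4_digits(recoded_operand)
    partial_products = [other_operand * d * 4 ** i for i, d in enumerate(digits)]
    return sum(partial_products)

a, b = 0xDEADBEEF, 0x12345678
digits = booth_radix4_digits(a)
```

Each digit selects 0, ±1 or ±2 times the other operand, which in hardware amounts to a shift and an optional negation rather than a full multiplication.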
The sum bits and the carry bits are then added together by an adder 108 to produce the final result of the multiplication. The final result is a 64-bit number. This number then needs to be rounded.
A round-to-nearest operation can be performed by adding the decimal value 0.5 to the result and removing the fractional part of the number. For example, if the decimal value of the multiplication result was 2.467, then this should be rounded to 2. By adding 0.5 to 2.467 the value is 2.967, and the integer part of the number is 2. If the decimal value of the multiplication result was 2.671, then this should be rounded to 3. By adding 0.5 to 2.671 the value is 3.171, and the integer part of the number is 3.
This can be performed on the binary number given knowledge of the location of the radix point. For example, if the radix point is between bits 30 and 31 in the multiplication result, then bit 31 represents the value 1 in decimal, and bit 30 represents the value 0.5 in decimal. Therefore, adding a “1” bit at bit position 30 adds the decimal value 0.5 to the result. This method works for both unsigned binary numbers and signed numbers using 2's complement arithmetic.
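A minimal sketch of this operation is given below, assuming a 64-bit result with the radix point between bits 30 and 31 as in the example above. The helper names `to_fixed` and `round_to_nearest` are hypothetical. Values that are exactly representable in binary (2.25, 2.75 and so on) are used in place of 2.467 and 2.671, which are not.

```python
FRAC_BITS = 31                     # radix point sits between bits 30 and 31
HALF_BIT = 1 << (FRAC_BITS - 1)    # a "1" at bit position 30, i.e. 0.5
WIDTH = 64
MASK = (1 << WIDTH) - 1

def to_fixed(numerator: int, denominator: int) -> int:
    """Encode numerator/denominator as a 64-bit 2's complement fixed-point word.

    The value must be exactly representable with 31 fractional bits.
    """
    scaled, remainder = divmod(numerator << FRAC_BITS, denominator)
    if remainder:
        raise ValueError("value is not exactly representable")
    return scaled & MASK

def round_to_nearest(word: int) -> int:
    """Add 0.5 (bit 30), drop the fractional bits, and return a signed integer."""
    integer_bits = ((word + HALF_BIT) & MASK) >> FRAC_BITS   # bits 63..31
    sign_bit = 1 << (WIDTH - FRAC_BITS - 1)
    if integer_bits & sign_bit:
        return integer_bits - (sign_bit << 1)
    return integer_bits
```

For example, `round_to_nearest(to_fixed(9, 4))` rounds 2.25 and `round_to_nearest(to_fixed(-11, 4))` rounds -2.75; the same addition serves both because the word is held in 2's complement form.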
This operation is achieved in FIG. 1 by splitting the result of the multiplication into two groups of bits, as shown at the output of the adder 108. The first group of bits corresponds to bits 29 to 0 of the result. This group does not need to be processed further and is provided to the output of the system 100. The second group of bits corresponds to bits 63 to 30. These bits are input to an incrementer 110. The incrementer 110 increments the bits by one. Since bit 30 is the least significant bit provided to the incrementer 110, this is the equivalent of adding a “1” bit to bit position 30, and therefore adds 0.5 in decimal to the result. The outputs of the incrementer 110 are the incremented bits 63 to 30, which can then be joined to the remaining bits 29 to 0 to provide the overall rounded result.
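The split, increment and rejoin steps of FIG. 1 can be modelled in software as sketched below. The function name `fig1_round_stage` is hypothetical, and the sketch assumes that any carry out of bit 63 is discarded, which the text does not state.

```python
MASK64 = (1 << 64) - 1

def fig1_round_stage(result: int) -> int:
    """Split the 64-bit adder result, increment bits 63..30, and rejoin."""
    low = result & ((1 << 30) - 1)       # bits 29..0 pass straight through
    high = result >> 30                  # bits 63..30 feed the incrementer
    # Assumption: the incrementer is 34 bits wide and drops its carry out.
    high_incremented = (high + 1) & ((1 << 34) - 1)
    return (high_incremented << 30) | low
```

Under that assumption the stage produces the same word as adding the constant `1 << 30` to the full 64-bit result.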
The problem with the system shown in FIG. 1 is that additional logic delay is introduced into the system by the incrementer 110. The amount of logic delay introduced can be reduced by using a known system such as that shown in FIG. 2. This system 200 takes the same operands 102 and 104, which are multiplied in the same multiplier array 106 as in FIG. 1. The output of the multiplier array 106 is input to a full adder block 202. Also input to the full adder block 202 is a binary number which consists of all zeroes except for a “1” at bit position 30. This can be represented as 0x0000000040000000 in hexadecimal. When the bits are summed in the full adder block 202, a “1” bit is added to bit position 30, which performs the rounding as described previously. The output of the full adder block 202 is then input to an adder 204 (similar to the adder 108 in FIG. 1) where the sum and carry bits are added to produce the final result.
As the delay through the full adder block 202 is less than that through the incrementer, a reduced amount of logic delay is introduced compared to the system shown in FIG. 1. However, performing the rounding process using the full adder block does nevertheless incur a logic delay.
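The FIG. 2 arrangement can be sketched as a row of independent full adders that folds the rounding constant into the sum and carry vectors before the final addition. The names `full_adder_row` and `fig2_round_and_add` are hypothetical, and for simplicity both vectors are modelled as 64 bits wide with arithmetic taken modulo 2^64.

```python
MASK64 = (1 << 64) - 1
ROUNDING_CONSTANT = 0x0000000040000000   # a single "1" at bit position 30

def full_adder_row(x: int, y: int, z: int):
    """Compress three 64-bit vectors into a sum vector and a carry vector.

    Each bit position is an independent full adder, so no carry ripples
    along the word; the carry outputs are simply wired one position left.
    """
    sum_bits = x ^ y ^ z
    carry_bits = (((x & y) | (x & z) | (y & z)) << 1) & MASK64
    return sum_bits, carry_bits

def fig2_round_and_add(sum_bits: int, carry_bits: int) -> int:
    """Inject the rounding constant, then perform the final addition."""
    s, c = full_adder_row(sum_bits, carry_bits, ROUNDING_CONSTANT)
    return (s + c) & MASK64
```

Because every bit position is compressed independently, the added delay in this model is that of a single full adder rather than a carry chain across 34 bits.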
FIG. 3 shows a known arithmetic unit for implementing the round-to-nearest-even scheme. As mentioned above, round-to-nearest-even is used where the number ends in exactly x.5. This can be slow to implement as the decision on whether to use the round-to-nearest-even scheme is only made once the value of the number is known, i.e. whether it ends in x.5.
The unit 300 shown in FIG. 3 performs a similar multiplication operation to that shown in FIG. 1. Two 32-bit operands 102 and 104 are input to a multiplier array 106, which outputs the sum and carry bits as described previously. The sum and carry bits are input to an adder 302, and these are summed to produce the result of the multiplication. The result of the multiplication then needs to be rounded, using the round-to-nearest-even scheme if applicable.
The decision on whether to use the round-to-nearest-even operation can be made by observing one bit to the left of the radix point and all the bits to the right of the radix point. For example, if the radix point is between bits 30 and 31 in the multiplication result, then bit 31 represents the value 1 in decimal. Furthermore, if bit 31 is set to “1” then it means that the number is odd, and if it is set to “0” then the number is even. Bit 30 represents the value 0.5 in decimal. Therefore, if bit 30 is set then the number may end in x.5. Bits 29 to 0 represent fractions less than 0.5, specifically ¼, ⅛, 1/16, . . . , 1/2147483648 for bits 29, 28, 27, . . . , 0. Therefore, if bit 30 is set and all of bits 29 to 0 have the value zero, then the result ends in exactly x.5. If, however, any of bits 29 to 0 do not have the value zero, then the result does not end in exactly x.5.
The round-to-nearest-even operation then operates as follows. If all of bits 29 to 0 have the value zero, bit 30 is set, and bit 31 is zero (i.e. the integer part is even), then the result is not rounded up. In other words, no rounding bit is added, and removing the fractional part leaves the even integer. In summary, if the number is of the form xxx . . . xx0.100 . . . 000 in binary (where x can be either a “1” or “0”) then no rounding bit is added.
However, if the result is not of the form xxx . . . xx0.100 . . . 000 then a rounding bit is always added. For example, if all of bits 29 to 0 have the value zero and bit 30 is set, but bit 31 is also set (i.e. the integer part is odd), then the number is rounded up. The rounding up is performed by adding a “1” bit into bit position 30 (i.e. the decimal equivalent of adding 0.5) as discussed with regard to FIGS. 1 and 2. Since it is known that the number ends in exactly x.5, this increases the number to the next higher integer. In summary, if the number is of the form xxx . . . xx1.100 . . . 000 in binary (where x can be either a “1” or “0”) then the number is rounded up by adding a “1” in bit position 30, immediately to the right of the radix point.
Furthermore, if any of bits 29 to 0 do not have the value zero, then the number does not end in exactly x.5, and ordinary round-to-nearest rounding is performed by adding a “1” bit to bit position 30, as was discussed with regard to FIGS. 1 and 2.
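The decision rule described above can be sketched and checked against a reference as follows. The function names are hypothetical, the word is treated as unsigned with 31 fractional bits, and Python's built-in `round` on a `Fraction` (which rounds halves to even) is used only as an independent reference.

```python
from fractions import Fraction

FRAC_BITS = 31   # radix point between bits 30 and 31

def add_rounding_bit(word: int) -> bool:
    """Return False only for words of the form xxx...xx0.100...000."""
    bit31 = (word >> 31) & 1                     # integer part odd (1) or even (0)
    bit30 = (word >> 30) & 1                     # the 0.5 position
    low_bits_zero = (word & ((1 << 30) - 1)) == 0
    exact_even_tie = bit30 == 1 and low_bits_zero and bit31 == 0
    return not exact_even_tie

def round_to_nearest_even(word: int) -> int:
    """Round an unsigned fixed-point word with 31 fractional bits."""
    if add_rounding_bit(word):
        word += 1 << 30                          # add 0.5
    return word >> FRAC_BITS                     # drop the fractional bits

def reference(word: int) -> int:
    """Independent reference: exact value rounded half-to-even."""
    return round(Fraction(word, 1 << FRAC_BITS))
```

Only one bit pattern suppresses the rounding bit; every other case reduces to the round-to-nearest operation of FIGS. 1 and 2.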
Referring back to FIG. 3, the output of the adder 302 is split into two sets of bits. The first set comprises bits 31 to 0. Of these bits, bits 29 to 0 are provided to the final output of the system 300. Bits 31 to 0 are also input to an examination block 304. The examination block 304 performs the decision discussed above. In particular, bits 29 to 0 are logically ORed together, to determine if they all have the value zero. The result of the OR of bits 29 to 0 is referred to as the “sticky bit”. If the sticky bit is “0” and bit 30 is “1”, then the round-to-nearest-even scheme is applied. Bit 31 is then examined to see if the result should be rounded up or not.
The output of the examination block 304 is a signal that is provided to an incrementer 306. The incrementer 306 takes as input the other set of bits from the adder 302. This set of bits comprises bits 63 to 30. The incrementer 306 performs the same operation as the incrementer in FIG. 1, in that it adds “1” to bit position 30. However, it only increments the bits if it is signalled to do so by the examination block 304. Therefore, if the examination block 304 determines that the number ends in exactly x.5 and has an even integer part, then it signals to the incrementer 306 not to round the result. Otherwise, it signals to the incrementer 306 to round the result. The output of the incrementer 306 is the bits 63 to 30, which may be combined with the bits 29 to 0 to provide the output of the system 300.
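The FIG. 3 datapath, in which the examination block gates the incrementer, can be modelled as sketched below. The function names are hypothetical stand-ins for blocks 304 and 306, and the sketch assumes that any carry out of bit 63 is discarded.

```python
def examination_block(bits_31_to_0: int) -> bool:
    """Return True when the incrementer should add the rounding bit."""
    sticky_bit = int((bits_31_to_0 & ((1 << 30) - 1)) != 0)   # OR of bits 29..0
    bit30 = (bits_31_to_0 >> 30) & 1
    bit31 = (bits_31_to_0 >> 31) & 1
    even_tie = sticky_bit == 0 and bit30 == 1 and bit31 == 0
    return not even_tie

def conditional_incrementer(bits_63_to_30: int, enable: bool) -> int:
    """Add "1" at bit position 30 (the LSB of this slice) only when enabled."""
    # Assumption: a 34-bit incrementer whose carry out is dropped.
    return (bits_63_to_30 + int(enable)) & ((1 << 34) - 1)

def fig3_round_stage(result: int) -> int:
    """Route the 64-bit adder result through both blocks and rejoin the bits."""
    enable = examination_block(result & 0xFFFFFFFF)       # bits 31..0
    rounded_high = conditional_incrementer(result >> 30, enable)
    return (rounded_high << 30) | (result & ((1 << 30) - 1))
```

The model makes the serial dependency visible: `conditional_incrementer` cannot start until `examination_block` has reduced all of bits 29 to 0 to a single sticky bit.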
The system shown in FIG. 3 can therefore implement the round-to-nearest-even operation, but does so at the cost of extra logic delay. It can be seen that two logic elements are needed following the adder 302. The operation of the examination block 304, in particular the determination of the sticky bit, adds to the logic delay.
It can therefore be seen that there is a need for a method to implement rounding schemes such as round-to-nearest and round-to-nearest-even as part of an arithmetic (particularly multiplication) operation, without incurring extra logic delay.