Floating-point performance is a key focus of modern microprocessor architecture. Among the four basic floating-point operations (addition, subtraction, multiplication, and division), division is the most resource-intensive operation for microprocessor architectures. Recently, advances have been achieved in making very high radix digit recurrence algorithms practical to implement, where the radix is the number base, e.g., 2 (binary), 10 (decimal), 16 (hexadecimal), and the like. By very high radix, we mean that the number of quotient bits generated by each iteration of the algorithm is much larger than in traditional algorithms, which typically yield 1 bit (radix-2), 2 bits (radix-4), 3 bits (radix-8), or 4 bits (radix-16) per iteration. It is practical for these very high radix division algorithms to generate on the order of 10 bits (radix-1024) or 20 bits (radix-1048576) per iteration. One common drawback of these algorithms, however, is that the internal data width grows in a somewhat unnatural way.
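The relationship between bits per iteration and radix can be checked with a few lines of Python. This is a minimal sketch only: the function name radix_summary and the iteration estimate (ceiling of 53 divided by the bits per iteration, ignoring any guard or rounding bits a real design would add) are illustrative assumptions, not part of any described implementation.

```python
def radix_summary(bits_per_iteration, significand_bits=53):
    """Return (radix, estimated iterations) for a given bits-per-iteration count.

    Assumption: the iteration estimate is just ceil(significand_bits / bits),
    with no allowance for guard or rounding bits.
    """
    radix = 1 << bits_per_iteration  # radix = 2**bits_per_iteration
    iterations = -(-significand_bits // bits_per_iteration)  # integer ceiling
    return radix, iterations


if __name__ == "__main__":
    for bits in (1, 2, 3, 4, 10, 20):
        radix, iterations = radix_summary(bits)
        print(f"{bits:2d} bits/iteration: radix-{radix}, about {iterations} iterations")
```

Under these assumptions, moving from 4 bits to 10 or 20 bits per iteration cuts the estimated iteration count for a 53-bit significand from 14 to 6 or 3.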
For example, traditional digit recurrence division algorithms have as their central computational step the update of the remainder: R_{j+1} = r × R_j − q_{j+1} × Y. Here, R_j is the remainder after iteration j, r is the radix, q_{j+1} is the quotient digit selected in iteration j+1, and Y is the divisor (i.e., the denominator). The bulk of the work is in computing the product q_{j+1} × Y. The width of Y remains fixed while the width of q_{j+1} grows with the radix. Generally, the radix is an integral power of 2, so r = 2^m for some integer m. When this is the case, the multiplier needs to handle an m-by-L multiplication, where L is the data width of the precision in question (e.g., L = 53 for Institute of Electrical and Electronics Engineers (IEEE) double precision). In other words, the depth of the multiplier is the number of additional quotient bits the algorithm generates per iteration, and the width of the multiplier is fixed at the data width of the precision in question.
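The remainder update can be illustrated with a short Python sketch on integers. Several simplifications are assumed here and are not taken from the text: the quotient digit is chosen exactly by integer division (hardware implementations typically select digits from an approximation), the operands are non-negative integers with X < Y, and the function name digit_recurrence_divide is hypothetical.

```python
def digit_recurrence_divide(x, y, m, iterations):
    """Radix-2**m digit recurrence division of x by y, with 0 <= x < y.

    Returns (quotient, remainder) such that
    x * (2**m)**iterations == quotient * y + remainder.

    Assumption: exact digit selection via integer division, which is a
    simplification of what a hardware divider would do.
    """
    if y <= 0 or not (0 <= x < y):
        raise ValueError("requires 0 <= x < y")
    remainder = x          # R_0
    quotient = 0
    for _ in range(iterations):
        shifted = remainder << m            # r * R_j, with r = 2**m
        digit = shifted // y                # q_{j+1}; always below 2**m since R_j < Y
        remainder = shifted - digit * y     # R_{j+1} = r * R_j - q_{j+1} * Y
        quotient = (quotient << m) | digit  # append m new quotient bits
    return quotient, remainder


if __name__ == "__main__":
    q, rem = digit_recurrence_divide(1, 3, m=4, iterations=3)
    print(q, rem)  # 12 quotient bits of 1/3 and the final remainder
```

The only multiplication in the loop is digit * y, an m-bit value times the full-width divisor, which is the m-by-L product discussed above.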
For traditional digit recurrence division algorithms, only the depth of the multiplier grows with the radix of the algorithm. The unnatural growth with recently developed very high radix division algorithms occurs because the multiplier for these operations must instead be able to handle an m-by-(L+m) multiplication. That is, the width of the multiplier grows as well. This requirement is a direct outcome of a crucial “pre-scaling” step (e.g., scaling by an approximation of the divisor or denominator reciprocal, discussed in the Detailed Discussion Section below) that makes this class of algorithms practical to implement. While it is generally accepted that only the depth of the multiplier affects division operation speed, the growth of the width nevertheless leads to a number of drawbacks. The more obvious drawbacks are increased space and increased power consumption. The less obvious but increasingly important drawback is the need for a customized multiplier and/or adder rather than those most naturally found in standard cell libraries for the precision width L in question.
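The difference between the two multiplier shapes can be made concrete with a small Python sketch. The helper names multiplier_shape and partial_product_bits are hypothetical, and the depth-times-width product is only a rough, assumed proxy for multiplier array size, not a measured area or power figure.

```python
def multiplier_shape(m, L, prescaled):
    """Return (depth, width) of the multiplier in bits.

    Traditional digit recurrence: m-by-L.
    Pre-scaled very high radix:   m-by-(L + m).
    """
    depth = m
    width = L + m if prescaled else L
    return depth, width


def partial_product_bits(m, L, prescaled):
    """Rough size proxy (assumption): depth * width partial-product bits."""
    depth, width = multiplier_shape(m, L, prescaled)
    return depth * width


if __name__ == "__main__":
    L = 53  # IEEE double-precision significand width
    for m in (10, 20):
        plain = partial_product_bits(m, L, prescaled=False)
        scaled = partial_product_bits(m, L, prescaled=True)
        print(f"m={m}: {m}-by-{L} -> {plain} bits, {m}-by-{L + m} -> {scaled} bits")
```

By this proxy, the extra m columns of width add m × m partial-product bits per iteration, so the overhead grows quadratically with the number of bits generated per iteration.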
Therefore, there is a need for improved implementations and techniques for radix division. These implementations and techniques should be as fast as some of the recent very high radix division algorithms, while maintaining the width of the multiplier so that space usage and power consumption are reduced when compared to existing and conventional radix division implementations.