This invention relates in general to the field of integrated data processing, and more specifically to an improved method and apparatus for performing division and square root functions in a system using a multiplier and a multipartite table.
The arithmetic unit is one of the most important components of any integrated data processing system. Arithmetic units perform a wide variety of mathematical functions upon operands that are transmitted from other portions of an integrated system. The basic addition, subtraction, and multiplication functions are quickly and efficiently performed in arithmetic units today. Known techniques for performing division and square root functions, however, have not been completely satisfactory with respect to speed and efficiency.
The speed of performing basic mathematical functions is significant to the efficiency of a data processing system, since the system may need to perform basic mathematical functions many thousands of times per second for a variety of common applications. Multimedia applications typically require relatively low precision division and square root operations, while scientific applications require much higher precision for these operations. The IEEE floating point standard specifies accuracy of a 24 bit fraction in a 32 bit word for single precision and a 53 bit fraction in a 64 bit word for double precision for all basic arithmetic operations including division and square root. Reducing the number of cycles needed to perform one division or square root operation reduces the overall required number of cycles, increasing efficiency.
Known methods for performing division and square root operations produce a single precision result after several iterations and a double precision result after additional iterations. These methods typically incur an initial delay for generating a table lookup operand in addition to a sequence of delays due to dependent additions and/or multiplications between successive iterations. Additive methods perform an iteration in one or two cycles, but require tens of cycles for single and double precision results. More aggressive multiplicative methods employ only a handful of iterations, but operation dependencies result in three to five cycles of delay per iteration.
A common method of performing division and square root operations employs SRT procedures in dedicated hardware. This method is used in, for example, the Pentium family of microprocessors manufactured by Intel Corporation. The procedure generates digits iteratively in a small radix similar to long division. The method, due to iterative dependencies, is extremely slow, typically taking some fifteen cycles for single precision and some thirty cycles for double precision to complete a division operation.
Pipelined multipliers can be used to reduce the number of required cycles. Pipelined multipliers complete a multiplication to double precision in three to five cycles. More importantly, pipelined multipliers typically allow a new multiply operation to start every cycle in a procedure analogous to an assembly line. Known division methods, however, require four to ten dependent multiplications for the highest precision result, where a second multiply must wait for the first multiply to finish and feed back the result as a new operand for the subsequent multiply operation. An independent multiplication, in contrast, does not need to wait for the result of a previous multiply operation and can begin at the next cycle. The feedback latency of three to five cycles is the delay time cost for dependent multiplies, compared with just one cycle for independent multiplies.
The convergence methods of division and square root are preferred in a pipelined unit, because they require fewer dependent multiplications. Known convergence methods for division and square root start from a table lookup approximation of the reciprocal of the denominator (for division) or of the square root reciprocal of the operand (for square root) and perform multiplications to refine the approximation, doubling the accuracy with each iteration.
A known convergence technique for division uses a single lookup table and employs three iterations to refine the approximation of a quotient x/y. First, a table lookup value txc2x71/y is determined, then the numerator x and the denominator y of quotient q=x/y are scaled by the table lookup value, yielding:             x      y        ·          t      t        =            q      7.5              (              1        -        ϵ            )      
where *xcex5* # 2xe2x88x927.5, and qn is an approximation of q=x/y to an n-bit accuracy. Here, the accuracy is determined by the table lookup value. Note that the multiplication operation is delayed by the latency of the table lookup value. Second, the result of the first approximation is scaled by (1+xcex5):                     q        7.5                    (                  1          -          ϵ                )              ·                  (                  1          +          ϵ                )                    (                  1          +          ϵ                )              =            q      15              (              1        -                  ϵ          2                    )      
where (1+xcex5) is the 2""s complement of the scaled y value from the previous iteration, that is, (1+xcex5)=2xe2x88x92yt. In the third and final iterations, the approximations are scaled two more times:                     q        15                    (                  1          -                      ϵ            2                          )              ⁢                                                      ·                                                (                                      1                    +                                          ϵ                      2                                                        )                                                  (                                      1                    +                                          ϵ                      2                                                        )                                                      =                        ⁢                                                            q                  30                                                  (                                      1                    -                                          ϵ                      4                                                        )                                            ·                                                (                                      1                    +                                          ϵ                      4                                                        )                                                  (                                      1                    +                                          ϵ                      4                                                        )                                                                                                      =                        ⁢                                          q                60                                            (                                  1                  -                                      ϵ                    8                                                  )                                                          ⁢      xe2x80x83  
where (1+xcex52) and (1+xcex54) are the 2""s complement of the scaled y value from the previous iteration, that is:
(1+xcex52)=2xe2x88x92[yt(2xe2x88x92yt)]
and
(1+xcex54)=2xe2x88x92[yt(2xe2x88x92yt)][2xe2x88x92yt(2xe2x88x92yt)]
Note that the accuracy, xe2x80x9cnxe2x80x9d, doubles with each iteration.
The value q30 determined after the third iteration is sufficient for single precision. The final denominator scaling (1xe2x88x92xcex52)xc3x97(1+xcex52)=(1xe2x88x92xcex54) for single precision or (1xe2x88x92xcex54)xc3x97(1+xcex54)=(1xe2x88x92xcex58) for double precision need not be computed, but this does not lessen the number of dependent multiplies. Well known rounding procedures are then applied to q30 or q60 to determine the final result with several additional cycles delay.
Convergence division implementations typically employ three to four iterations to obtain the desired accuracy, starting with a fairly small table with, for example, a seven to nine bit accuracy. Use of larger tables becomes impractical as the table size grows exponentially with the number of bits. Table computations, such as linear interpolation, allow a larger starting accuracy and a reduced number of iterations at the cost of the multiple cycle delay of the table lookup computations. Known techniques for conversion division still take 15 to 30 cycles due to the succession of pipeline delays for dependent multiplication.
Therefore, a need has arisen for a system that performs division and square root functions and yields approximate quotients and square roots of accuracy sufficient for known single and double precision rounding procedures, but which substantially reduces the number of cycles lost to dependencies between the table lookup and successive multiplications.
In accordance with the present invention, a method and system for performing division and square root functions are provided which substantially eliminate or reduce the disadvantages and problems associated with prior arithmetic techniques used to perform division and square root functions.
The division and square root systems of the present invention comprise a multiplier. The systems also comprise a multipartite table system, a folding inverter, and a complement inverter, each coupled to the multiplier. More specifically, the multipartite lookup table may comprise a tripartite lookup table. The present invention may further comprise data links coupling the parts of the system and a rounder that rounds the final approximation.
According to one embodiment of the present invention, the division and square root functions can be performed using three scaling iterations. The system first determines both the first and the second scaling values. The first scaling value is a folded or semi-complement value of an input denominator (for division) or operand (for square root) y, hereinafter referred to as the xe2x80x9csemi-complement termxe2x80x9d, which is obtained directly from the bit representation of y and thus available in the first cycle after a single logic level. The second scaling value is the value obtained from the table system, hereinafter referred to as the xe2x80x9ctable lookup valuexe2x80x9d. In the first iteration, the system scales the input operand by the semi-complement term. Second, the resulting approximation is scaled by a function of the table lookup value. Third, the approximation is scaled by a value obtained from a function of the semi-complement term and the table lookup value. After the third iteration, the approximation is available for rounding.
In one embodiment of a square root system, the multipartite table may be cut in half, such that for certain denominator values the result of the second iteration is multiplied by an appropriate binade value.
An important technical advantage of the present invention is that it performs the table lookup function concurrently with the first scaling of the numerator and the denominator. That is, the present invention can perform the first scaling step without waiting for the result of the table lookup. Another important technical advantage of the present invention inheres in the fact that it uses a multipartite lookup table. Fast computation of quotient and square root values to double extended precision requires good starting approximations, such that only one or two convergent iterations need to be performed. Good starting approximations can only be obtained by refining the table lookup values, since the size of a direct lookup table would be prohibitive for current technologies. The multipartite table forms the table lookup faster as the sum of 2, 3, or 4 table lookup values. The index sizes for the tables are chosen such that the required accuracy is guaranteed while the total table size is minimized.