Field
In one aspect, the following relates to processor microarchitecture, and in a more particular aspect, to implementations of pipelined execution resources, such as multipliers.
Related Art
An important aspect of computation is the ability to perform arithmetic. Processors, such as general purpose programmable processors, digital signal processors, and graphics processors, generally have a capability to perform arithmetic, such as one or more of integer, fixed point, and floating point arithmetic. The performance of such arithmetic can be controlled by arithmetic instructions, which can vary among different architectures, but for the sake of example, can include add, subtract, multiply, divide, and square root instructions. A particular implementation of such instructions may involve decomposing such operations into operations that are supported on a particular hardware implementation. For example, a particular implementation may not have separate circuitry implementing a floating point multiplier (or more generally, a floating point math unit), and as such, a floating point multiply instruction may be implemented by emulating the instruction in microcode on the processor, within the operating system, or in compiled user-space code. Such emulation is typically much slower than a dedicated hardware floating point unit. However, a hardware floating point unit can consume a large amount of area, and hence increase cost. Nevertheless, as transistor budgets continue to increase, along with the increased usage of floating point, dedicated hardware for arithmetic, including floating point arithmetic, has become more common.
Some kinds of math instructions may be implemented using iterative refinement, so that an intermediate result is refined to a more precise result over multiple passes, until after a certain number of iterations, a result to a required number of bits of precision can be achieved.
As an example, there are several methods used to implement divide and square root functions in computer hardware. One of the most commonly used fast methods is the Newton-Raphson algorithm. For divide, an initial approximation of the reciprocal of the divisor is obtained (e.g., for a/b, an approximation of the reciprocal of b is obtained). For square root, an initial approximation of the reciprocal of the square root of the input is obtained. Then a multiplier is used repeatedly to obtain approximations with higher accuracy. When sufficient accuracy has been obtained, the final result is determined: for divide, the final approximation is multiplied by the dividend, and for square root, it is multiplied by the input.
In particular, for divide, an initial approximation, x0, of the reciprocal of the divisor, b, is improved upon by first computing x0*b→t. Then a better approximation, x1 is calculated: x0*(2−t)→x1. Since x0 is an approximation to 1/b, t is close to 1 and so 2−t may be approximated by complementing the bits of t which can be done quickly. In these circumstances, the multiply module is used repeatedly (e.g., for Newton-Raphson, there are two multiplies for each iteration).
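As an illustration only, the divide recurrence described above can be sketched in software. The function names, the floating point representation, and the seed value passed in as x0 are assumptions made for this sketch; a hardware implementation would operate on fixed-width mantissas and could approximate 2−t by bit complementing rather than by a true subtraction.

```python
def reciprocal_newton(b, x0, iterations):
    """Refine x0, an approximation of 1/b, using x <- x * (2 - x*b)."""
    x = x0
    for _ in range(iterations):
        t = x * b          # first multiply: t is close to 1
        x = x * (2.0 - t)  # second multiply: improved approximation
    return x


def divide(a, b, x0, iterations=4):
    """Compute a/b as a * (1/b); the final step is one more multiply."""
    return a * reciprocal_newton(b, x0, iterations)


# Hypothetical usage: 1/3 with a rough seed of 0.3 for the reciprocal of 3.
quotient = divide(1.0, 3.0, 0.3)
```

Each pass through the loop performs two multiplies, which matches the observation that the multiply module is used repeatedly.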
For square root, an approximation, x0, of the reciprocal of the square root of the input, b, is improved by first computing x0*b→t, then x0*t→s, and then x0*(3−s)/2→x1, x1 being a better approximation. Since x0 is an approximation to the reciprocal of the square root of b, s = x0*x0*b is close to 1, so (3−s)/2 may be obtained quickly using a slightly modified version of the method used for divide. Again, the above shows that the multiply module is used repeatedly.
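The square root recurrence can be sketched in the same way. As before, the function names, the use of ordinary floating point values, and the seed x0 are assumptions for illustration, and the (3−s)/2 term is computed here by plain arithmetic rather than by any fast bit-level method.

```python
def rsqrt_newton(b, x0, iterations):
    """Refine x0, an approximation of 1/sqrt(b), using x <- x * (3 - x*x*b) / 2."""
    x = x0
    for _ in range(iterations):
        t = x * b                # first multiply
        s = x * t                # second multiply: s = x*x*b, close to 1
        x = x * (3.0 - s) / 2.0  # third multiply: improved approximation
    return x


def sqrt_via_rsqrt(b, x0, iterations=5):
    """Compute sqrt(b) as b * (1/sqrt(b)); the final step is one more multiply."""
    return b * rsqrt_newton(b, x0, iterations)


# Hypothetical usage: sqrt(2) with a rough seed of 0.7 for 1/sqrt(2).
root = sqrt_via_rsqrt(2.0, 0.7)
```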
For example, dividing term A by term B (i.e., A/B) can be performed by finding the reciprocal of term B (1/B) using Newton-Raphson, and then multiplying that reciprocal by term A. Implementations of Newton-Raphson often involve using a lookup table (LUT) indexed by a portion of term B to produce an initial approximation of the reciprocal of B. Such an initial approximation has relatively few bits of precision, and the number of bits of precision can be doubled by each Newton-Raphson iteration. Thus, for a double precision division, starting from 7 bits of precision, it can be expected that 3 iterations will be required (yielding 14, 28, and then 56 bits) to achieve at least 53 bits of precision for the mantissa of the double precision result.
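The iteration count follows directly from the doubling of precision, and can be checked with a short sketch. The function name is hypothetical, and the sketch assumes an idealized exact doubling per iteration, ignoring rounding effects that a real implementation would need to account for.

```python
def iterations_needed(initial_bits, target_bits):
    """Count Newton-Raphson iterations, assuming precision doubles each time."""
    bits = initial_bits
    count = 0
    while bits < target_bits:
        bits *= 2
        count += 1
    return count


# Double precision mantissa (53 bits) from a 7-bit seed: 7 -> 14 -> 28 -> 56.
double_precision_iterations = iterations_needed(7, 53)
```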