1. Field of the Invention
Embodiments of the present invention relate generally to computer arithmetic and more specifically to pipelined integer division using floating-point reciprocal.
2. Description of the Related Art
A typical computer system uses at least one central processing unit (CPU) to execute programming instructions associated with the specified function of the computer system. The programming instructions include, without limitation, data storage, data retrieval, branching, looping, and arithmetic operations. To optimize program execution performance, many conventional CPUs incorporate dedicated hardware resources that efficiently perform frequently encountered arithmetic operations, such as integer addition, subtraction, and multiplication, which have an important impact on overall performance. Integer division, however, is used infrequently enough that most processor designers choose to avoid the expense of dedicated hardware resources. In such cases, integer division is typically provided by a performance-optimized software implementation.
Certain advanced computer systems augment the processing capability of a general purpose CPU with a specialty processor, such as a graphics processing unit (GPU). Each GPU may incorporate one or more processing units, with higher performance GPUs having 16 or more processing units. GPUs and CPUs are generally designed using similar architectural principles, including a careful allocation of hardware resources to maximize performance while minimizing cost. Furthermore, the arithmetic operations typically selected for execution on dedicated GPU hardware resources tend to mirror the arithmetic operations executed on dedicated CPU hardware resources. Thus, as on many CPUs, integer division, which is less frequently used in GPU applications, is typically implemented in software for execution on the GPU.
Software-based integer division may be performed by software executing integer instructions alone or a combination of integer and floating-point instructions. For example, the classical shift-and-subtract algorithm using integer machine instructions typically computes no more than one result bit per step, where each step typically includes one to three machine instructions, depending on machine architecture. One solution to improve integer division performance uses one floating-point reciprocal (1/x) function to implement integer division, provided the bit-width of the floating-point mantissa is larger than the bit-width of the integer being processed. However, the standard single-precision floating-point mantissa is only 24 bits wide, whereas an integer value is typically 32 bits wide, precluding the use of this approach on most common processors. Another class of solution uses specialty arithmetic operations, such as a floating-point fused-multiply-add (FMA), to facilitate integer division. However, these arithmetic operations are typically not supported by the dedicated hardware resources found on conventional processors, such as commonly available CPUs and GPUs, thereby restricting the usefulness of this class of solution.
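By way of illustration only, the following C sketch shows one possible form of the classical shift-and-subtract approach described above, together with a small check of the 24-bit mantissa limitation. The function names div_shift_subtract and exact_in_float are hypothetical and chosen for this example; the sketch assumes 32-bit unsigned operands, a non-zero divisor, and IEEE 754 single-precision float, and it is not drawn from any particular prior art implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: restoring (shift-and-subtract) unsigned division.
 * Produces exactly one quotient bit per loop iteration.
 * Assumes d != 0; the remainder is written to *rem when rem is non-NULL. */
static uint32_t div_shift_subtract(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);

    uint32_t q = 0;
    uint64_t r = 0; /* wider than 32 bits so the left shift cannot overflow */

    for (int i = 31; i >= 0; --i) {
        /* Shift the next dividend bit (most significant first) into r. */
        r = (r << 1) | ((n >> i) & 1u);

        /* If the divisor fits, subtract it and set this quotient bit. */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }

    if (rem != 0) {
        *rem = (uint32_t)r;
    }
    return q;
}

/* Hypothetical helper: returns 1 if x survives conversion to single
 * precision unchanged. The comparison is done in double, which can hold
 * every 32-bit integer exactly (assuming IEEE 754 formats). */
static int exact_in_float(uint32_t x)
{
    volatile float f = (float)x; /* volatile forces rounding to float */
    return (double)f == (double)x;
}
```

Under these assumptions, div_shift_subtract(100, 7, &r) would yield a quotient of 14 with r equal to 2 after 32 iterations, one per result bit. Similarly, exact_in_float would be expected to accept 16777216 (2^24) but reject 16777217 (2^24 + 1), which illustrates why a 24-bit mantissa cannot represent every 32-bit integer operand exactly.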
As the foregoing illustrates, what is needed in the art is a technique for performing integer division operations in software that uses the hardware resources available on conventional processors more efficiently than prior art approaches.