This invention is generally related to high accuracy and low latency techniques for computing mathematical expressions of the form (A/B)^K using a data processor having parallel floating-point arithmetic units in hardware.
Most general purpose data processors have only a very basic hardware arithmetic capability, e.g. add, multiply, and divide. Thus, computing a transcendental function such as arctan(x) requires rewriting the function in terms of these basic arithmetic operations, so that the processor can execute the function. Conventional software math libraries contain subroutines, which are typically written in the assembly language of a particular processor, that are optimized to compute a function with high accuracy yet using only the basic arithmetic operations. The methodology of the subroutine is also designed to take advantage of any parallel floating-point processing capability in the processor. For instance, if the rewritten form of the function, sometimes referred to as a series expansion, has multiple instances of the type (A+B), then independent (A+B) parts of the expansion can be placed in two or more instructions that will be executed simultaneously, thereby reducing the latency of computing the function.
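As an illustration of the rewriting described above, the following is a minimal sketch, not any particular library's routine. It approximates arctan(x) for small |x| by its truncated power series, using only multiply and add at evaluation time, and splits the series into two independent chains that parallel floating-point units could work on at the same time. The names `COEFFS`, `horner`, and `arctan_series`, the choice of 16 terms, and the even/odd split are all assumptions made for the example.

```python
import math

# Series: arctan(x) = x * (c0 + c1*t + c2*t^2 + ...), with t = x*x and
# c_k = (-1)^k / (2k+1). The coefficients are constants, so the divisions
# below happen once when the table is built, not during evaluation.
N_TERMS = 16  # assumed truncation length, adequate only for small |x|
COEFFS = [(-1.0) ** k / (2 * k + 1) for k in range(N_TERMS)]


def horner(coeffs, t):
    """Evaluate a polynomial in t using only multiplies and adds."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c  # one multiply-add per coefficient
    return acc


def arctan_series(x):
    """Approximate arctan(x) for small |x| with add and multiply only."""
    t = x * x
    u = t * t
    # The two chains below do not depend on each other, so hardware with
    # two floating-point units could evaluate them simultaneously.
    even = horner(COEFFS[0::2], u)  # c0 + c2*u + c4*u^2 + ...
    odd = horner(COEFFS[1::2], u)   # c1 + c3*u + c5*u^2 + ...
    return x * (even + t * odd)


if __name__ == "__main__":
    x = 0.25
    print(arctan_series(x), math.atan(x))
```

Each chain is half as long as a single Horner evaluation of the full series, which is where the latency saving would come from on a machine that can issue both chains in parallel.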
Expressions of the form (A/B)^K, where A and B are real numbers and K is an integer, often need to be computed as part of software-implemented mathematical functions. However, some modern machines, such as the ITANIUM processor by Intel Corp., do not support the division operation A/B in hardware. The ITANIUM processor supports fused multiply add (FMA) floating-point operations of the form A×B+C. In addition, this processor has multiple floating-point units in hardware for parallel instruction execution, and is an example of an explicitly parallel instruction computing (EPIC) machine in which two floating-point arithmetic operations, two memory access operations, and two integer arithmetic operations can be executed in parallel.
A conventional technique for computing (A/B)^K on a machine such as the ITANIUM processor that does not support division may include the following three steps: (1) reciprocal calculation R=1/B, by first using the well-known approximate reciprocal operator R0=frcpa(B)=(1/B)(1+Δ) and then applying an iterative process to refine the approximation R0 to obtain the needed accuracy in R, (2) quotient calculation Q=A×R, and (3) power calculation Q^K. All three steps can be performed with no divisions, only multiply and add operations.
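The three conventional steps can be sketched in software as follows. This is an illustrative model under stated assumptions, not the actual instruction sequence used on any processor. The function `approx_reciprocal` is a hypothetical stand-in for a hardware operator such as frcpa and uses a division only to simulate the low-precision hardware result. The refinement is assumed to be a Newton-Raphson iteration, which the text above does not name. K is assumed to be a non-negative integer.

```python
import math


def approx_reciprocal(b, bits=8):
    """Hypothetical stand-in for a hardware reciprocal approximation.

    Returns (1/b)*(1+delta) with |delta| at most about 2**-bits. The
    division here only simulates what the hardware would look up.
    """
    m, e = math.frexp(1.0 / b)                 # 1/b = m * 2**e, 0.5 <= |m| < 1
    m = round(m * (1 << bits)) / (1 << bits)   # keep roughly `bits` bits
    return math.ldexp(m, e)


def refine_reciprocal(b, r0, steps=3):
    """Assumed Newton-Raphson refinement using multiply-add shaped steps."""
    r = r0
    for _ in range(steps):
        e = 1.0 - b * r   # residual error of the current estimate
        r = r + r * e     # relative error is roughly squared each pass
    return r


def int_power(q, k):
    """q**k for a non-negative integer k by square-and-multiply."""
    result = 1.0
    while k > 0:
        if k & 1:
            result *= q
        q *= q
        k >>= 1
    return result


def div_power_conventional(a, b, k):
    """Compute (a/b)**k in the conventional three steps."""
    r0 = approx_reciprocal(b)      # step 1: rough reciprocal ...
    r = refine_reciprocal(b, r0)   # ... refined to working accuracy
    q = a * r                      # step 2: quotient
    return int_power(q, k)         # step 3: power


if __name__ == "__main__":
    print(div_power_conventional(3.0, 7.0, 5), (3.0 / 7.0) ** 5)
```

In this sketch each step consumes the result of the one before it, so the power calculation cannot begin until the reciprocal refinement has finished.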
The overall latency of computing (A/B)^K is dominated by the first and third steps, i.e. the reciprocal and power calculation steps. In machines that have parallel floating-point units, the power calculation can be optimized to take advantage of such parallelism. However, before the power calculation can be performed, the reciprocal calculation must first be completed, such that it is said to lie in the "critical path" of the overall calculation. This conventional reciprocal calculation is very time consuming, particularly because of the complex iterative procedure needed to enhance the accuracy of the approximate reciprocal. Thus, the combination of the conventional reciprocal calculation in the first step and the power calculation in the third step severely limits the ability to shorten the latency of the overall calculation.