The fast and accurate evaluation of algebraic and transcendental functions such as inverse square root, cube root, sine, cosine, exponential and logarithm is crucial in many fields of scientific computing. Algorithms that are most suitable for software implementation on modern computer architectures usually include three stages: Argument Reduction, Core Approximation, and Final Reconstruction. This well-accepted approach is discussed by Cody, W. J., Jr. and Waite, W., in Software Manual for the Elementary Functions, Prentice-Hall, 1980, and by Muller, J.-M., in Elementary Functions: Algorithms and Implementation, Birkhäuser, 1997.

The three-stage approach is best illustrated by a simplified example. Consider the calculation of the exponential function exp(X). The range of the input argument X is so large that a simple series expansion in X cannot practically deliver the required accuracy for all X. Using the conventional three-stage approach, exp(X) is calculated as follows:

Argument Reduction: Calculate N := nearest_integer(X/log(2)) and R := X − N×log(2). At the end of this step, |R| ≤ log(2)/2.

Core Approximation: Instead of calculating exp(X) directly, where a simple series expansion does not work, exp(R) is calculated using a simple series (polynomial) approximation. A simple series works here because the magnitude of R is bounded.

Final Reconstruction: The desired value exp(X) is computed from N and exp(R) using the mathematical relationship exp(X) = exp(N×log(2) + R) = exp(N×log(2)) × exp(R) = 2^N × exp(R).
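The three stages of the exp(X) example can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name exp_three_stage, the use of a truncated Taylor series as the core polynomial, and the default of 14 terms are choices made here for illustration and are not taken from the cited references. Overflow, NaN and infinite inputs are deliberately not handled.

```python
import math

LN2 = math.log(2.0)  # the constant log(2) used by the reduction step


def exp_three_stage(x: float, terms: int = 14) -> float:
    """Illustrative three-stage evaluation of exp(x) (hypothetical helper)."""
    # Stage 1, Argument Reduction: N := nearest_integer(X/log(2)),
    # R := X - N*log(2), so that |R| is at most about log(2)/2.
    n = round(x / LN2)
    r = x - n * LN2

    # Stage 2, Core Approximation: truncated Taylor series for exp(R),
    # evaluated in nested (Horner-like) form. The series choice and the
    # number of terms are assumptions made for this sketch.
    p = 1.0
    for k in range(terms, 0, -1):
        p = 1.0 + r * p / k

    # Stage 3, Final Reconstruction: exp(X) = 2**N * exp(R).
    return math.ldexp(p, n)
```

For moderate arguments the result should agree with math.exp to within a few units in the last place. A production routine would typically replace the Taylor series with a minimax polynomial and carry out the reduction in extended precision.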
On an architecture with abundant parallelism, as found in more recent CPU designs such as, but not limited to, the Itanium(R) microprocessor available from Intel Corporation, the bottleneck among these three stages is the initial argument reduction stage. The reason is that the reduction stage is usually composed of a sequence of dependent (or serial) calculations in which parallelism cannot be exploited. The approximation stage usually consists of the evaluation of polynomials, for which parallelism can be exploited via well-known methods such as those discussed in Knuth, D. E., The Art of Computer Programming, vol. 2: Seminumerical Algorithms, Addison-Wesley, 1969, and in Muller, J.-M., Elementary Functions: Algorithms and Implementation, Birkhäuser, 1997. The reconstruction step usually consists of simple calculations such as one multiplication, or one multiplication followed by an addition. The components needed for those simple calculations (such as 2^N in the exp example above) can be computed during (in parallel with) the approximation stage. The consequence is that, in efficient implementations of the commonly encountered algebraic and transcendental functions on systems with parallelism, the argument reduction stage can contribute a considerable percentage of the total latency. This is disproportionate, because the amount of computational work in the reduction stage is usually a fraction of that involved in the approximation stage. Further, the reduction stage usually requires some constants to be loaded from memory, which also slows down the execution of the algorithm.
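The contrast between a serial dependency chain and a polynomial evaluation that exposes parallelism can be made concrete with a small sketch. Estrin's scheme is used here as one commonly cited example of such a method; it is chosen for illustration, and the text above does not say which specific method an implementation would use. The function names horner and estrin are hypothetical. Python itself runs both functions serially, so the code only shows the dependency structure, not an actual speedup.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with Horner's rule.

    Every step depends on the previous one, so the operations form
    a single serial chain.
    """
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial with Estrin's scheme.

    At each level, adjacent coefficients are combined pairwise. The
    operations within one level do not depend on each other, so
    hardware with several arithmetic units could issue them together.
    """
    level = list(coeffs)
    if not level:
        return 0.0
    power = x  # holds x, x**2, x**4, ... from level to level
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad so that every element has a partner
        level = [level[i] + level[i + 1] * power
                 for i in range(0, len(level), 2)]
        power = power * power
    return level[0]
```

For a polynomial with n coefficients, Horner's rule needs about n dependent multiply-add steps, while Estrin's scheme needs about log2(n) dependent levels. The argument reduction stage has no comparable restructuring, since each of its steps consumes the result of the one before it.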