As semiconductor technology continues to inch closer to practical limits on increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is in the area of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of the earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
Dependencies have been found to adversely affect a number of different types of programs that are executed by an execution unit. For example, refinement algorithms that operate iteratively to calculate the result of a mathematical function often incorporate dependencies that can limit the performance of such algorithms. An iterative refinement algorithm, which may be used to find the result of a number of different types of mathematical functions, repetitively performs mathematical calculations that approximate a given mathematical function over multiple iterations to progressively approach, or converge to, the desired result with a required accuracy. One common iterative refinement algorithm is the “Newton-Raphson” method, which approximates a function by its tangent line at the previous approximation. The derivation is shown below:
$$\text{slope of } f(x_n) = \frac{\Delta y}{\Delta x} = f'(x_n) = \frac{f(x_n) - f(x_{n+1})}{x_n - x_{n+1}}$$
where n is the iteration number, f(x) is the function desired, and f′(x) is the first derivative of that function.
The Newton-Raphson method is often used to find the reciprocal of a number, since fully accurate reciprocal functions are often costly to implement in hardware due to their long latency, complexity and large circuit area. Plugging the reciprocal function into this equation yields:
$$-\frac{1}{b^2} = \frac{\frac{1}{b} - \frac{1}{B}}{b - B}$$
where B is the value passed into the reciprocal function and b is its approximation. This reduces to:
$$\frac{1}{B} = -\frac{B}{b^2} + \frac{2}{b} = \frac{1}{b}\left(1 - \frac{B}{b}\right) + \frac{1}{b}$$
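As a worked restatement of the last expression, the following sketch substitutes r_n = 1/b for the current reciprocal estimate and defines e_n as the remaining error term. The symbols r_n and e_n are borrowed from the register names used in the assembly code and are an assumption of this sketch rather than notation from the derivation itself.

$$e_n = 1 - B\,r_n, \qquad r_{n+1} = r_n\,(1 - B\,r_n) + r_n = r_n e_n + r_n$$

$$e_{n+1} = 1 - B\,r_{n+1} = 1 - B\,r_n\,(1 + e_n) = 1 - (1 - e_n)(1 + e_n) = e_n^2$$

In exact arithmetic the error term is therefore squared on every iteration, so the number of correct bits roughly doubles each time. Each update maps onto one multiply-subtract to form e_n followed by one multiply-add to form r_{n+1}.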
Table I below illustrates exemplary POWERPC assembly code for implementing this method over three iterations, where B is the operand of the reciprocal function, and rn is the result of the reciprocal function, with increasing numbers denoting higher accuracy with each iteration:
TABLE I
Newton-Raphson POWERPC Assembly Code
fres    r0, B              # r0 = estimate 1/B
fnmsub  e0, r0, B, one     # e0 = 1 − (B * r0)
fmadd   r1, r0, e0, r0     # r1 = r0 * e0 + r0
fnmsub  e1, r1, B, one     # e1 = 1 − (B * r1)
fmadd   r2, r1, e1, r1     # r2 = r1 * e1 + r1
fnmsub  e2, r2, B, one     # e2 = 1 − (B * r2)
fmadd   r3, r2, e2, r2     # r3 = r2 * e2 + r2
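The sequence in Table I can be sketched in a high-level language to make the data flow easier to follow. The Python below is a minimal illustration, not the POWERPC implementation: `reciprocal_estimate` is a hypothetical stand-in for fres that rounds 1/B to about 10 significant bits, `newton_reciprocal` is a hypothetical helper name, and the fused fnmsub and fmadd instructions are approximated with separately rounded multiplies and adds, so the last bit of a result may differ from what fused hardware would produce.

```python
import math


def reciprocal_estimate(b, bits=10):
    """Hypothetical stand-in for fres: 1/b rounded to about `bits` bits."""
    mantissa, exponent = math.frexp(1.0 / b)  # mantissa lies in [0.5, 1)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def newton_reciprocal(b, iterations=3):
    """Return [r0, r1, ..., rN], the successive estimates of 1/b."""
    r = reciprocal_estimate(b)  # fres   r0, B
    estimates = [r]
    for _ in range(iterations):
        e = 1.0 - b * r         # fnmsub eN, rN, B, one (needs rN)
        r = r * e + r           # fmadd  rN+1, rN, eN, rN (needs eN)
        estimates.append(r)
    return estimates


estimates = newton_reciprocal(1.019)
for n, r in enumerate(estimates):
    print(f"r{n} = {r!r}")
```

Each pass through the loop reads the value written by the statement immediately before it, which is the same chain of dependencies that the assembly sequence carries from one instruction to the next.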
It should be noted that, in each iteration, the fmadd instruction is dependent upon the fnmsub instruction, because the value for e0, which is calculated by the fnmsub instruction, must be calculated before it can be used as an input to the fmadd instruction. Consequently, each fmadd instruction is required to stall until the result of the immediately preceding fnmsub instruction is available. Similarly, each fnmsub instruction is dependent upon either the fres instruction (for the first iteration) or the fmadd instruction from the preceding iteration due to the use of the result of the prior iteration in the calculations for the next iteration. In a multi-stage execution pipeline that requires a dependent instruction to start executing no earlier than the fourth cycle after its previous instruction, as an example, each iteration of the algorithm may therefore introduce as many as four bubbles in the pipeline, delaying the completion of the algorithm and reducing the processing efficiency of the execution unit.
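The stall behavior described above can be illustrated with a small issue-schedule model. This is a sketch under stated assumptions, not a model of any particular POWERPC pipeline: it assumes a single-issue, in-order pipeline, and the hypothetical parameter `issue_gap=3` means a dependent instruction issues three cycles after its producer, leaving two idle cycles between them. That reading is chosen because it reproduces the figure of four bubbles per iteration. The names `schedule` and `TABLE_I` are illustrative.

```python
def schedule(instructions, issue_gap=3):
    """Return the issue cycle of each (dest, sources) instruction.

    Assumes single-issue, in-order execution; a consumer may issue no
    earlier than `issue_gap` cycles after the instruction producing
    any of its source registers.
    """
    ready = {}       # register name -> earliest cycle a consumer may issue
    cycles = []
    next_free = 0    # next cycle in which the issue slot is available
    for dest, sources in instructions:
        cycle = max([next_free] + [ready.get(s, 0) for s in sources])
        cycles.append(cycle)
        ready[dest] = cycle + issue_gap
        next_free = cycle + 1
    return cycles


# Destination and source registers of the seven instructions in Table I.
TABLE_I = [
    ("r0", ("B",)),               # fres
    ("e0", ("r0", "B", "one")),   # fnmsub
    ("r1", ("r0", "e0")),         # fmadd
    ("e1", ("r1", "B", "one")),   # fnmsub
    ("r2", ("r1", "e1")),         # fmadd
    ("e2", ("r2", "B", "one")),   # fnmsub
    ("r3", ("r2", "e2")),         # fmadd
]

cycles = schedule(TABLE_I)
bubbles = (cycles[-1] + 1) - len(TABLE_I)  # idle issue slots in the span
print(cycles, bubbles)
```

Under these assumptions every instruction waits on its immediate predecessor, so the seven instructions occupy nineteen issue slots and twelve of them are bubbles, four for each of the three iterations.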
Compounding the performance problem raised by dependencies, the Newton-Raphson method, like other iterative refinement algorithms, sometimes obtains a result that has reached the desired accuracy before the maximum number of iterations has completed. Tables II and III below, for example, present two simplified examples that use the Newton-Raphson method to find the reciprocal of a double-precision floating point number. In these examples, fres, the POWERPC floating point reciprocal estimate function, is assumed to be a 10-bit accuracy version, while fdiv is the POWERPC floating point divide function, illustrating the value to which the algorithm is attempting to converge:
TABLE II
Newton-Raphson Example A
B = 1.019 = 0x3FF04DD2F1A9FBE7
1/B = fdiv(1,B) =       3FEF67411155AB17    0.981354
r0 = fres(B) =          3F7B40003F826E98    0.006653         (1/B)
t  = fnmsub(r0,B,1) =   BF1851EB851E9DB0    −0.0000927734    (1 − (B * r0))
r1 = fmadd(r0,t,r0) =   3FEF67410CCCCCCE    0.981354         (r0 * t + r0)
e1 = fnmsub(r1,B,1) =   3E427BB2FD3570E1    0.000000         (1 − (B * r1))
r2 = fmadd(r1,e1,r1) =  3FEF67411155AB16    0.981354         (r1 * e1 + r1)
e2 = fnmsub(r2,B,1) =   3C95F5416ADC1A4C    0.000000         (1 − (B * r2))
r3 = fmadd(r2,e2,r2) =  3FEF67411155AB17    0.981354         (r2 * e2 + r2)
TABLE III
Newton-Raphson Example B
B = 1.02 = 0x3FF051EB851EB852
1/B = fdiv(1,B) =       3FEF5F5F5F5F5F5F    0.980392
r0 = fres(B) =          3F7B00003F828F5C    0.006592         (1/B)
t  = fnmsub(r0,B,1) =   BF147AE147AE1980    −0.0000781250    (1 − (B * r0))
r1 = fmadd(r0,t,r0) =   3FEF5F5F5C28F5C2    0.980392         (r0 * t + r0)
e1 = fnmsub(r1,B,1) =   3E3A36E2EE6CD33A    0.000000         (1 − (B * r1))
r2 = fmadd(r1,e1,r1) =  3FEF5F5F5F5F5F5F    0.980392         (r1 * e1 + r1)
e2 = fnmsub(r2,B,1) =   3C7C8FC2F6295C90    0.000000         (1 − (B * r2))
r3 = fmadd(r2,e2,r2) =  3FEF5F5F5F5F5F5F    0.980392         (r2 * e2 + r2)
Table II shows an example where, in order to achieve the desired accuracy, three iterations of the method are needed. It should be noted, however, that for Example B in Table III, the desired accuracy is achieved after only two iterations. As a result, if the algorithm is executed through the full three iterations, the result of the algorithm is still not available until completion of all three iterations. In addition, the last iteration still introduces the aforementioned dependencies, thus further delaying the completion of the algorithm.
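The difference between the two examples can be reproduced with a short sketch that counts how many iterations pass before the estimate matches the divide result. Several choices here are assumptions rather than part of the examples: the helper names `fma`, `from_hex` and `iterations_to_converge` are hypothetical, the fused multiply-add is emulated with exact rational arithmetic followed by a single rounding, "desired accuracy" is taken to mean bit-for-bit equality with 1.0 / B, and the starting estimates 1005/1024 and 1004/1024 are inferred from the tabulated values of t rather than produced by a real fres instruction.

```python
import struct
from fractions import Fraction


def fma(a, b, c):
    """Emulate a fused multiply-add: a*b + c rounded once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def from_hex(bits):
    """Decode a 16-digit hex string as an IEEE 754 double."""
    return struct.unpack(">d", bytes.fromhex(bits))[0]


def iterations_to_converge(b, r0, max_iterations=3):
    """Return (n, r): the first iteration n whose result equals 1/b."""
    target = 1.0 / b              # stand-in for fdiv(1, B)
    r = r0
    for n in range(1, max_iterations + 1):
        e = fma(-b, r, 1.0)       # fnmsub: 1 - (B * r)
        r = fma(r, e, r)          # fmadd:  r * e + r
        if r == target:
            return n, r
    return None, r                # did not converge within the limit


example_a = iterations_to_converge(from_hex("3FF04DD2F1A9FBE7"), 1005 / 1024)
example_b = iterations_to_converge(from_hex("3FF051EB851EB852"), 1004 / 1024)
print(example_a[0], example_b[0])
```

The comparison inside the loop is exactly the kind of compare-and-branch step that an early exit would require after each iteration.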
In the situation where a desired result is reached in fewer than the full number of iterations, an opportunity exists for an “early exit” from the algorithm. However, in many conventional microprocessor designs, the algorithm is used in microcode or in a sequencer unit to perform division. Oftentimes, even if an early exit condition is possible, the procedure is not designed to handle it, because approaches such as including compares and branches in the routine add too much complexity or too much cycle time overhead, causing the performance of the overall routine to drop. Particularly in scenarios where it is known that the desired accuracy can be achieved in three or four iterations in most if not all cases, the overhead associated with comparing and branching out of a loop prematurely exceeds the potential benefit of supporting an early exit from the routine.
Consequently, a need exists in the art for a manner of improving the performance of iterative refinement algorithms, and in particular, for a manner of improving the performance of iterative refinement algorithms executed by execution units having multi-stage execution pipelines.