1. Technical Field
The present invention is related generally to the architecture and instruction sets of processors, such as a microprocessor, microcontroller, or digital signal processor. More specifically, the present invention is directed to a processor having efficient instructions for estimating the values of certain floating-point functions.
2. Background Art
Many processor architectures, such as the PowerPC™ processor architecture, support estimate instructions for reciprocal and reciprocal square root as an extension of a fused multiply-add floating-point unit (FPU). For such estimate instructions, the primary design goals are twofold. First, the estimate should be of relatively high precision, so that a single iteration step of a numerical approximation algorithm, such as Newton-Raphson, yields full single-precision accuracy or comes close to it. Second, it should be possible to implement the estimate instructions with little hardware overhead and with little impact on the processor's cycle time and pipeline structure. In particular, the design should not increase the pipeline depth of the FPU for any non-estimate instruction.
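The first design goal can be illustrated with a minimal sketch of one Newton-Raphson step applied to a reciprocal estimate and to a reciprocal square root estimate. The function names and the assumed estimate error of 2^-13 are illustrative choices, not taken from any particular processor. Under that assumption, one step roughly squares the relative error, which brings it below the 2^-24 resolution of a single-precision significand.

```python
def refine_reciprocal(a, x0):
    """One Newton-Raphson step for 1/a, starting from the estimate x0.

    If x0 = (1/a) * (1 + e), the result is (1/a) * (1 - e*e).
    """
    return x0 * (2.0 - a * x0)


def refine_rsqrt(a, y0):
    """One Newton-Raphson step for 1/sqrt(a), starting from the estimate y0.

    If y0 = (1/sqrt(a)) * (1 + e), the relative error of the result
    is approximately 1.5 * e*e.
    """
    return y0 * (1.5 - 0.5 * a * y0 * y0)


# Hypothetical estimate quality: relative error of 2^-13 (an assumption).
e = 2.0 ** -13

a = 3.0
x1 = refine_reciprocal(a, (1.0 / a) * (1.0 + e))
recip_error = abs(a * x1 - 1.0)      # about e*e = 2^-26

b = 4.0                              # sqrt(b) is exactly 2.0
y1 = refine_rsqrt(b, (1.0 + e) / 2.0)
rsqrt_error = abs(2.0 * y1 - 1.0)    # about 1.5 * e*e
```

Both update formulas consist only of multiplications and additions, which is why such a refinement step fits naturally onto a fused multiply-add FPU.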
There are a number of different ways in which such an estimate instruction might be implemented. One way is to simply look up the estimate in a table. The usefulness of this technique is limited, however, because the achievable precision is bounded by the size of the table. Achieving a desirable level of accuracy would require a very large table, which would be expensive in terms of the hardware needed to store it.
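The relationship between table size and precision for a direct lookup can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the table is indexed by the leading fraction bits of a significand m in [1, 2), and each entry holds the reciprocal of its interval's midpoint in full double precision, whereas real hardware would store truncated values.

```python
def build_direct_table(index_bits):
    """Table of reciprocals, one entry per interval of [1, 2)."""
    n = 1 << index_bits
    # Store the reciprocal of each interval's midpoint.
    return [1.0 / (1.0 + (i + 0.5) / n) for i in range(n)]


def direct_lookup_recip(table, m):
    """Estimate 1/m for m in [1, 2) by table lookup alone."""
    n = len(table)
    i = min(int((m - 1.0) * n), n - 1)
    return table[i]


def max_rel_error(index_bits, samples=20000):
    """Largest relative error observed over evenly spaced samples."""
    table = build_direct_table(index_bits)
    worst = 0.0
    for k in range(samples):
        m = 1.0 + k / samples
        worst = max(worst, abs(direct_lookup_recip(table, m) * m - 1.0))
    return worst
```

In this model the relative error is bounded by 2^-(index_bits + 1), so each additional bit of precision doubles the number of table entries. Reaching an error near 2^-13, for example, would take on the order of 2^12 entries.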
A conventional implementation for such estimate instructions therefore consists of two steps: First, a table lookup provides a base value and a slope. Then, the base and slope values are used to linearly interpolate an estimate with the desired precision. Since the table lookup is followed by an interpolation step, the results of the table lookup can have a low precision, and therefore the required table is much smaller than would be necessary for a direct table lookup without interpolation.
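The two-step procedure can be sketched as follows for the reciprocal of a significand m in [1, 2). The names, the choice of a chord through the interval endpoints as the interpolating line, and the 32-entry table size are assumptions made for illustration. An actual implementation would store reduced-precision fixed-point values and might choose the base and slope differently.

```python
def build_table(index_bits):
    """Build (base, slope) pairs, one per interval of [1, 2)."""
    n = 1 << index_bits
    table = []
    for i in range(n):
        lo = 1.0 + i / n
        hi = 1.0 + (i + 1) / n
        base = 1.0 / lo                  # function value at the left edge
        slope = (1.0 / hi - base) * n    # chord slope over the interval
        table.append((base, slope))
    return table


def estimate_recip(table, m):
    """Estimate 1/m for m in [1, 2): table lookup, then interpolation."""
    n = len(table)
    scaled = (m - 1.0) * n
    i = min(int(scaled), n - 1)          # step 1: index from leading bits
    offset = (scaled - i) / n            # distance from the left edge
    base, slope = table[i]
    return base + slope * offset         # step 2: linear interpolation


table = build_table(5)                   # 32 entries (assumed size)
approx = estimate_recip(table, 1.3)
rel_error = abs(approx * 1.3 - 1.0)
```

With these assumptions, the 32-entry table keeps the relative error below 2^-12, whereas a direct lookup of comparable accuracy would need thousands of entries. The interpolation costs one small multiplication and one addition.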
In this two-step procedure, the interpolation can be executed either on the general-purpose FPU hardware of the processor or on specialized hardware added for that purpose. When the general-purpose FPU datapaths are used, the estimate instruction has a longer latency than a basic fused multiply-add instruction. This adds complexity to the processor's control logic, because the latency of the FPU then varies with the instruction type. Some existing implementations avoid this complexity at the expense of performance by assuming a single FPU latency and stalling execution for the additional cycles while an estimate instruction executes. Furthermore, the longer latency can cause significant hardware overhead in the instruction issue and dependency check hardware.
As suggested above, the interpolation step does not require a full general-purpose FPU. Instead, it can be executed with a multiplier of reduced size, an adder, and some additional logic. With this specialized hardware, the interpolation step can be processed much more quickly than with a general-purpose FPU, so that the latency of the estimate instruction approaches that of a regular multiply-add instruction. The obvious drawback of this solution is the extra hardware required to speed up the interpolation step.
What is needed, therefore, is a processor design in which floating-point function estimate instructions can be implemented without incurring significant costs in terms of performance and hardware complexity. The present invention provides a solution to these and other problems, and offers other advantages over previous solutions.