1. Field of the Invention
The present invention relates to computational functional units of computers, controllers, and processors. More specifically, the present invention relates to functional units that execute square root and reciprocal square root operations.
2. Description of the Related Art
Computer systems have evolved into versatile systems with a vast range of utility, including demanding applications such as multimedia, high-bandwidth network communications, signal processing, and the like. Accordingly, general-purpose computers are called upon to handle large volumes of data rapidly. Much of this data handling, particularly for video playback, voice recognition, speech processing, three-dimensional graphics, and the like, involves computations that must be executed quickly and with short latency.
One technique for executing computations rapidly while handling large data volumes is to include multiple computation paths in a processor. Each path includes hardware for performing computations, so multiple computations may be performed in parallel. However, including multiple computation units greatly increases the size of the integrated circuits implementing the processor. What is needed in a computation functional unit are computation techniques and integrated circuits that operate at high speed while consuming only a small amount of integrated circuit area.
Processor and computer execution time improves with faster data computation, so the computer industry constantly strives to improve the speed of execution units that process mathematical functions. Computational operations are typically performed through iterative processing, look-up of values in large-capacity tables, or a combination of table accesses and iterative processing. In conventional systems, a function of one or more variables is executed by using part of a variable's value as an address into a large-capacity table, which returns either an initial value of the function or a numeric value used in the computation. The computation then proceeds at high speed using the retrieved value. Table look-up techniques thus increase the execution speed of computational functional units, but the speed gain comes at the expense of substantial integrated circuit area and power.
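As a minimal illustration of addressing a table with part of a variable's value, the Python sketch below approximates 1/sqrt(x) for x in [1, 2) by using the leading fraction bits of x as a table index. The names `K`, `TABLE`, and `table_rsqrt`, the 64-entry table size, and the midpoint sampling are assumptions for illustration, not details of any described system.

```python
import math

K = 6  # hypothetical: number of leading fraction bits used as the table address

# 1/sqrt sampled at the midpoint of each of the 2**K equal intervals of [1, 2).
TABLE = [1.0 / math.sqrt(1.0 + (i + 0.5) / 2**K) for i in range(2**K)]


def table_rsqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for x in [1, 2) with a single table look-up."""
    if not 1.0 <= x < 2.0:
        raise ValueError("x must lie in [1, 2)")
    index = int((x - 1.0) * 2**K)  # leading K fraction bits form the address
    return TABLE[index]
```

A single look-up like this is fast but coarse (relative error below about 0.4% for this 64-entry table); more precision requires a larger table or further computation on the retrieved value.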
Two operations that are especially burdensome and difficult to implement in silicon are square root and reciprocal square root, which typically require many clock cycles and a large integrated circuit area. For example, square root and reciprocal square root instructions often take several tens of clock cycles to execute.
For example, one technique for computing a square root or inverse square root is the iterative Newton-Raphson method, seeded with an approximate value read from a lookup table. Hardware for computing the square root or inverse square root includes a multiply/add unit, and the iterative technique makes multiple passes through that unit. Computation units using the Newton-Raphson method therefore typically take many clock cycles to perform square root and inverse square root operations.
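The Newton-Raphson approach can be sketched in Python as follows, assuming a small seed table over mantissas in [1, 4) and a fixed count of four iterations; `SEED_BITS`, `SEED_TABLE`, and `rsqrt_newton` are hypothetical names. Each loop iteration corresponds to one pass through a multiply/add unit, which is why the iterative method costs many clock cycles.

```python
import math

SEED_BITS = 5  # hypothetical: 2**5 seed entries covering mantissas in [1, 4)
SEED_TABLE = [
    1.0 / math.sqrt(1.0 + 3.0 * (i + 0.5) / 2**SEED_BITS) for i in range(2**SEED_BITS)
]


def rsqrt_newton(x: float, iterations: int = 4) -> float:
    """Approximate 1/sqrt(x) by Newton-Raphson refinement of a table seed."""
    if not x > 0.0:
        raise ValueError("x must be positive")
    m, e = math.frexp(x)          # x = m * 2**e with m in [0.5, 1)
    if e % 2:                     # e odd: m -> [1, 2), e becomes even
        m, e = m * 2.0, e - 1
    else:                         # e even: m -> [2, 4), e stays even
        m, e = m * 4.0, e - 2
    # Seed from the interval of [1, 4) that contains m.
    y = SEED_TABLE[int((m - 1.0) / 3.0 * 2**SEED_BITS)]
    for _ in range(iterations):
        # One pass through a multiply/add unit: y <- y * (3 - m*y*y) / 2
        y = y * (1.5 - 0.5 * m * y * y)
    return math.ldexp(y, -e // 2)  # scale by 2**(-e/2); e is even
```

For example, `rsqrt_newton(4.0)` returns 0.5 to within rounding. Reducing the mantissa to [1, 4) keeps the remaining exponent even, so the final power-of-two scaling is exact; each iteration roughly squares the relative error of the seed.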
What is needed is a technique for executing square root and reciprocal square root operations, and a computation unit implementing that technique, that executes the operations in fewer clock cycles using less integrated circuit area.
A parallel fixed-point square root and reciprocal square root computation uses the same coefficient tables as the floating-point square root and reciprocal square root computation by converting the fixed-point numbers into a floating-point structure with a leading implicit 1. The value of a number X is stored as two fixed-point numbers. In one embodiment, the fixed-point numbers are converted to this special floating-point structure using a leading zero detector and a shifter. Following the square root or reciprocal square root computation, the floating-point result is shifted back into the two-entry fixed-point format. The shift count is determined by the number of leading zeros detected during the conversion from fixed-point to floating-point format.
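The conversion and shift-back steps can be sketched for a single unsigned 16-bit fixed-point value as follows. The 16-bit width, the interpretation of an integer v as the value v / 2**16, and rounding the shift count down to an even number (so that it halves exactly for a square root) are assumptions of this sketch; `math.isqrt` on the normalized value stands in for the coefficient-based floating-point datapath, and `normalize` and `sqrt_fixed` are hypothetical names.

```python
import math

WIDTH = 16  # hypothetical: unsigned fixed-point v represents v / 2**16 in [0, 1)


def normalize(v: int) -> tuple[int, int]:
    """Leading-zero detect and shift v so bit WIDTH-1 or WIDTH-2 is set.

    The shift count s is kept even so that it can be halved exactly."""
    if not 0 < v < 2**WIDTH:
        raise ValueError("v must be a nonzero WIDTH-bit value")
    lz = WIDTH - v.bit_length()  # leading zero detector
    s = lz - (lz % 2)            # round the shift down to an even count
    return v << s, s             # shifter


def sqrt_fixed(v: int) -> int:
    """Fixed-point sqrt: normalize, take the root of the normalized value,
    then shift back right by half of the normalization shift."""
    n, s = normalize(v)
    root = math.isqrt(n << WIDTH)  # stand-in for the floating-point datapath
    return root >> (s // 2)        # shift count derived from the leading zeros
```

Applying `sqrt_fixed` to each of the two fixed-point numbers held in X mirrors the two-entry format. Because the shift is kept even, the leading 1 lands in one of the top two bits rather than always in the top bit; this is a simplification of the sketch, not necessarily how the described hardware handles odd shift counts.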
The parallel fixed-point square root and reciprocal square root computation includes several operations. The fixed-point values are normalized into a floating-point format that includes a mantissa and an exponent. For both fixed-point values, the coefficients B and C are read from the storage that holds the A, B, and C coefficients used in the floating-point computation. A multiplication and an addition then yield the values $D_i = B_i x + C_i$ for both fixed-point numbers. The values $D_i$ are shifted right according to the value of the exponent and placed in fixed-point format.
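The steps above can be sketched in Python for the square root case, assuming 16-bit unsigned fixed-point values, a shared coefficient store indexed by mantissa intervals of width 2**-7 over [1, 4), and coefficients taken from a Taylor expansion at each interval's left edge. The sketch also assumes that x is the offset of the normalized mantissa within its table interval (written `h` in the code). `TABLE`, `sqrt_fixed_lane`, and `parallel_sqrt_fixed` are hypothetical names. The A coefficient is stored but not read on the fixed-point path, matching the description that only B and C are used.

```python
WIDTH = 16  # hypothetical: unsigned fixed-point v represents v / 2**16
K = 7       # hypothetical: intervals of width 2**-K over mantissas in [1, 4)

# Shared (A, B, C) store for sqrt, one entry per interval [x0, x0 + 2**-K).
# Taylor coefficients at x0 are an assumption of this sketch.
TABLE = [
    (-0.125 * x0**-1.5, 0.5 * x0**-0.5, x0**0.5)
    for x0 in (1.0 + i / 2**K for i in range(3 * 2**K))
]


def sqrt_fixed_lane(v: int) -> int:
    """One lane: normalize, evaluate D = B*h + C, shift back to fixed point."""
    if not 0 < v < 2**WIDTH:
        raise ValueError("v must be a nonzero WIDTH-bit value")
    lz = WIDTH - v.bit_length()       # leading zero detector
    s = lz - lz % 2                   # even shift keeps the exponent even
    m = (v << s) / 2**(WIDTH - 2)     # normalized mantissa in [1, 4)
    i = int((m - 1.0) * 2**K)         # high bits address the coefficient store
    h = m - (1.0 + i / 2**K)          # low bits: offset within the interval
    _, b, c = TABLE[i]                # fixed-point path reads only B and C
    d = b * h + c                     # D approximates sqrt(m), in [1, 2)
    return int(d * 2**(WIDTH - 1)) >> (s // 2)  # shift right by the exponent


def parallel_sqrt_fixed(pair: tuple[int, int]) -> tuple[int, int]:
    """Process both fixed-point values of one operand independently."""
    return sqrt_fixed_lane(pair[0]), sqrt_fixed_lane(pair[1])
```

In hardware the two lanes would run side by side; the sketch simply evaluates them one after the other. Dropping the A term leaves a first-order approximation whose error here is at most a quarter of an output LSB before truncation, so results stay within one unit of the exact truncated root.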
In accordance with an embodiment of the present invention, a computation unit includes a multiplier and an adder and accesses a storage storing coefficients for computing a piece-wise quadratic approximation. The computation unit further includes a controller controlling operations of the multiplier and adder, and controlling access of the coefficient storage.
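A piece-wise quadratic approximation of this kind can be sketched as a table of (A, B, C) triples, one per interval of [1, 2), evaluated with only a multiplier and an adder applied twice in Horner form. Taking the coefficients from a second-order Taylor expansion at each interval's left edge is an assumption of this sketch (a production design might instead use minimax-fitted coefficients), as are the 64-entry table size and the names `build_table`, `quad_approx`, `SQRT_TABLE`, and `RSQRT_TABLE`.

```python
K = 6  # hypothetical: 2**K table entries indexed by the leading K fraction bits


def build_table(p: float) -> list[tuple[float, float, float]]:
    """(A, B, C) approximating x**p on each interval [x0, x0 + 2**-K) of [1, 2),
    taken here from a second-order Taylor expansion at x0."""
    table = []
    for i in range(2**K):
        x0 = 1.0 + i / 2**K
        c = x0**p
        b = p * x0**(p - 1)
        a = 0.5 * p * (p - 1) * x0**(p - 2)
        table.append((a, b, c))
    return table


SQRT_TABLE = build_table(0.5)    # coefficients for sqrt(x)
RSQRT_TABLE = build_table(-0.5)  # coefficients for 1/sqrt(x)


def quad_approx(table: list[tuple[float, float, float]], m: float) -> float:
    """Evaluate A*h**2 + B*h + C for m in [1, 2) using two multiply-adds."""
    if not 1.0 <= m < 2.0:
        raise ValueError("m must lie in [1, 2)")
    i = int((m - 1.0) * 2**K)   # high bits select the table entry
    h = m - (1.0 + i / 2**K)    # low bits: offset within the interval
    a, b, c = table[i]
    t = a * h + b               # first multiply-add
    return t * h + c            # second multiply-add
```

With 64 intervals, this approximation stays within roughly 2e-6 relative error for both functions on [1, 2). In hardware, a controller would sequence the table read and the two multiply-adds.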
In one aspect, the storage holds coefficients that are used both for floating-point square root and reciprocal square root computations and for parallel fixed-point square root and reciprocal square root computations.
The described method and computation unit provide many advantages. Storing the coefficients in storage devices such as read-only memory (ROM) increases the speed of computation. Using the same storage and the same coefficients for both the floating-point reciprocal square root computation and the parallel fixed-point reciprocal square root computation halves the amount of storage allocated to these computations.