1. Field of the Invention
The present invention relates to computational and calculation functional units of computers, controllers and processors. More specifically, the present invention relates to functional units that execute square root and reciprocal square root operations.
2. Description of the Related Art
Computer systems have evolved into versatile systems with a vast range of utility including demanding applications such as multimedia, high-bandwidth network communications, signal processing, and the like. Accordingly, general-purpose computers are called upon to rapidly handle large volumes of data. Much of the data handling, particularly for video playback, voice recognition, speech processing, three-dimensional graphics, and the like, involves computations that must be executed quickly and with a short latency.
One technique for executing computations rapidly while handling the large data volumes is to include multiple computation paths in a processor. Each of the data paths includes hardware for performing computations so that multiple computations may be performed in parallel. However, including multiple computation units greatly increases the size of the integrated circuits implementing the processor. What are needed in a computation functional unit are computation techniques and computation integrated circuits that operate with high speed while consuming only a small amount of integrated circuit area.
Execution time in processors and computers is improved by high-speed data computations. The computer industry therefore constantly strives to improve the speed of execution units that process mathematical functions. Computational operations are typically performed through iterative processing techniques, look-up of information in large-capacity tables, or a combination of table accesses and iterative processing. In conventional systems, a mathematical function of one or more variables is executed by using a part of a value relating to a particular variable as an address to retrieve either an initial value of a function or a numeric value used in the computation from a large-capacity table information storage unit. A high-speed computation is executed by operations using the retrieved value. Table look-up techniques advantageously increase the execution speed of computational functional units. However, the increase in speed gained through table accessing is achieved at the expense of a large consumption of integrated circuit area and power.
Two instructions that are highly burdensome and difficult to implement in silicon are the square root instruction and the reciprocal square root instruction, which typically take many clock cycles and consume a large integrated circuit area. For example, the square root and the reciprocal square root often have execution times in the range of several tens of clock cycles.
For example, one technique for computing a square root function or an inverse square root function is the iterative Newton-Raphson method, seeded with an approximate value read from a lookup table. Hardware for computing the square root or inverse square root includes a multiply/add unit. The iterative technique includes multiple passes through the multiply/add unit. Computation units utilizing the Newton-Raphson method typically take many clock cycles to perform square root and inverse square root operations.
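The conventional approach described above can be sketched as follows. This is a minimal illustration only, not taken from any particular prior-art unit: the 16-entry seed table, the function names, and the iteration count are assumptions, and the mantissa is assumed to lie in [1, 2). Each loop pass stands for one trip through the multiply/add unit, which is why the cycle count grows with the number of iterations.

```python
import math

SEED_BITS = 4  # assumption: a 16-entry seed table indexed by the top mantissa bits

# Hypothetical seed table: 1/sqrt evaluated at the midpoint of each sub-interval of [1, 2).
SEED_TABLE = [1.0 / math.sqrt(1.0 + (i + 0.5) / 2**SEED_BITS)
              for i in range(2**SEED_BITS)]

def rsqrt_newton(m, iterations=3):
    """Approximate 1/sqrt(m) for a mantissa m in [1, 2) by Newton-Raphson."""
    index = int((m - 1.0) * 2**SEED_BITS)   # top bits of the mantissa select the seed
    y = SEED_TABLE[index]
    for _ in range(iterations):
        # Newton-Raphson step for f(y) = 1/y^2 - m: y <- y * (3 - m*y^2) / 2
        y = y * (1.5 - 0.5 * m * y * y)
    return y

def sqrt_newton(m, iterations=3):
    """Approximate sqrt(m) as m * (1/sqrt(m))."""
    return m * rsqrt_newton(m, iterations)
```

Each iteration roughly squares the relative error of the seed, so a coarse table needs several dependent passes before the result reaches full precision.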
What are needed are a technique for executing square root and reciprocal square root operations, and a computation unit implementing the technique, that execute the operations in a reduced number of clock cycles using a reduced integrated circuit area.
A method of computing a square root or a reciprocal square root of a number in a computing device uses a piece-wise quadratic approximation of the number. The square root computation uses the piece-wise quadratic approximation in the form:
$\sqrt{X} = \bar{A}_i x^2 + \bar{B}_i x + \bar{C}_i$,
in each interval i.
The reciprocal square root computation uses the piece-wise quadratic approximation in the form:
$1/\sqrt{X} = A_i x^2 + B_i x + C_i$,
in each interval i.
The coefficients $\bar{A}_i$, $\bar{B}_i$, and $\bar{C}_i$ for the square root operation, and $A_i$, $B_i$, and $C_i$ for the reciprocal square root operation, are derived to reduce the mean square error, using a least-squares approximation over a plurality of equally-spaced points within an interval. In one embodiment, 256 equally-spaced intervals are defined to represent the 23 bits of the mantissa. The coefficients are stored in a storage and accessed during execution of the square root or reciprocal square root computation instruction.
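One way such coefficient tables might be derived is sketched below. This is an illustration under stated assumptions rather than the procedure of any embodiment: the mantissa X is assumed to lie in [1, 2), x is taken as the offset of X from the start of its interval, 16 sample points per interval is an arbitrary choice, and the names `fit_interval`, `solve3`, and `approx` are hypothetical. The fit is done in a normalized variable t in [0, 1) to keep the normal equations well conditioned, then rescaled to x.

```python
import math

NUM_INTERVALS = 256        # from the text: 256 equally-spaced intervals
POINTS_PER_INTERVAL = 16   # assumption: number of equally-spaced sample points

def solve3(m, r):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, r)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, 3):
            f = a[row][col] / a[col][col]
            for k in range(col, 4):
                a[row][k] -= f * a[col][k]
    sol = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        s = a[row][3] - sum(a[row][k] * sol[k] for k in range(row + 1, 3))
        sol[row] = s / a[row][row]
    return sol

def fit_interval(func, i):
    """Least-squares (A, B, C) with func(X0 + x) ~ A*x^2 + B*x + C on interval i."""
    x0 = 1.0 + i / NUM_INTERVALS
    ts = [k / POINTS_PER_INTERVAL for k in range(POINTS_PER_INTERVAL)]
    ys = [func(x0 + t / NUM_INTERVALS) for t in ts]
    s = [sum(t**p for t in ts) for p in range(5)]          # power sums of t
    m = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    r = [sum(y * t**p for t, y in zip(ts, ys)) for p in (2, 1, 0)]
    a, b, c = solve3(m, r)                                  # coefficients in t
    return a * NUM_INTERVALS**2, b * NUM_INTERVALS, c       # rescale t -> x

SQRT_COEFFS = [fit_interval(math.sqrt, i) for i in range(NUM_INTERVALS)]
RSQRT_COEFFS = [fit_interval(lambda v: 1.0 / math.sqrt(v), i)
                for i in range(NUM_INTERVALS)]

def approx(coeffs, mantissa):
    """Evaluate the piece-wise quadratic for a mantissa in [1, 2)."""
    i = int((mantissa - 1.0) * NUM_INTERVALS)    # upper bits select the interval
    x = mantissa - (1.0 + i / NUM_INTERVALS)     # lower-order bits
    a, b, c = coeffs[i]
    return a * x * x + b * x + c
```

For example, `approx(RSQRT_COEFFS, 1.2345)` evaluates the reciprocal square root approximation in interval 60. A hardware table would hold these values in fixed-point form; the floating-point tuples here are only for illustration.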
In a floating point square root or reciprocal square root computation, the value X designates the mantissa of a floating point number and x designates lower order bits of the mantissa. The technique includes accessing the $\bar{A}_i$, $\bar{B}_i$, and $\bar{C}_i$ coefficients or the $A_i$, $B_i$, and $C_i$ coefficients from storage and computing the value $\bar{A}_i x^2 + \bar{B}_i x + \bar{C}_i$ or $A_i x^2 + B_i x + C_i$. While computing the square root or reciprocal square root of the floating point number, the exponent of the result is shifted right. To avoid the error that occurs when an odd exponent is shifted right, dropping a "carry" bit, the computed result is multiplied by a correction constant designating a value $2^{0.5}$ or $(1/2)^{0.5}$.
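The exponent handling can be illustrated with a short sketch. It assumes an unbiased exponent e and a mantissa in [1, 2), uses `math.sqrt` as a stand-in for the quadratic mantissa approximation, and ignores zero, negative, infinite, and NaN inputs; the function names are hypothetical. Under this convention, shifting an odd e right by one bit drops a factor of $2^{0.5}$ from the square root and a factor of $(1/2)^{0.5}$ from the reciprocal square root, which the correction constant restores.

```python
import math

SQRT_TWO = 2.0 ** 0.5    # correction for the square root with an odd exponent
SQRT_HALF = 0.5 ** 0.5   # correction for the reciprocal square root with an odd exponent

def split(value):
    """Return (m, e) with value == m * 2**e and 1 <= m < 2."""
    frac, exp = math.frexp(value)     # value == frac * 2**exp, 0.5 <= frac < 1
    return frac * 2.0, exp - 1

def sqrt_with_exponent_fix(value, mantissa_sqrt=math.sqrt):
    m, e = split(value)
    result = mantissa_sqrt(m)         # stand-in for the quadratic approximation
    if e & 1:                         # odd exponent: the shift drops a bit
        result *= SQRT_TWO
    return math.ldexp(result, e >> 1)     # exponent shifted right (floor of e/2)

def rsqrt_with_exponent_fix(value, mantissa_rsqrt=lambda m: 1.0 / math.sqrt(m)):
    m, e = split(value)
    result = mantissa_rsqrt(m)        # stand-in for the quadratic approximation
    if e & 1:
        result *= SQRT_HALF
    return math.ldexp(result, -(e >> 1))  # negated shifted exponent
```

As a check on the arithmetic, 8.0 splits into m = 1.0 and e = 3, so the square root path returns 1.0 times $2^{0.5}$ scaled by $2^1$, which is the square root of 8.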
Several operations are performed in executing an embodiment of the computation method. In multiple data paths performing a plurality of operations in parallel, the coefficients are accessed from storage during calculation of the squared term of the lower order bits x. In a subsequent cycle, two multipliers are employed to calculate the $\bar{A}_i x^2$ or $A_i x^2$ term and the $\bar{B}_i x$ or $B_i x$ term. In a further subsequent cycle, the $\bar{A}_i x^2$ or $A_i x^2$ term, the $\bar{B}_i x$ or $B_i x$ term, and the $\bar{C}_i$ or $C_i$ coefficient are summed to form an approximation result while the exponent of the floating point number is shifted right and corrected for special value cases. In a subsequent cycle, the approximation result is multiplied by a correction constant designating a value $2^{0.5}$ or $(1/2)^{0.5}$.
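The cycle ordering described above can be sketched for the reciprocal square root as follows. This is a software illustration of the schedule only, with several assumptions: the function names are hypothetical, the coefficient lookup uses Taylor coefficients as a stand-in for the stored least-squares coefficients, the correction constant is taken to be 1.0 when the exponent is even, and special value handling is omitted.

```python
NUM_INTERVALS = 256  # from the text: 256 equally-spaced intervals

def lookup_rsqrt_coeffs(i):
    """Stand-in for the coefficient storage (assumption: Taylor coefficients
    of 1/sqrt about the interval start, not least-squares coefficients)."""
    x0 = 1.0 + i / NUM_INTERVALS
    a = 0.375 * x0 ** -2.5
    b = -0.5 * x0 ** -1.5
    c = x0 ** -0.5
    return a, b, c

def rsqrt_four_cycles(m, e):
    """Approximate 1/sqrt(m * 2**e) for a mantissa m in [1, 2) and integer e."""
    i = int((m - 1.0) * NUM_INTERVALS)     # upper mantissa bits: interval index
    x = m - (1.0 + i / NUM_INTERVALS)      # lower-order mantissa bits

    # Cycle 1: square x while the coefficients are read from storage.
    x_squared = x * x
    a, b, c = lookup_rsqrt_coeffs(i)

    # Cycle 2: two multipliers form the A*x^2 and B*x terms in parallel.
    term_a = a * x_squared
    term_b = b * x

    # Cycle 3: sum the three terms while the exponent is shifted right.
    approximation = term_a + term_b + c
    result_exponent = -(e >> 1)

    # Cycle 4: multiply by the correction constant.
    correction = 0.5 ** 0.5 if e & 1 else 1.0
    return approximation * correction * 2.0 ** result_exponent
```

Because cycles 1 through 4 have no loop between them, the latency is fixed by the pipeline depth rather than by a convergence count.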
In accordance with an embodiment of the present invention, a computation unit includes a multiplier and an adder and accesses a storage storing coefficients for computing a piece-wise quadratic approximation. The computation unit further includes a controller controlling operations of the multiplier and adder, and controlling access of the coefficient storage.