This invention relates to vector rotators and computers of sine and cosine, especially to high-radix CORDIC vector rotators.
Vector rotation and the computation of sine and cosine (which are reducible to vector rotation) have applications in many areas that are critical to modern technology, such as telecommunications, image processing, radar, and digital signal processing. More specifically, vector rotation is used in such diverse applications as image rotation, Fourier and other Transform computations, modulation and demodulation. For example, in the computation of Discrete Fourier Transforms (including when using the Fast Fourier Transform algorithm), many multiplications of complex numbers are called for. However, each such multiplication is actually a vector rotation, and could be done using less circuit space by using a CORDIC rotator rather than 4 real-number multipliers.
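The equivalence between a complex multiplication by a unit-modulus factor and a plane rotation can be checked numerically. The following is a minimal sketch; the function names `rotate` and `multiply_by_unit_phasor` and the sample values are illustrative assumptions, not part of the original disclosure.

```python
import cmath
import math

def rotate(x, y, theta):
    """Rotate (x, y) by theta with the textbook formulas (4 real multiplications)."""
    c, s = math.cos(theta), math.sin(theta)
    return x * c - y * s, x * s + y * c

def multiply_by_unit_phasor(z, theta):
    """Multiply the complex number z by exp(j*theta), as a transform butterfly would."""
    return z * cmath.exp(1j * theta)

theta = 2 * math.pi * 3 / 16      # an arbitrary sample angle
z = complex(0.5, -0.25)           # an arbitrary sample operand

xr, yr = rotate(z.real, z.imag, theta)
w = multiply_by_unit_phasor(z, theta)

# Both routes give the same point, up to floating-point rounding.
print(abs(w.real - xr) < 1e-12 and abs(w.imag - yr) < 1e-12)
```

The four products in `rotate` are the ones a rotator circuit can replace with shifts and additions.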
The original CORDIC family of algorithms was discovered by Volder in 1956 and published three years later in the following paper: J. E. Volder, "The CORDIC Trigonometric Computing Technique," IRE Transactions on Electronic Computers, EC-8, pp. 330-334, 1959. The CORDIC computer that Volder built computed in radix-2, that is, the convergence rate was 1 bit per iteration, and was used for aircraft navigation. Volder developed algorithms using essentially the same principle for computing many different functions, with vector rotation included.
A particularly simple explanation of the basic, radix-2 CORDIC algorithm, found in Ray Andraka's paper, "A Survey of CORDIC Algorithms for FPGAs," Sixth International ACM/SIGDA Symposium on FPGA, Feb. 1998, pp. 191-200, runs as follows:
The well-known formulae for vector rotation can be rewritten as:
$$x' = \cos\theta\,[x - y\tan\theta] \qquad (1)$$
$$y' = \cos\theta\,[y + x\tan\theta] \qquad (2)$$
where $(x, y)$ and $(x', y')$ are the original and the rotated vectors, respectively, and $\theta$ is the angle of rotation.
If we restrict the rotation angles $\theta$ so that $\tan\theta = \pm 2^{-(i-1)}$, for positive integer values of $i$, then the multiplication by the tangent in equations (1) and (2) is reduced to a shift operation when the numbers are represented in binary. (We assume that numbers are two's complement numbers.) It turns out that all angles within a certain useful range (that is, approximately $[-1.743, 1.743]$) can be expressed as a weighted sum of arctangents of $2^{-(i-1)}$ for some small set of contiguous positive integers $i$. In particular, if the weights are all $\pm 1$ then we can rotate a vector $(x, y)$ by iteratively applying the following formula:
$$x_i = K_{i-1}\,[x_{i-1} - y_{i-1}\,d_{i-1}\,2^{-(i-1)}] \qquad (3)$$
$$y_i = K_{i-1}\,[y_{i-1} + x_{i-1}\,d_{i-1}\,2^{-(i-1)}] \qquad (4)$$
where
$$K_{i-1} = \cos\left(\tan^{-1} 2^{-(i-1)}\right) = \frac{1}{\sqrt{1 + 2^{-2(i-1)}}}$$
and $d_{i-1} = \pm 1$.
We will henceforth refer to the application of this formula as the $i$th iteration in radix-2 CORDIC. In radix-4 CORDIC, each iteration can be thought of as a simulation of two radix-2 iterations. Therefore we will call the first iteration "iteration number two," the second iteration "iteration number four," and, in general, the $j$th iteration "iteration number $2j$."
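One way to see what "two radix-2 iterations in one" involves is to compose two unscaled radix-2 steps algebraically. The sketch below assumes two shift factors $s_a$ and $s_b$ (powers of two) and two signs $d_a, d_b = \pm 1$; these symbols are introduced here for illustration only and do not appear in the original text.

```latex
\begin{aligned}
x'' &= x - d_a s_a\,y - d_b s_b\,y - d_a d_b s_a s_b\,x,\\
y'' &= y + d_a s_a\,x + d_b s_b\,x - d_a d_b s_a s_b\,y,\\
\sqrt{(x'')^2 + (y'')^2} &= \sqrt{(1+s_a^2)(1+s_b^2)}\;\sqrt{x^2+y^2}.
\end{aligned}
```

Under these assumptions each output component is a sum of four shifted operands, and the magnitude gain does not depend on the signs chosen, because $d_a^2 = d_b^2 = 1$.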
In practice, we would like to omit the multiplication by the $K_{i-1}$ factor, in which case we would not be merely rotating the vector, but also amplifying it by a factor of $1/K_{i-1}$ in iteration $i$. The total gain for all iterations would be the product of all the $1/K_{i-1}$ factors, and would be a constant for a fixed number of iterations $n$. As $n$ approaches infinity, this constant gain approaches approximately 1.647. In many applications, this gain does no harm so long as it is constant. And it is a constant for a fixed $n$ (number of iterations), so long as $d_{i-1} = \pm 1$.
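The limiting gain quoted above can be reproduced by multiplying the per-iteration factors $1/K_{i-1} = \sqrt{1 + 2^{-2(i-1)}}$. This is a small numerical sketch; the function name `cordic_gain` is a hypothetical helper chosen for this illustration.

```python
import math

def cordic_gain(n):
    """Product of 1/K_{i-1} = sqrt(1 + 2^(-2(i-1))) for i = 1..n."""
    gain = 1.0
    for i in range(1, n + 1):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * (i - 1)))
    return gain

# The gain depends only on the number of iterations, not on the signs d_{i-1}.
print(round(cordic_gain(32), 3))  # 1.647
```

Because the product settles quickly, a rotator with a fixed iteration count can compensate for the gain with a single constant.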
To apply the above theory to an actual digital apparatus for rotating a vector by a given angle, we use what Volder, in his 1959 paper referred to earlier in this document, called "rotation-mode CORDIC," which requires 3 input numbers: one for each of the two components $x_0$, $y_0$ of the vector to be rotated, and a third number $\theta_0$ between $-1.743$ and $+1.743$ for the angle by which the given vector is to be rotated. The equations for the $i$th iteration of traditional, Volder-style radix-2 rotation-mode CORDIC are thus as follows:
$$x_i = x_{i-1} - y_{i-1}\,d_{i-1}\,2^{-(i-1)} \qquad (5)$$
$$y_i = y_{i-1} + x_{i-1}\,d_{i-1}\,2^{-(i-1)} \qquad (6)$$
$$\theta_i = \theta_{i-1} - d_{i-1}\tan^{-1}\left(2^{-(i-1)}\right) \qquad (7)$$
where $d_{i-1} = -1$ if $\theta_{i-1} < 0$ and $+1$ otherwise.
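Equations (5) through (7) can be exercised with a short floating-point simulation. This is only a sketch: real hardware would use fixed-point shifts and a stored arctangent table, and the function name `cordic_rotate`, the 32-iteration default, and the pre-scaling of the input by the reciprocal gain are assumptions made for this example.

```python
import math

def cordic_rotate(x, y, theta, n=32):
    """Radix-2 rotation-mode CORDIC, equations (5)-(7), without gain correction."""
    for i in range(1, n + 1):
        d = -1.0 if theta < 0 else 1.0        # sign rule for d_{i-1}
        shift = 2.0 ** -(i - 1)               # a right shift by (i-1) bits in hardware
        x, y = x - y * d * shift, y + x * d * shift
        theta -= d * math.atan(shift)         # angle accumulator update
    return x, y, theta

# Pre-scale the input by the reciprocal of the constant gain so that the
# outputs approximate cos and sin of the requested angle.
gain = math.prod(math.sqrt(1.0 + 4.0 ** -(i - 1)) for i in range(1, 33))
x, y, residual = cordic_rotate(1.0 / gain, 0.0, 0.5)

print(abs(x - math.cos(0.5)) < 1e-8, abs(y - math.sin(0.5)) < 1e-8)
```

Starting from the vector $(1/\text{gain}, 0)$ is one common way to obtain sine and cosine directly from a rotator.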
The choice of $d_{i-1}$ at each iteration is made to bring the value in the angle accumulator (which was initialized to $\theta_0$, the angle by which the vector is to be rotated) as close to 0 as possible. The idea is that after all the iterations have been performed, that angle would become 0 for the given precision at which the angle accumulator is kept. As a consequence of that angle becoming 0, the given vector will have been rotated by an amount equal to the input angle $\theta_0$. Traditionally we need as many iterations as there are fraction bits in the angle accumulator. But according to the 1998 paper by Andraka mentioned earlier in this document, in practice it is possible to go through fewer iterations if we accept the resulting imprecision in the total amount of vector rotation. According to that reference, the magnitude converges much faster than the phase, and so in applications in which phase accuracy is not critical (which is not uncommon in telecommunications, for example), only about half the usual number of iterations will be required.
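The behavior of the angle accumulator alone can be illustrated as follows. The sketch assumes floating-point arithmetic and uses a hypothetical helper, `drive_angle_to_zero`; it shows both the approximate range of reachable angles and how the residual angle shrinks as more iterations are performed.

```python
import math

def drive_angle_to_zero(theta0, n):
    """Apply n radix-2 angle updates; return the chosen signs and the residual angle."""
    theta = theta0
    signs = []
    for i in range(1, n + 1):
        d = -1 if theta < 0 else 1
        theta -= d * math.atan(2.0 ** -(i - 1))
        signs.append(d)
    return signs, theta

# The largest reachable angle is the sum of all elementary angles.
max_angle = sum(math.atan(2.0 ** -(i - 1)) for i in range(1, 41))
print(round(max_angle, 3))  # 1.743

# The residual is bounded by the last elementary angle applied.
for n in (8, 16, 24):
    _, residual = drive_angle_to_zero(1.0, n)
    print(n, abs(residual) <= math.atan(2.0 ** -(n - 1)))
```

Each additional iteration roughly halves the bound on the residual, which is the 1-bit-per-iteration convergence of radix-2 CORDIC.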
Though simple, the method just explained suffers from 1-bit-at-a-time convergence. That is, for $n$ bits of fractional precision, $n$ iterations are needed (for full accuracy both in phase and magnitude), each involving 3 full-precision additions or subtractions. What seems to slow Volder's circuit down is that it is unobvious how to select a $d_{i-1}$ without first computing $\theta_{i-1}$. But improvements are possible, as we will discuss next.
Many researchers and inventors have improved on or extended Volder's method in various ways over the last few decades. Of these improvements or extensions, one of the most remarkable (and relevant to the result to be presented here) was by Baker, explained in the following paper: P. W. Baker, "Suggestion for a Fast Sine/Cosine Generator," IEEE Transactions on Computers, pp. 1134-1136, Nov. 1976. Stated simply, Baker based his circuit on the observation that after a few initial radix-2 iterations, an entire sequence of $d_{i-1}$ values can be predicted at once, allowing the corresponding iterations to be done simultaneously using carry-save adders. However, Baker did not have a solution for the problem of speeding up the initial iterations. Thus improvements are still possible wherein the initial iterations would also be sped up.
In Vitit Kantabutra's article, "On Hardware for Computing Exponential and Trigonometric Functions," IEEE Transactions on Computers, 45:3, March 1996, as well as in Vitit Kantabutra's U.S. Patent No. 6,055,553, entitled "Apparatus for Computing Exponential and Trigonometric Functions," a new CORDIC variant was presented, wherein 8 iterations are lumped into a single iteration that does not take as long as 8 of the original iterations because of the fast, low-precision arithmetic used. This scheme therefore is able to speed up initial iterations (as well as the latter iterations). Due to the need for circuitry to handle 8 "logical" or original iterations in a single "physical" or new iteration, that CORDIC variant is suitable for application in very high-density technologies such as custom CMOS VLSI.
When cost is more of a concern, it would be preferable not to lump so many iterations into one new one. Little work has been done in high-radix CORDIC to date. In M. D. Ercegovac, "Radix-16 Evaluation of Certain Elementary Functions," IEEE Transactions on Computers, C-22:6, June 1973, radix-16 CORDIC algorithms were presented. However, that paper did not include any details on sine and cosine computation, that is, vector rotation. Ercegovac claimed without going into details that the computation of such functions would be possible using his method. However, it is quite unobvious how (or even whether) this would be possible with his method, because the computation of sine and cosine is quite different from the computation of many other functions using CORDIC: when computing sine and cosine, each iteration gives rise to an amplification factor greater than 1 of the vector $(x_i, y_i)$. In the method proposed by Ercegovac, this factor would NOT be a constant, but would depend on the answer digit chosen in each iteration.
To elaborate further, we note that unlike in the particular version of radix-2 CORDIC discussed above, Ercegovac's method allows an answer digit (which is the equivalent of what we called $d_{i-1}$ above) of zero. This would mean no amplification in iteration $i$. Thus the total amplification for all iterations would no longer be a constant. For radix-4 CORDIC, many different amplification factors are possible, depending on the magnitude of the particular answer digit picked. Non-constant amplification is a problem that researchers and inventors have had to deal with in radix-2 CORDIC as well as high-radix CORDIC. In the former case, this problem only occurs if an iteration without any rotation is allowed, that is, $d_{i-1} = 0$.
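The effect of a zero digit on the overall gain can be seen in the radix-2 case with a small numerical sketch. The helper `total_gain` and the sample digit sequences are assumptions for illustration; iteration $i$ is taken to contribute a factor $\sqrt{1 + (d_{i-1} 2^{-(i-1)})^2}$, which equals 1 when $d_{i-1} = 0$.

```python
import math

def total_gain(digits):
    """Magnitude gain of unscaled radix-2 iterations using the given digit sequence."""
    gain = 1.0
    for i, d in enumerate(digits, start=1):
        step = d * 2.0 ** -(i - 1)
        gain *= math.sqrt(1.0 + step * step)   # a factor of exactly 1 when d == 0
    return gain

# With digits restricted to +1 or -1 the gain is the same for every sequence.
print(total_gain([1, -1, 1, 1]) == total_gain([-1, -1, -1, 1]))

# Allowing a zero digit makes the gain depend on which iterations were skipped.
print(total_gain([1, 0, 1, 1]) < total_gain([1, 1, 1, 1]))
```

A data-dependent gain of this kind cannot be removed by one precomputed constant, which is why a correction multiplication is otherwise needed.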
To further illustrate the unobviousness of how to perform high-radix CORDIC vector rotation without the problem of non-constant amplification, we next consider more recent prior art than Ercegovac's paper.
E. Antelo, et al., "High-Performance Rotation Architecture," IEEE Transactions on Computers, 46:8, Aug. 1997, designed a family of radix-4 CORDIC rotators. However, their rotators yield non-constant gains to the vector magnitude, which must be multiplied by the reciprocal of the respective gain before the final answer is ready. The non-constant gain was due to their use of the answer digit set $\{-2, -1, 0, 1, 2\}$. Lee and Lang, in their paper, "Constant-Factor Redundant CORDIC for Angle Calculation and Rotation," IEEE Transactions on Computers, 41:8, pp. 1016-1025, Aug. 1992, designed conventional as well as redundant high-radix CORDIC algorithms. Note that "redundant" here refers to the technique of storing numbers in redundant notation. This can save addition/subtraction time, but can also increase the time or circuit complexity for deciding each answer digit. We do not use the redundant technique in this paper, and so discussions concerning that technique will be omitted.
The drawback of the techniques presented in Lee and Lang's paper is that they only perform high-radix rotations for the latter half of the iterations; the first half are radix-2 rotations. Furthermore, their technique requires additional iterations to assure convergence. (As stated earlier, in high-precision circuits, the invention presented here also requires additional iterations, but rarely. Also, if the invention is to be used in a pipeline where predictable delay is a must, then we can always avoid long delays by using the invention ONLY in iterations in which such delays never occur, and fall back to a radix-2 rotation stage if radix-4 would cause much more delay than a conventional radix-2 stage.) The reason Lee and Lang allow radix-4 iterations in the latter half is that at that time the arctangent function can be expressed with only one "on" bit due to finite word length. The observation that allows them to accelerate the latter iterations is similar to, but simpler than, that which was used by Baker.
Making the first few iterations higher radix in order to speed up the entire computation is a bigger challenge than speeding up the latter iterations. It is making the first few iterations higher radix that we have found a solution for in this invention.
An improved radix-4 CORDIC vector rotator circuit iteration stage for initial iterations, using the answer digit set $\{-3, -1, 1, 3\}$ instead of the conventional choices of either $\{-3, -2, -1, 0, 1, 2, 3\}$ or $\{-2, -1, 0, 1, 2\}$, thereby achieving constant magnitude amplification. This invention belongs to the family of rotators that keep data in two's complement binary notation.
The invented circuit stage includes an answer digit decision module, which normally examines only a few digits of the remainder angle $\theta_{i-1}$, thereby saving time when compared to a full-length comparison operation. Very rarely does the answer digit decision process involve examining close to all the digits of the remainder angle.
When only a few digits of the remainder angle need to be examined, the circuit incurs only approximately 20% more delay than a conventional radix-2 CORDIC stage. Only in the rare instances where a full-length (or almost full-length) comparison is required does the radix-4 stage take twice as long as a radix-2 stage. The invented rotator stage can be used either as a pipeline stage or as a single-stage iterative circuit. In the pipeline case we may choose to use the invention in stages where only a few digits of the remainder angle need to be examined, and fall back to a radix-2 stage in iterations where a long comparison would be needed. But when the invention is to be used as a single-stage iterative circuit, long comparisons may be allowed more easily. Both versions have been implemented, and in the single-stage sample implementation, a long comparison is needed only 8.7 times per 1,000 complete vector rotations (not 1,000 iterations) on average.
Therefore, in any case, the computation of the remainder angle in each iteration is not much slower than its counterpart in a conventional radix-2 rotator, yet it achieves twice as much work.
The computation of the partially-rotated vector components $x_i$ and $y_i$ employs carry-save adders to distill the four operands into two (using only two full-adder delays) and then adds the two with an ordinary carry-propagate adder. Thus this computation, like the computation of the remainder angle, takes little more time than its counterpart in a conventional radix-2 rotator, yet it achieves twice as much work.
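The reduction of four operands to two through carry-save addition can be modeled at the bit level. The sketch below is a software model only, not the circuit of the invention: the function names `csa` and `add_four` are hypothetical, and Python's unbounded integers stand in for fixed-width two's complement words.

```python
def csa(a, b, c):
    """3:2 carry-save adder: compress three operands into a sum word and a carry word."""
    partial_sum = a ^ b ^ c                        # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1     # carries, moved to the next bit position
    return partial_sum, carry

def add_four(a, b, c, d):
    """Add four operands with two carry-save levels and one carry-propagate addition."""
    s1, c1 = csa(a, b, c)      # first full-adder delay: four operands become three
    s2, c2 = csa(s1, c1, d)    # second full-adder delay: three operands become two
    return s2 + c2             # the only addition that propagates carries

print(add_four(13, -7, 22, 5) == 13 - 7 + 22 + 5)
```

Neither carry-save level waits for carries to ripple across the word, so only the final addition has a delay that grows with the word length.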
The following is hereby claimed as the objects and advantages of the invention described herein: to provide a radix-4 CORDIC rotator iterative stage that can perform an initial radix-4 rotation in significantly less average (and sometimes worst-case) time than twice the time taken to perform a radix-2 rotation in the same device technology. The "worst-case" part holds for the very first iteration, and for iterations for which the answer digits of magnitude 1 and those of magnitude 3 can be distinguished from each other by examining significantly fewer bits than the number of bits of precision used for representing angles. (As will be explained later, the invention uses the answer digit set $\{-3, -1, 1, 3\}$.)
In addition I claim the following objects and advantages: to provide a fast radix-4 CORDIC rotator iteration stage that amplifies its input vector only by a fixed amount independent of the input vector. (The amount of amplification, of course, is dependent on the iteration index.)