The use of FPGAs for carrying out high-speed arithmetic computations has gained recognition in recent years. FPGA architectures including logic blocks having multiple look-up-table function generators, such as the XC4000™ family of devices from XILINX, Inc., the assignee of the present invention, are particularly suited for such computations. However, many of the important DSP algorithms are multiply-intensive, and even FPGAs having the largest number of logic blocks normally cannot embed the multiplier circuits and the attendant control and support circuits in a single chip. It becomes incumbent on the designer to choose efficient DSP algorithms and to realize them with efficient circuit designs. The fast Fourier transform (FFT) is an outstanding example of an efficient DSP algorithm, and distributed arithmetic is a well-established design approach that replaces gate-consuming array multipliers with efficient shift-and-add equivalent circuits that offer comparable performance.
The discrete Fourier transform (DFT) of a sampled time series is closely related to the Fourier transform of the continuous waveform from which the time samples were taken. The DFT is thus particularly useful for digital power spectrum analysis and filtering. The FFT is a highly efficient procedure for computing the DFT of a time series and was reported by Cooley and Tukey in 1965 ("AN ALGORITHM FOR THE MACHINE CALCULATION OF COMPLEX FOURIER SERIES" by J. W. Cooley and J. W. Tukey, Math of Comput., Vol. 19, pp. 297-301, April 1965).
The FFT takes advantage of the fact that the calculation of the coefficients of the DFT can be carried out iteratively, which results in a considerable saving of computation time. If the time series contains $N = 2^n$ samples, then for the N Fourier coefficients the FFT entails $2nN = 2N\log_2 N$ multiply operations (assuming a radix-2 butterfly). In contrast, the DFT algorithm requires $N^2$ multiply operations. The FFT advantage grows as N increases. Thus, an 8-point DFT and FFT require 64 and 48 multiply operations, respectively, while an 8192-point DFT and FFT require $67.1 \times 10^6$ and 212,992 multiply operations, respectively.
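The two operation counts can be compared with a short sketch. The function names below are illustrative only and do not come from the cited references; the FFT count simply evaluates the expression 2nN stated above for a radix-2 butterfly.

```python
def dft_multiplies(n_points: int) -> int:
    """Multiply count for the direct DFT: N squared."""
    return n_points ** 2


def fft_multiplies(n_points: int) -> int:
    """Multiply count 2*n*N for a radix-2 FFT, where N = 2**n."""
    n = n_points.bit_length() - 1          # n = log2(N) when N is a power of two
    if n_points < 2 or (1 << n) != n_points:
        raise ValueError("n_points must be a power of two, at least 2")
    return 2 * n * n_points


# Print both counts and their ratio for the two transform sizes discussed.
for size in (8, 8192):
    dft = dft_multiplies(size)
    fft = fft_multiplies(size)
    print(size, dft, fft, round(dft / fft, 1))
```

The ratio column shows how the advantage of the FFT widens as N grows.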
Distributed Arithmetic (DA) was developed as an efficient computation scheme for digital signal processing (DSP). A United States patent describing this scheme is U.S. Pat. No. 3,777,130 issued Dec. 3, 1974 entitled "DIGITAL FILTER FOR PCM ENCODED SIGNALS" by Croisier, D. J. Esteban, M. E. Levilion and V. Rizo. A comprehensive survey of DA applications in signal processing was made by White in "APPLICATIONS OF DISTRIBUTED ARITHMETIC TO DIGITAL SIGNAL PROCESSING: A TUTORIAL REVIEW", S. A. White, IEEE ASSP Magazine, July 1989.
The DA computation algorithm is now being effectively applied to embed DSP functions in FPGAs, particularly those with coarse-grained look-up table architectures. DA enables the replacement of the array multiplier, central to many DSP applications, with a gate-efficient serial/parallel multiplier with little or no reduction in speed. However, available FFT implementations have been limited in size due to space constraints.
DA makes extensive use of look-up tables (LUTs), thereby exploiting the LUT-based architecture of the Xilinx and other similarly structured FPGAs. The LUT used in a DA circuit will hereafter be called a DALUT. One can use a minimum set of DALUTs and adders in a sequential implementation to minimize cost. However, speed/cost tradeoffs can be made. Specifically, for higher speed, more DALUTs and adders may be employed. With enough DALUTs and adders, the range of tradeoffs extends to fully parallel operation, with all input bits applied simultaneously to the DALUTs and an output response generated on each system clock cycle.
DA differs from conventional arithmetic only in the order in which it performs operations. The transition from conventional to distributed arithmetic is illustrated in FIGS. 1, 2 and 3. In FIG. 1, which illustrates conventional arithmetic, the sum of products equation, $S = A \cdot K + B \cdot L + C \cdot M + D \cdot N$, is implemented with four serial/parallel multipliers operating concurrently to generate partial products. The full products are then summed in an adder tree to produce the final result, S. The functional blocks of the serial/parallel multiplier shown in the box of FIG. 1 include an array of 2-input AND gates, with the A input derived from a parallel-to-serial shift register and the K input applied bit-parallel to all AND gates. A P-bit parallel adder accepts the AND gate outputs as addend inputs and passes the sum to an accumulator register. A divide-by-2 block feeds back the register output to the augend inputs of the adder. In each clock cycle, one bit of the serially organized data (Ai, Bi, Ci, Di) is ANDed with the parallel operands (K, L, M, N) and four partial products are generated. Starting with the least significant serial bits, the partial products are stored in the four accumulator registers. On the next clock cycle, the next least significant bits again form partial products, which are then added to the previous partial products scaled by 1/2. The process repeats on successive clock cycles until the most significant bits have been shifted. When all the partial products, appropriately scaled, have been accumulated, they are fed to the adder array to produce the final output, S. Distributed arithmetic adds the partial products before, rather than after, scaling and accumulating them.
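The conventional arrangement of FIG. 1 can be modeled in software as a minimal sketch. The function names are hypothetical, the operands are assumed to be unsigned integers (sign handling is omitted), and exact fractions stand in for the accumulator register so that the divide-by-2 feedback loses no bits.

```python
from fractions import Fraction


def serial_parallel_multiply(serial: int, parallel: int, n_bits: int) -> int:
    """Model one serial/parallel multiplier: AND gates, adder, divide-by-2 feedback."""
    acc = Fraction(0)
    for i in range(n_bits):                 # least significant serial bit first
        bit = (serial >> i) & 1             # output of the parallel-to-serial register
        partial = bit * parallel            # array of 2-input AND gates
        acc = acc / 2 + partial             # adder with scaled-by-1/2 feedback
    # The accumulator holds the product scaled by 2**-(n_bits - 1); undo the scaling.
    return int(acc * 2 ** (n_bits - 1))


def conventional_sum_of_products(data, coeffs, n_bits: int) -> int:
    """Four multipliers run concurrently; an adder tree then sums the full products."""
    return sum(serial_parallel_multiply(x, c, n_bits) for x, c in zip(data, coeffs))


# Example with assumed 4-bit data (A, B, C, D) and coefficients (K, L, M, N).
print(conventional_sum_of_products((5, 3, 7, 2), (10, 20, 30, 40), n_bits=4))
```

Each multiplier keeps its own accumulator here, which is the cost that distributed arithmetic removes by summing the partial products before the shift-and-add step.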
FIG. 2 shows the first embodiment of the distributed arithmetic technique. The number of shift-and-add circuits is reduced to one, and that circuit is placed at the output of the array of simple adders; the number of simple adders remains the same. The two-input AND gates now precede the adders.
In a very important class of DSP applications known as linear, time-invariant systems, the coefficients (K, L, M and N in our example) are constants. Consequently, the data presented to the shift-and-add circuit (namely, the output of the AND gates and the three simple adders) depend only on the four shift register output bits. Replacing the AND gates and simple adders with a 16 word look-up table (DALUT) provides the final form (FIG. 3) of the distributed arithmetic implementation of the sum of products equation.
The DALUT contains the pre-computed values of all possible sums of coefficients weighted by the binary variables of the serial data (A, B, C and D) which previously constituted the second input to the AND gates. Now, with the four serial data sources serving as address lines to the DALUT, the DALUT contents may be tabulated as follows:
______________________________________
A    B    C    D    Address    Content
______________________________________
0    0    0    0    0          0
0    0    0    1    1          N
0    0    1    0    2          M
0    0    1    1    3          M + N
0    1    0    0    4          L
0    1    0    1    5          L + N
0    1    1    0    6          L + M
0    1    1    1    7          L + M + N
1    0    0    0    8          K
1    0    0    1    9          K + N
1    0    1    0    10         K + M
1    0    1    1    11         K + M + N
1    1    0    0    12         K + L
1    1    0    1    13         K + L + N
1    1    1    0    14         K + L + M
1    1    1    1    15         K + L + M + N
______________________________________
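The table and the single shift-and-add stage of FIG. 3 can be sketched as follows. The function names are hypothetical, the data and coefficients are assumed to be unsigned integers, and exact fractions again model the scaled-by-1/2 accumulator feedback.

```python
from fractions import Fraction


def build_dalut(k: int, l: int, m: int, n: int) -> list:
    """Pre-compute the 16-word DALUT; address bits are A (MSB), B, C, D (LSB)."""
    coeffs = (k, l, m, n)
    table = []
    for address in range(16):
        bits = [(address >> (3 - j)) & 1 for j in range(4)]   # A, B, C, D
        table.append(sum(bit * c for bit, c in zip(bits, coeffs)))
    return table


def da_sum_of_products(data, table, n_bits: int) -> int:
    """Bit-serial distributed arithmetic: one look-up and one shift-and-add per clock."""
    a, b, c, d = data
    acc = Fraction(0)
    for i in range(n_bits):                                   # LSB first
        address = ((((a >> i) & 1) << 3) | (((b >> i) & 1) << 2)
                   | (((c >> i) & 1) << 1) | ((d >> i) & 1))
        acc = acc / 2 + table[address]                        # single accumulator
    return int(acc * 2 ** (n_bits - 1))                       # undo the 1/2 scaling


# Example with assumed coefficients K, L, M, N and 4-bit data A, B, C, D.
dalut = build_dalut(10, 20, 30, 40)
print(dalut[3], dalut[15])
print(da_sum_of_products((5, 3, 7, 2), dalut, n_bits=4))
```

For the same operands, the result agrees with a direct evaluation of A·K + B·L + C·M + D·N, while the circuit it models needs only one accumulator.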
In general, the length (number of words) of the DALUT is $2^a$, where "a" is the number of address lines. The width, or number of bits per word, cannot be precisely defined; it has an upper limit of $b + \log_2 a$, where b is the number of coefficient bits, because of the word growth that occurs when the coefficients are summed, as the contents of the DALUT indicate. The width of the table defines the coefficient accuracy and may not match the number of signal bits (e.g., the bits of A, B, C, and D), which define the dynamic range or linearity of the computation process.
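These sizing rules can be checked with a brief sketch. The function name is hypothetical, the coefficients are assumed to be unsigned, and rounding the logarithm up when "a" is not a power of two is an assumption made here, not something stated above.

```python
import math


def dalut_dimensions(address_lines: int, coeff_bits: int):
    """Return (number of words, upper limit on bits per word) for a DALUT."""
    words = 2 ** address_lines
    # Summing up to `address_lines` coefficients grows the word by log2(a) bits.
    max_width = coeff_bits + math.ceil(math.log2(address_lines))
    return words, max_width


# Four address lines (A, B, C, D) and assumed 8-bit unsigned coefficients.
words, width = dalut_dimensions(4, 8)
largest_entry = 4 * (2 ** 8 - 1)          # K + L + M + N with every coefficient at maximum
print(words, width, largest_entry.bit_length())
```

With four address lines the largest possible entry is the sum of four full-scale coefficients, which fits within the b + 2 bits given by the limit.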
Large FFTs in a Single FPGA
Now that the array multiplier has been replaced by a gate-efficient distributed circuit, there remains a second obstacle to overcome before a large FFT can be practically embedded in a single FPGA, namely, the large memory required for the sine/cosine basis functions. This problem was addressed, in part, in related U.S. patent application Ser. No. 08/815,019, entitled "A METHOD FOR CONFIGURING AN FPGA FOR LARGE FFTS AND OTHER VECTOR ROTATION COMPUTATIONS", incorporated herein, which introduced the radix-2 butterfly core for implementation of an FFT in an FPGA environment. The radix-2 butterfly core comprises a gate array implementation of the complex multiplication $(x + jy)e^{-j\theta}$, where x and y are rectangular coordinates of a complex vector and $\theta$ is an angle of rotation. Complex multiply computations are iterative and are performed in pipelined DA stages that are nearly identical save for their look-up tables, which reflect the angle segments.
Referring to FIG. 4, we see a simplified version of the radix-2-based FFT implementation. It should be noted, however, that for large transforms, higher order radices (e.g. 4 or 8) may offer greater efficiency; these, too, are amenable to DA implementation. The circuit has two inputs, $X_m$ and $X_n$, and two outputs, $A_m$ and $A_n$. Their relationship can be summarized as: $A_m = X_m + X_n = X_{Rm} + X_{Rn} + j(X_{Im} + X_{In})$; and $A_n = [X_m - X_n] \times W^k$, where $W^k = e^{-j\theta_k} = \cos\theta_k - j\sin\theta_k$ and $\theta_k = 2\pi k/N$. $A_n$ can be expanded as $A_n = (X_{Rm} - X_{Rn})\cos\theta_k + (X_{Im} - X_{In})\sin\theta_k + j[(X_{Rm} - X_{Rn})(-\sin\theta_k) + (X_{Im} - X_{In})\cos\theta_k]$.
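The butterfly relationships can be written out numerically as a minimal floating-point sketch. The function name and the sample inputs are illustrative only; the sketch evaluates the expanded real and imaginary parts rather than modeling the FPGA circuit.

```python
import cmath
import math


def radix2_butterfly(x_m: complex, x_n: complex, k: int, n_points: int):
    """Return (A_m, A_n) for one radix-2 butterfly with twiddle angle 2*pi*k/N."""
    theta = 2 * math.pi * k / n_points
    a_m = x_m + x_n                                  # sum output
    d_r = x_m.real - x_n.real                        # real part of the difference
    d_i = x_m.imag - x_n.imag                        # imaginary part of the difference
    a_n = complex(d_r * math.cos(theta) + d_i * math.sin(theta),
                  -d_r * math.sin(theta) + d_i * math.cos(theta))
    return a_m, a_n


# Cross-check the expanded form against direct multiplication by e^{-j*theta}.
x_m, x_n, k, n_points = 3 + 4j, 1 - 2j, 1, 8
a_m, a_n = radix2_butterfly(x_m, x_n, k, n_points)
direct = (x_m - x_n) * cmath.exp(-1j * 2 * math.pi * k / n_points)
print(a_m, a_n, abs(a_n - direct) < 1e-12)
```

The cross-check confirms that the expanded real and imaginary parts equal the complex product of the difference with the twiddle factor.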
Implementation of the above equations in an FPGA is therefore accomplished using a DALUT which contains the pre-computed sums of partial products for combinations of the input variables $(X_m - X_n)$, and for all N/2 values of $\theta_k$. The DALUT is addressed by the two input variables $(X_{Rm} - X_{Rn})$ and $(X_{Im} - X_{In})$ and the k bits defining $\theta_k$.
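One way such a table might be organized is sketched below. This is a simplified illustration rather than the circuit of the related application: the function names, the 12-bit coefficient scaling, and the restriction to unsigned difference values are all assumptions, and a left shift by the bit position stands in for the hardware's scaled-by-1/2 accumulator feedback.

```python
import math


def quantized_twiddle(k: int, n_points: int, frac_bits: int):
    """Integer cosine and sine of theta_k = 2*pi*k/N, scaled by 2**frac_bits."""
    theta = 2 * math.pi * k / n_points
    scale = 1 << frac_bits
    return round(math.cos(theta) * scale), round(math.sin(theta) * scale)


def build_rotation_dalut(n_points: int, frac_bits: int = 12) -> dict:
    """Table keyed by (k, real bit, imaginary bit) holding (real, imag) partial sums."""
    table = {}
    for k in range(n_points // 2):                   # all N/2 values of theta_k
        c, s = quantized_twiddle(k, n_points, frac_bits)
        for b_r in (0, 1):
            for b_i in (0, 1):
                table[(k, b_r, b_i)] = (b_r * c + b_i * s, -b_r * s + b_i * c)
    return table


def da_rotate(d_r: int, d_i: int, k: int, table: dict, n_bits: int):
    """Bit-serial rotation of the unsigned difference (d_r + j*d_i) by theta_k."""
    acc_r = acc_i = 0
    for i in range(n_bits):                          # one table look-up per bit
        re, im = table[(k, (d_r >> i) & 1, (d_i >> i) & 1)]
        acc_r += re << i                             # weight the entry by 2**i
        acc_i += im << i
    return acc_r, acc_i                              # still scaled by 2**frac_bits


table = build_rotation_dalut(n_points=8)
print(len(table), da_rotate(5, 3, 1, table, n_bits=4))
```

Signed (two's complement) differences would need the sign bit treated separately, which this sketch leaves out.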
While the radix-2 implementation disclosed in the parent case provides a number of significant advantages over the prior art, there remains a need to increase the speed of such circuits without increasing the size of FPGA-implemented DSP circuit designs beyond the capacity of available devices.