The use of FPGAs for carrying out high-speed arithmetic computations has gained recognition in recent years. FPGA architectures that include logic blocks having multiple look-up-table function generators, such as the XC4000™ family of devices from XILINX, Inc., the assignee of the present invention, are particularly suited for such computations. However, many of the important DSP algorithms are multiply-intensive, and even FPGAs having the largest number of logic blocks normally cannot embed the multiplier circuits and the attendant control and support circuits in a single chip. It therefore becomes incumbent on the designer to choose efficient DSP algorithms and to realize them with efficient circuit designs. The fast Fourier transform (FFT) is an outstanding example of an efficient DSP algorithm, and distributed arithmetic is a well-established design approach that replaces gate-consuming array multipliers with efficient shift-and-add equivalent circuits that offer comparable performance.
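By way of illustration only, the shift-and-add principle behind distributed arithmetic may be sketched in software as follows. The sketch assumes integer coefficients and two's-complement input samples of a fixed word length; the function names build_lut and da_inner_product are hypothetical and are not taken from any referenced circuit. A look-up table holds every possible sum of the fixed coefficients, and the inner product is accumulated one bit plane at a time, so that no multiplier is required.

```python
def build_lut(coeffs):
    """Precompute all 2**K sums of the K fixed coefficients (hypothetical helper)."""
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


def da_inner_product(coeffs, samples, bits=8):
    """Compute sum(c * x) using only table look-ups, shifts and adds.

    Assumes each sample fits in `bits`-bit two's complement.
    """
    lut = build_lut(coeffs)
    acc = 0
    for j in range(bits):
        # Bit j of every sample forms the look-up-table address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> j) & 1) << i
        partial = lut[addr] << j  # weight the table output by 2**j
        if j == bits - 1:
            acc -= partial  # the sign bit carries negative weight
        else:
            acc += partial
    return acc


coeffs = [3, -5, 7, 2]
samples = [10, -4, 0, 127]
print(da_inner_product(coeffs, samples))
print(sum(c * x for c, x in zip(coeffs, samples)))
```

The two printed values should agree, the second being the ordinary multiply-accumulate result. In hardware the table would typically reside in the look-up-table function generators of the logic blocks, though the mapping shown here is an assumption made for illustration.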
The discrete Fourier transform (DFT) of a sampled time series is closely related to the Fourier transform of the continuous waveform from which the time samples were taken. The FFT is a highly efficient procedure for computing the DFT of a time series and was reported by Cooley and Tukey in 1965 ("AN ALGORITHM FOR THE MACHINE CALCULATION OF COMPLEX FOURIER SERIES" by J. W. Cooley and J. W. Tukey, Math. of Comput., Vol. 19, pp. 297-301, April 1965). The highly space-efficient implementation of a radix-2 circuit, illustrated and described in co-pending U.S. patent application Ser. Nos. 08/815,019 and 08/937,977 (filed on Sep. 26, 1997), both assigned to the assignee of the present invention and incorporated herein by reference, allows for the implementation of complex FFT circuitry in a single programmable logic device.
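The relationship between the DFT and a radix-2 FFT may be sketched as follows. This is a minimal software illustration, assuming an input length that is a power of two; it shows the general decimation-in-time recursion and is not the circuit of the referenced applications. The function names dft_direct and fft_radix2 are hypothetical.

```python
import cmath


def dft_direct(x):
    """Evaluate the DFT definition directly; about N*N complex multiplications."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # transform of even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: one twiddle multiplication serves two outputs.
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


signal = [1, 2, 3, 4, 0, -1, -2, -3]
slow = dft_direct(signal)
fast = fft_radix2(signal)
print(max(abs(a - b) for a, b in zip(slow, fast)))
```

The printed value is the largest difference between the two results and should be at the level of floating-point rounding error. The radix-2 form requires on the order of N log2 N operations rather than N squared.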
The DFT is one of the core algorithms used in many signal processing applications. Its efficient computation is therefore of paramount importance. Higher-dimensional, i.e., multi-dimensional (M-D), transforms are also of great interest in many systems. A commonly used implementation approach employs software-programmable VLSI DSPs. However, significant performance gains can be attained using an alternative technology such as the Xilinx XC4000™ series field programmable gate arrays (FPGAs).
Conventional approaches, e.g., those using the Cooley-Tukey (CT) algorithm, suffer from inefficiencies associated with exploiting the transform separability and decomposing the calculation into a sequence of 1-D problems. Shown in FIG. 1 is an example of an apparatus 10 for computing 2-D DFTs using conventional processing. Data in matrix form is input into a first RAM buffer 12. A first processor 14 using a first FFT (FFT1) reads out the input data from the buffer 12, computes the row transforms and outputs the result to a first bank of a second buffer 16. At the same time, a second processor 18 using a second FFT (FFT2) reads out the previously FFT1-transformed data from a second bank of buffer 16, computes the column transforms and outputs the result to a third buffer 20, from which an output can be taken. This technique involves a great deal of row and column processing of data matrices, which is multiplication-intensive. However, multiplication operations consume substantial FPGA resources and are preferably avoided.
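The conventional row-column decomposition described above may be sketched in software as follows. The sketch is illustrative only: it models the row stage (FFT1) and the column stage (FFT2) as successive 1-D transforms, uses a direct 1-D DFT in place of an FFT for brevity, and does not model the buffers 12, 16 and 20 or the concurrent operation of the two processors. All function names are hypothetical.

```python
import cmath


def dft_1d(x):
    """Direct 1-D DFT, standing in for either FFT stage."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dft_2d_row_column(matrix):
    """2-D DFT computed as row transforms followed by column transforms."""
    rows = [dft_1d(row) for row in matrix]             # first stage: each row
    cols = [dft_1d(list(col)) for col in zip(*rows)]   # second stage: each column
    return [list(r) for r in zip(*cols)]               # transpose back to row order


def dft_2d_direct(matrix):
    """2-D DFT evaluated from its definition, for comparison only."""
    n1, n2 = len(matrix), len(matrix[0])
    return [
        [
            sum(
                matrix[a][b] * cmath.exp(-2j * cmath.pi * (k1 * a / n1 + k2 * b / n2))
                for a in range(n1)
                for b in range(n2)
            )
            for k2 in range(n2)
        ]
        for k1 in range(n1)
    ]


data = [[1, 2, 3, 4], [0, -1, 5, 2]]
rc = dft_2d_row_column(data)
ref = dft_2d_direct(data)
print(max(abs(rc[i][j] - ref[i][j]) for i in range(2) for j in range(4)))
```

The printed value should be at the level of rounding error, confirming that the separable row-column procedure reproduces the 2-D transform. Every element passes through two 1-D transforms, each requiring complex multiplications by twiddle factors, which is the source of the multiplication burden noted above.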