Despite many new technologies, the Fourier transform remains the workhorse of signal processing analysis and is expected to remain so. The Discrete Fourier Transform (DFT) is a mathematical procedure that stands at the center of the processing that takes place inside a Digital Signal Processor. Much as a prism splits a light beam into its component colors, the Fourier transform generates a map of a signal, called its spectrum, in terms of the energy amplitude over its various frequency components. The DFT operates on samples of the signal taken at regular (i.e., discrete) time intervals determined by the signal's sampling rate. The signal spectrum can then be mathematically processed according to the requirements of a specific application such as noise filtering or image enhancement.
When the DFT is applied to samples taken from a complicated and irregular signal, such as that generated by speech in a microphone, the result is a set of sine and cosine coefficients, which represent the amplitude of the signal at given frequencies. When standard sine and cosine waves of the appropriate frequencies are multiplied by these coefficients and added back together, the original waveform is exactly reconstructed, as shown in FIG. 19. Therefore, a DFT is a decomposition of a sampled signal in terms of sinusoidal (complex exponential) components.
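The decomposition and reconstruction described above can be sketched as follows. This is a minimal illustration rather than the method of any particular figure: the function names `dft` and `inverse_dft` and the sample values are assumptions chosen for the example, and the reconstruction is exact only up to floating-point rounding.

```python
import cmath

def dft(x):
    """Direct DFT: X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Rebuild the samples by summing the weighted complex exponentials."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

# Illustrative samples (hypothetical values, not taken from the source).
samples = [0.0, 1.5, -0.7, 2.0, 0.3, -1.1, 0.9, 0.4]
spectrum = dft(samples)          # complex coefficients, one per frequency
rebuilt = inverse_dft(spectrum)  # weighted sinusoids added back together

# Largest difference between the original and the rebuilt samples.
max_error = max(abs(r - s) for r, s in zip(rebuilt, samples))
print(max_error)
```

The real and imaginary parts of each coefficient play the role of the cosine and sine coefficients mentioned above.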
Because of its large computational requirements, a direct DFT algorithm, which requires N^2 complex multiplications plus a small number of operations to complete each complex addition or subtraction, is typically not used for real-time signal processing. Several efficient methods have been developed to compute the DFT, in which the symmetry and periodicity properties of the DFT are exploited to significantly lower its computational requirements. The resulting algorithms are known collectively as fast Fourier transforms (FFTs).
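To give a sense of the savings, the sketch below compares the N^2 multiplication count of the direct DFT with the (N/2)·log2 N count commonly quoted for a radix-2 FFT. The radix-2 figure is a standard textbook count and is an assumption here, not a number taken from this description; the function names are illustrative.

```python
def direct_dft_multiplications(n):
    """Complex multiplications for a direct N-point DFT: N^2."""
    return n * n

def radix2_fft_multiplications(n):
    """Textbook count for a radix-2 FFT: (N/2) * log2(N) butterflies,
    one complex multiplication each (trivial twiddles included)."""
    stages = n.bit_length() - 1
    if n < 1 or (1 << stages) != n:
        raise ValueError("N must be a power of two")
    return (n // 2) * stages

for n in (16, 256, 1024):
    print(n, direct_dft_multiplications(n), radix2_fft_multiplications(n))
```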
The basis of the FFT is that a DFT can be divided into two smaller DFTs, each of which is divided into two smaller DFTs, and so on, resulting in a combination of two-point DFTs. In a similar fashion, a radix-4 FFT divides the DFT into four smaller DFTs, each of which is divided into four smaller DFTs, and so on, resulting in a combination of four-point DFTs. FIG. 4 is an example of a 16-point radix-2 FFT on four parallel processors combined with four radix-4 butterflies.
Several methods are used repeatedly to split the DFTs into smaller (two- or four-point) core or kernel calculations, as shown in FIGS. 20(a) and 20(b).
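The two-point and four-point kernels can be written out directly from the DFT definition. The sketch below shows only the core calculations, without the twiddle-factor multiplications that precede or follow them in a full FFT; the function names are hypothetical and the code is not a transcription of FIGS. 20(a) and 20(b).

```python
def radix2_butterfly(a, b):
    """Two-point DFT kernel: sum and difference of the inputs."""
    return a + b, a - b

def radix4_butterfly(a, b, c, d):
    """Four-point DFT kernel. The only factors involved are 1, -1, j and -j,
    so no general complex multiplication is needed."""
    return (a + b + c + d,
            a - 1j * b - c + 1j * d,
            a - b + c - d,
            a + 1j * b - c - 1j * d)

print(radix2_butterfly(1, 2))
print(radix4_butterfly(1, 2, 3, 4))
```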
One “rediscovery” of the FFT, that of Danielson and Lanczos in 1942, provides one of the clearest derivations of the algorithm. Danielson and Lanczos showed that a DFT could be written as the sum of two DFTs, each of length N/2. One of the two is formed from the even-numbered points of the original N, the other from the odd-numbered points. The wonderful thing about the Danielson-Lanczos Lemma is that it can be used recursively. Having reduced the problem of computing X(k) to that of computing Xe(k) and Xo(k), the same reduction can be applied to Xe(k), turning it into the problem of computing the transforms of its N/4 even-numbered input data and its N/4 odd-numbered input data. In other words, Xee(k) and Xeo(k) can be defined to be the DFTs of the points that are respectively even-even and even-odd on the successive subdivisions of the data. With the restriction that N be a power of two, it is evident that the Danielson-Lanczos Lemma can be applied until the data are subdivided all the way down to transforms of length one, as shown in FIG. 21. The Fourier transform of length one is just the identity operation that copies its one input number into its one output slot. Thus, for every pattern of log_2 N e's and o's, there is a one-point transform that is just one of the input numbers x(n), that is, Xeoeeoeo...oee(k) = x(n) for some n.
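The recursive use of the lemma can be sketched in a few lines. This is a minimal illustration that assumes N is a power of two and uses the forward-transform sign convention exp(-2πik/N); the function name `fft_recursive` is hypothetical, and the code favors clarity over speed.

```python
import cmath

def fft_recursive(x):
    """Radix-2 FFT by repeated even/odd splitting (Danielson-Lanczos)."""
    n = len(x)
    if n == 1:
        # A transform of length one is the identity operation.
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_recursive(x[0::2])  # Xe(k): even-numbered points
    odd = fft_recursive(x[1::2])   # Xo(k): odd-numbered points
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

print(fft_recursive([1, 1, 1, 1]))
```

A constant input such as `[1, 1, 1, 1]` has all of its energy in the zero-frequency term, which is a quick sanity check on the recursion.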
The value of n that corresponds to a given pattern of e's and o's is obtained by reversing the pattern and letting e=0 and o=1, which gives the value of n in binary representation.
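The rule can be checked with a short sketch. The function name `index_from_pattern` is an assumption for the example; patterns are written as strings of `e` and `o`, with the first subdivision first.

```python
from itertools import product

def index_from_pattern(pattern):
    """Reverse the e/o pattern, read e as 0 and o as 1, and interpret
    the result as the binary representation of n."""
    if not pattern or any(c not in "eo" for c in pattern):
        raise ValueError("pattern must be a non-empty string of 'e' and 'o'")
    bits = "".join("0" if c == "e" else "1" for c in reversed(pattern))
    return int(bits, 2)

# All patterns for N = 8 (three subdivisions), in e-before-o order.
for p in product("eo", repeat=3):
    pattern = "".join(p)
    print(pattern, index_from_pattern(pattern))
```

For N = 8, listing the patterns in this order yields the input indices 0, 4, 2, 6, 1, 5, 3, 7, which is the familiar bit-reversed ordering.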
For the last decade, the main concern of researchers in this field has been to develop an FFT algorithm in which the number of required operations is minimized. Recent findings have shown that the number of multiplications required to compute the DFT of a sequence may be considerably reduced by using one of the FFT algorithms, and interest has arisen both in finding applications for this powerful transform and in considering various FFT software and hardware implementations. As a result, different pre- and post-processing techniques have been developed to further reduce the computational costs when the input sequence is known to satisfy some a priori conditions.
For instance, if the input sequence is real, the DFT may be computed using a half-complex input DFT. One of the bottlenecks in most applications where high performance is required is the FFT/IFFT processor. If the 2^n or 4^n restriction on the transform length is a problem, the solution is to design a radix-r butterfly processing element (PE) comprising butterflies (or engines) with identical structures that could be implemented in parallel in order to reduce the complexity of the PE and to decrease the processing time.
Each of these proposed algorithms has its own characteristic advantages and disadvantages. However, they all have two common problems, which are the communication load and the computational reduction. It is not unusual to find numerous algorithms to complete a given DFT task. Accordingly, finding the best algorithm is a crucial engineering problem for real-time signal analysis.
It has been shown that the butterfly computation relies on three major parameters: input data, output data and a twiddle factor. In order to control the data flow, numerous architectures for the dedicated FFT processor implementation have been proposed and developed. Some of the more common architectures are described briefly herein. The description is limited to architectures for implementation of the fixed and mixed radix common factor FFT algorithms.
Array Architecture
The array architecture is an isomorphic mapping of the FFT signal flow graph (SFG) with one dedicated processing element for each butterfly in the SFG. This architecture requires (N/r)×log_r N processing elements, and the area requirements increase quickly with N. Thus, most implementations are limited to N=8 or 16. FIGS. 5 and 6 show examples of the array architecture.
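The growth of the processing element count with N can be tabulated with a short sketch of the (N/r)×log_r N expression. The function names are hypothetical, and N is assumed to be an exact power of the radix r.

```python
def stage_count(n, r):
    """Number of FFT stages, log_r(N), computed with integer arithmetic."""
    if r < 2 or n < 1:
        raise ValueError("need r >= 2 and N >= 1")
    stages = 0
    while n > 1:
        if n % r:
            raise ValueError("N must be a power of r")
        n //= r
        stages += 1
    return stages

def array_pe_count(n, r):
    """One PE per butterfly in the signal flow graph: (N/r) * log_r(N)."""
    return (n // r) * stage_count(n, r)

for n in (8, 16, 64, 1024):
    print(n, array_pe_count(n, 2))
```

With radix-2 butterflies, N=16 already needs 32 processing elements and N=1024 needs 5120, which is consistent with the practical limit on N noted above.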
A problem with this architecture, in addition to the high area requirement, is that the input data are sequential and the output data are generated in parallel, leading to a low utilization of the processing elements (PEs). This problem can be overcome by supplying the inputs with N parallel data frames at the same time, skewed one clock cycle with respect to each other. This increases the utilization of the PEs to 100%.
Column Architecture
FIG. 7 is an example of the column architecture. In the column architecture, all the stages in the FFT SFG are collapsed into one column of N/r PEs. Assuming that a PE performs a butterfly operation in one clock cycle, the column of PEs computes one stage of the FFT in each clock cycle, and the entire FFT is computed in log_r N clock cycles. To simplify the switch network, a constant geometry version of the algorithm has been used in the architecture of FIG. 7. The data shuffling is identical from one stage to the next (compare FIG. 9).
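The idea of reusing one column of PEs with the same wiring at every stage can be modeled in software. The sketch below is a radix-2, Pease-style constant geometry FFT and is an assumption about the general technique, not a transcription of FIG. 7: the function names are hypothetical, each pass of the outer loop stands for one clock cycle of a column of N/2 PEs, and the input is loaded in bit-reversed order.

```python
import cmath

def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def column_fft_radix2(x):
    """Constant geometry radix-2 FFT. Returns (spectrum, cycles)."""
    n = len(x)
    stages = n.bit_length() - 1
    if n < 1 or (1 << stages) != n:
        raise ValueError("length must be a power of two")
    data = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    half = n // 2
    cycles = 0
    for s in range(stages):
        group = n >> (s + 1)   # number of interleaved sub-transforms
        out = [0j] * n
        for pe in range(half):  # one iteration per PE in the column
            # Same wiring every stage: PE i reads 2i, 2i+1; writes i, i+N/2.
            w = cmath.exp(-2j * cmath.pi * ((pe // group) * group) / n)
            a, b = data[2 * pe], data[2 * pe + 1]
            out[pe] = a + w * b
            out[pe + half] = a - w * b
        data = out
        cycles += 1
    return data, cycles

spectrum, cycles = column_fft_radix2([1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0, 1.5])
print(cycles)
```

Only the twiddle factor changes from stage to stage; the read and write addresses of each PE stay fixed, which is what keeps the switch network simple.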
The significant advantage of such an implementation is that the number of PEs is substantially reduced as compared to the array architecture. It has been argued that the area requirement is still high for large N, with increasing complexity in the switching network structure, which is true when the architecture is implemented with lower-radix FFTs.