Physical parameters such as light, sound, temperature, velocity and the like are converted to electrical signals by sensors. An electrical signal may be represented in the time domain as a variable that changes with time. Alternatively, a signal may be represented in the frequency domain as energy at specific frequencies. In the time domain, a sampled data digital signal is a series of data points corresponding to the original physical parameter. In the frequency domain, a sampled data digital signal is represented in the form of a plurality of discrete frequency components such as sine waves. A sampled data signal is transformed from the time domain to the frequency domain by the use of the Discrete Fourier Transform (DFT). Conversely, a sampled data signal is transformed back from the frequency domain into the time domain by the use of the Inverse Discrete Fourier Transform (IDFT).
The Discrete Fourier Transform is a fundamental digital signal-processing transformation that provides spectral information (frequency content) for analysis of signals. The DFT and IDFT permit a signal to be processed in the frequency domain. For example, frequency domain processing allows for the efficient computation of the convolution integral useful in linear filtering and for signal correlation analysis. Because direct computation of the DFT requires a large number of arithmetic operations, it is typically not used in real-time applications.
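As an illustration of frequency domain processing, the following minimal sketch computes a circular convolution by transforming both sequences, multiplying the spectra point by point, and transforming back. The function names dft and circular_convolve are hypothetical, the sign convention exp(-j2πkt/N) for the forward transform is an assumption, and the transform is computed directly rather than by a fast algorithm. The sketch assumes real-valued inputs of equal length.

```python
import cmath

def dft(x, inverse=False):
    """Direct DFT (or IDFT when inverse=True) of a sequence x."""
    n = len(x)
    sign = 1 if inverse else -1  # assumed convention: forward uses exp(-j...)
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # The 1/N scaling is applied on the inverse transform only.
    return [v / n for v in out] if inverse else out

def circular_convolve(a, b):
    """Circular convolution of two equal-length real sequences via the DFT."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    spec_a, spec_b = dft(a), dft(b)
    product = [p * q for p, q in zip(spec_a, spec_b)]
    # Imaginary parts are rounding noise for real inputs, so keep the real part.
    return [v.real for v in dft(product, inverse=True)]

# Convolving with a one-sample delay rotates the sequence by one position.
shifted = circular_convolve([1, 2, 3, 4], [0, 1, 0, 0])
```

Linear filtering with this approach would additionally require zero-padding both sequences so that the circular wrap-around does not corrupt the result.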
Over the past few decades, a group of algorithms collectively known as the Fast Fourier Transform (FFT) has found use in diverse applications, such as digital filtering, audio processing and spectral analysis for speech recognition. The FFT reduces the computational burden of the DFT so that it may be used for real-time signal processing. In addition, the fields of application for FFT analysis are continually expanding.
Computational Burden
Computational burden is a measure of the number of calculations required by an algorithm. The DFT (and IDFT) process starts with a number (N) of input data points and computes a number (also N) of output data points. The DFT function is a sum of products, i.e., repeated multiplication of two factors (data and twiddle coefficients) to form product terms, followed by the addition of the product terms to accumulate a sum of products (multiply-accumulate, or MAC, operations). The direct computation of the DFT requires a large number of such multiply-accumulate operations, especially as the number of input points N is made larger. Multiplications by the twiddle factors W_N^r dominate the arithmetic workload.
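The sum-of-products structure can be sketched as follows. This is a minimal illustration, not an implementation taken from any cited work: the function name direct_dft is hypothetical, and the twiddle factor is assumed to be exp(-j2πkt/N). The counter shows that the direct method performs one multiply-accumulate per input point per output point, i.e., N × N in total.

```python
import cmath

def direct_dft(x):
    """Direct N-point DFT. Returns (outputs, number of MAC operations)."""
    n = len(x)
    macs = 0
    outputs = []
    for k in range(n):                      # one output point per k
        acc = 0j
        for t in range(n):                  # one product term per input point
            twiddle = cmath.exp(-2j * cmath.pi * k * t / n)  # assumed convention
            acc += x[t] * twiddle           # multiply-accumulate
            macs += 1
        outputs.append(acc)
    return outputs, macs

# An 8-point transform needs 8 * 8 = 64 multiply-accumulate operations.
spectrum, macs = direct_dft([1, 0, 0, 0, 0, 0, 0, 0])
print(macs)  # 64
```

Doubling N quadruples the MAC count under this direct method, which is the growth that the FFT algorithms are designed to avoid.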
To reduce the computational burden of the DFT, researchers developed Fast Fourier Transform (FFT) algorithms that reduce the number of required mathematical operations. In one class of FFT methods, the reduction is based on the divide-and-conquer approach, whose principle is that a large problem is divided into smaller sub-problems that are easier to solve. In the FFT case, the input data are divided into subsets, the DFT of each subset is computed to form partial DFTs, and the DFT of the initial data is then reconstructed from the partial DFTs. See J. W. Cooley and J. W. Tukey, “An algorithm for machine calculation of complex Fourier series”, Math. Comput., Vol. 19, pp. 297–301, April 1965. There are two approaches to dividing (also called decimating) the larger calculation task into smaller calculation sub-tasks: decimation in frequency (DIF) and decimation in time (DIT).
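The divide-and-conquer principle can be sketched with a recursive radix-2 decimation-in-time transform. This is a textbook-style illustration under stated assumptions, not code from the cited paper: the function name fft_dit is hypothetical, the input length is assumed to be a power of two, and the twiddle factor is assumed to be exp(-j2πk/N).

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # Divide: split the input into even-indexed and odd-indexed subsets
    # and compute the partial DFT of each subset.
    even = fft_dit(x[0::2])
    odd = fft_dit(x[1::2])
    # Conquer: reconstruct the full DFT from the two partial DFTs.
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle times odd part
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

Each level of recursion halves the problem size, so a power-of-two length N passes through log2 N levels of N/2 combining steps each.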
Butterfly Implementation of the DFT
For example, an 8-point DFT can be divided into four 2-point partial DFTs as represented in FIG. 2. The basic 2-point partial DFT is calculated in a computational element called a radix-2 butterfly (or butterfly-computing element). There are butterfly computing elements corresponding to DIT and DIF implementations. Butterfly-computing elements are arranged in an array having stages of butterfly calculation. FIGS. 1 and 3 illustrate an FFT with an array architecture having one dedicated processing element for each butterfly.
As shown in FIGS. 1 and 3, data is fed to the input of the first stage 1002, 302 of butterfly-computing elements. After the first stage of butterfly computation is complete, the result is fed to the input of the next stage(s) 1004, 1006, 304, 306 of butterfly-computing element(s), and so on. In particular, in FIG. 3, four radix-2 butterflies operate in parallel on 8 input data points x(0)–x(7) in the first stage 302 to compute partial DFTs. The partial DFT outputs of the first stage 302 are combined in 2 additional stages 304, 306 to form the complete 8-point DFT output data X(0)–X(7).
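The staged arrangement can be sketched in software as an in-place radix-2 transform that processes one stage at a time. The sketch assumes a decimation-in-frequency ordering with natural-order input and bit-reversed intermediate output; whether the figures use this exact ordering is not stated here, so this is an illustrative assumption. The name fft_dif_stages is hypothetical.

```python
import cmath

def fft_dif_stages(x):
    """In-place radix-2 DIF FFT. Returns (outputs, butterflies counted per stage)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in x]
    per_stage = []
    span = n // 2                 # distance between the two inputs of a butterfly
    while span >= 1:
        count = 0
        for start in range(0, n, 2 * span):
            for j in range(span):
                a = data[start + j]
                b = data[start + j + span]
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))  # stage twiddle
                data[start + j] = a + b            # upper butterfly output
                data[start + j + span] = (a - b) * w  # lower butterfly output
                count += 1
        per_stage.append(count)
        span //= 2
    # The in-place DIF result is in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[rev] = data[i]
    return out, per_stage

# For 8 points there are 3 stages of 4 butterflies each.
spectrum, per_stage = fft_dif_stages([0, 1, 2, 3, 4, 5, 6, 7])
```

Within one stage no butterfly depends on another butterfly's output, which is why a hardware array can evaluate all butterflies of a stage in parallel.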
FIG. 4 shows a pipelined architecture implementation of the DFT. In the pipelined architecture, each row in the FFT is collapsed into one row of log_r N processing elements. In the column architecture of FIG. 2, all the stages in the FFT are collapsed into one column of N/r processing elements (PEs). Assuming that a PE performs a butterfly operation in one clock cycle, the column of PEs computes one stage of the FFT for each clock cycle, and the entire FFT is computed in log_r N clock cycles.
Communication Burden
A computational problem involving a large number of calculations may be performed one calculation at a time by using a single computing element. While such a solution uses a minimum of hardware, the time required to complete the calculation may be excessive. To speed up the calculation, a number of computing elements may be used in parallel to perform all or some of the calculations simultaneously. A massively parallel computation will tend to require an excessively large number of parallel-computing elements. Even so, parallel computation is limited by the communication burden. For example, a large number of data and constants may have to be retrieved from memory over a finite capacity data bus. In addition, intermediate results in one parallel-computing element may have to be temporarily stored in memory, then later retrieved from memory and communicated to another parallel-computing element. The communication burden of an algorithm is a measure of the amount of data that must be moved (written and read) to and from memory, as well as between computing elements.
The FFT algorithm is especially memory access and storage intensive. For example, in order to compute a radix-4 DIF FFT butterfly, four pieces of data and three twiddle coefficients are read from memory, and four pieces of resultant data are written back into memory. In a prior art N-point FFT calculation, there are N/r butterflies per stage (where r is the radix) for log_r N stages. Accordingly, it is desired to provide an efficient scheme by which input data, output data and twiddle coefficients are stored and retrieved from memory.
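The memory traffic implied by these figures can be tallied with a short sketch. The function name fft_memory_traffic is hypothetical. Generalizing the radix-4 case to r data reads, r − 1 twiddle reads and r data writes per radix-r butterfly is an assumption made for illustration, as is the assumption that N is an exact power of r.

```python
def fft_memory_traffic(n, r):
    """Count memory accesses for an n-point radix-r FFT (n a power of r)."""
    if r < 2 or n < 1:
        raise ValueError("need r >= 2 and n >= 1")
    # Compute the number of stages, log_r(n), with integer arithmetic.
    stages = 0
    m = n
    while m > 1:
        if m % r:
            raise ValueError("n must be a power of r")
        m //= r
        stages += 1
    butterflies = (n // r) * stages       # N/r butterflies in each stage
    return {
        "stages": stages,
        "butterflies": butterflies,
        "data_reads": butterflies * r,            # r inputs per butterfly
        "twiddle_reads": butterflies * (r - 1),   # assumed r - 1 twiddles
        "data_writes": butterflies * r,           # r outputs per butterfly
    }

# A 1024-point radix-4 FFT has 5 stages of 256 butterflies each.
traffic = fft_memory_traffic(1024, 4)
```

Under these assumptions a 1024-point radix-4 transform performs 1280 butterflies and more than 14,000 individual memory accesses, which indicates why the storage and retrieval scheme matters.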
Different structures for the dedicated FFT have been proposed, such as the Common Factor Algorithm (CFA) [1], the Prime Factor Algorithm (PFA) [1], the Split Radix Algorithm (SRFT) [2], [3] and [4], the Winograd Fourier Transform Algorithm (WFTA) [5] and [6], and the Mixed Radix Algorithm [7], as cited below.
[1] T. Widhe, “Efficient Implementation of FFT Processing Elements”, Linköping Studies in Science and Technology, Thesis No. 619, Linköping University, Sweden, June 1997.
[2] H. V. Sorensen, M. T. Heideman, and C. S. Burrus, “On Computing the Split Radix FFT”, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 1, pp. 152–156, February 1986.
[3] M. Richards, “On Hardware Implementation of the Split-Radix FFT”, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-36, No. 10, pp. 1575–1581, October 1988.
[4] P. Duhamel and H. Hollmann, “Split Radix FFT Algorithm”, Electronics Letters, Vol. 20, No. 1, pp. 14–16, January 1984.
[5] H. F. Silverman, “An Introduction to Programming the Winograd Fourier Transform Algorithm (WFTA)”, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 2, pp. 152–165, April 1977.
[6] S. Winograd, “On Computing the Discrete Fourier Transform”, Proc. Nat. Acad. Sci. USA, Vol. 73, pp. 1005–1006, April 1976.
[7] R. C. Singleton, “An Algorithm for Computing the Mixed Radix Fast Fourier Transform”, IEEE Transactions on Audio and Electroacoustics, Vol. AU-17, No. 2, pp. 93–103, June 1969.
However, none of the above FFT implementations proposes an efficient way, in a parallel structure, to read the twiddle factor coefficients and the input data from memory or to write the output data to memory.
Address Generator
In an FFT implementation, an address generator is typically used to compute the addresses (locations in memory) at which input data, output data and twiddle coefficients are stored and from which they are retrieved. For example, in FIG. 5 an apparatus for computing the fast Fourier transform comprises an array of radix-r butterfly processing elements 512, a memory 502 and an address generator 506. The memory 502 stores input data and twiddle coefficients used by the radix-r butterflies 512. The computed FFT output data from the radix-r butterflies 512 are stored in memory 502. An input/output controller 504 controls the process of storing to and retrieving from memory 502.
The time required to read input data and twiddle coefficients from the memory 502, and write results back to memory 502 affects the overall time to compute the FFT. In addition to memory access time, the time required by the address generator 506 to compute the desired address itself further lengthens the overall time to compute the FFT. The design of the address generator 506 has a substantial role in determining the overall time for the computation of the FFT.
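To make the role of an address generator concrete, the following minimal sketch enumerates, for each butterfly of an in-place radix-2 decimation-in-frequency FFT, the two data addresses and the index of the twiddle coefficient in a table of N/2 entries. It is a generic textbook scheme offered as an assumption for illustration; it is not the address generator 506 of FIG. 5, nor the technique of any cited patent. The name radix2_dif_addresses is hypothetical.

```python
def radix2_dif_addresses(n):
    """Yield (stage, addr_a, addr_b, twiddle_index) for each radix-2 butterfly.

    Assumes in-place DIF ordering and a twiddle table holding W_N^k
    for k = 0 .. n/2 - 1, with n a power of two.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    span = n // 2          # address distance between the two butterfly inputs
    stage = 0
    while span >= 1:
        stride = n // (2 * span)   # step through the twiddle table at this stage
        for start in range(0, n, 2 * span):
            for j in range(span):
                yield stage, start + j, start + j + span, j * stride
        span //= 2
        stage += 1

# List the addresses for the first stage of an 8-point transform.
first_stage = [entry for entry in radix2_dif_addresses(8) if entry[0] == 0]
```

In this scheme each address is obtained with a few integer additions and one multiplication per butterfly; in hardware, the cost of producing such addresses adds to the memory access time discussed above.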
Additionally, several prior art address generator techniques have been proposed. See U.S. Pat. No. 6,035,313 to Marchant, U.S. Pat. No. 5,491,652 to Luo et al., U.S. Pat. No. 5,091,875 to Wong et al. and U.S. Pat. No. 4,899,301 to Nishitani et al.