The present invention relates to the field of Fast Fourier Transform analysis. In particular, the present invention relates to a butterfly-processing element (BPE) arranged as a plurality of parallel computing elements (comprising complex multipliers and adders) with identical structure and adapted for use to implement a Fast Fourier Transform (FFT) butterfly computation.
Signal sensors measure a parameter of the physical world and convert the measured parameter to electrical form for transmission and processing. Typical examples are sound and video. Other sensed physical parameters, such as seismic activity, air temperature, vehicle position, velocity or acceleration, and the like, form the basis of electrical signals.
A signal may be represented in the time domain as a variable that changes with time. Alternatively, a signal may be represented in the frequency domain as energy at specific frequencies. In the time domain, a sampled data digital signal is a series of data points corresponding to the original physical parameter. In the frequency domain, a sampled data digital signal is represented in the form of a plurality of discrete frequency components such as sine waves. A sampled data signal is transformed from the time domain to the frequency domain by the use of the Discrete Fourier Transform (DFT). Conversely, a sampled data signal is transformed back from the frequency domain into the time domain by the use of the Inverse Discrete Fourier Transform (IDFT).
The Discrete Fourier Transform is a fundamental digital signal-processing transformation used in many applications. Frequency analysis provides spectral information about signals that are further examined or used in further processing. The DFT and IDFT permit a signal to be processed in the frequency domain. For example, frequency domain processing allows for the efficient computation of the convolution integral useful in linear filtering and for signal correlation analysis. Since the direct computation of the DFT requires a large number of arithmetic operations, the direct computation of the DFT is typically not used in real time applications.
Over the past few decades, a group of algorithms collectively known as the Fast Fourier Transform (FFT) has found use in diverse applications, such as digital filtering, audio processing and spectral analysis for speech recognition. The FFT reduces the computational burden of the DFT so that it may be used for real-time signal processing. In addition, the fields of application for FFT analysis are continually expanding to include Cepstrum analysis, image processing and video coding, radar and sonar processing including target detection, seismic analysis, advanced frequency division modulation schemes such as OFDM, and power system reliability.
Computational Burden
Computational burden is a measure of the number of calculations required by an algorithm. The DFT process starts with a number of input data points and computes a number of output data points. For example, an 8-point DFT may have an 8-point output. See FIG. 9. The DFT function is a sum of products, i.e., multiplications to form product terms followed by the addition of product terms to accumulate a sum of products (multiply-accumulate, or MAC operations). See equation (1) below. The direct computation of the DFT requires a large number of such multiply-accumulate mathematical operations, especially as the number of input points is made larger. Multiplications by the twiddle factors W_N^r dominate the arithmetic workload.
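The sum-of-products structure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the equation (1) of this document: the function name direct_dft and the sign convention W_N = exp(-2*pi*j/N) are assumptions made for the example. The counter shows that a direct N-point DFT performs N*N complex multiply-accumulate operations.

```python
import cmath


def direct_dft(x):
    """Direct DFT as a sum of products.

    Assumed convention (not taken from the source): W_N = exp(-2j*pi/N),
    so X[k] = sum over n of x[n] * W_N^(n*k).
    Returns the output points and the number of MAC operations performed.
    """
    n_points = len(x)
    mac_count = 0
    output = []
    for k in range(n_points):
        acc = 0j
        for n in range(n_points):
            twiddle = cmath.exp(-2j * cmath.pi * n * k / n_points)
            acc += x[n] * twiddle  # one complex multiply-accumulate
            mac_count += 1
        output.append(acc)
    return output, mac_count


# An 8-point DFT needs 8 * 8 = 64 complex MAC operations when computed directly.
spectrum, macs = direct_dft([1, 0, 0, 0, 0, 0, 0, 0])
print(macs)
```

Doubling the number of input points quadruples the MAC count, which is the growth that the FFT algorithms discussed next are meant to avoid.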
To reduce the computational burden imposed by the computationally intensive DFT, previous researchers developed the Fast Fourier Transform (FFT) algorithms in which the number of required mathematical operations is reduced. In one class of FFT methods, the computational burden is reduced based on the divide-and-conquer approach. The principle of the divide-and-conquer approach is that a large problem is divided into smaller sub-problems that are easier to solve. In the FFT case, the division into sub-problems means that the input data are divided into subsets for which the DFT is computed to form partial DFTs. Then the DFT of the initial data is reconstructed from the partial DFTs. See J. W. Cooley and J. W. Tukey, "An algorithm for machine calculation of complex Fourier series", Math. Comput., Vol. 19, pp. 297-301, April 1965. There are two approaches to dividing (also called decimating) the larger calculation task into smaller calculation sub-tasks: decimation in frequency (DIF) and decimation in time (DIT).
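As a minimal sketch of the divide-and-conquer idea, the following Python function implements a textbook recursive radix-2 decimation-in-time FFT. It is a generic illustration rather than the method of this document; the function name fft_radix2_dit and the convention W_N = exp(-2*pi*j/N) are assumptions. The input is split into even-indexed and odd-indexed subsets, a partial DFT is computed for each, and the full DFT is reconstructed from the two partial results.

```python
import cmath


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT (textbook sketch).

    Assumes len(x) is a power of two and W_N = exp(-2j*pi/N).
    """
    n_points = len(x)
    if n_points == 1:
        return [complex(x[0])]
    if n_points % 2 != 0:
        raise ValueError("length must be a power of two")

    # Divide: partial DFTs of the even-indexed and odd-indexed samples.
    even = fft_radix2_dit(x[0::2])
    odd = fft_radix2_dit(x[1::2])

    # Conquer: reconstruct the full DFT from the two partial DFTs.
    half = n_points // 2
    output = [0j] * n_points
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n_points) * odd[k]
        output[k] = even[k] + t         # upper output of a radix-2 butterfly
        output[k + half] = even[k] - t  # lower output of a radix-2 butterfly
    return output
```

Each pass through the reconstruction loop performs one radix-2 butterfly, so the recursion depth corresponds to the number of butterfly stages.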
Butterfly Implementation of the DFT
In the FFT, an 8-point DFT is divided into 2-point partial DFTs. The basic 2-point partial DFT is calculated in a computational element called a radix-2 butterfly (or butterfly-computing element) as represented in FIG. 12. A radix-2 butterfly has 2 inputs and 2 outputs, and computes a 2-point DFT. FIG. 13 shows an FFT using 12 radix-2 butterflies to compute an 8-point DFT. Butterfly-computing elements are arranged in stages. There are three stages 1302, 1304 and 1306 of butterfly calculation. Data, xn, is input to the butterfly-computing elements in the first stage 1302. After the first stage 1302 of butterfly computation is complete, the result is input to the next stage(s) of butterfly-computing element(s).
Four radix-2 butterflies operate in parallel in the first stage 1302. The outputs of the first stage 1302 are combined in 2 additional stages 1304, 1306 to form a complete 8-point DFT output, Xn. The output of the second stage 1304 of radix-2 butterflies is coupled to a third stage 1306 of four radix-2 butterflies. The output of the third stage 1306 of four radix-2 butterflies is the final 8-point DFT function, Xn.
FIG. 14 shows an FFT using 32 radix-2 butterflies to compute a 16-point DFT. There are 4 stages of butterfly calculation. Eight radix-2 butterflies operate in parallel in the first stage 1402 where 2-point partial DFTs are calculated. The outputs of the first stage are combined in 3 additional stages 1403, 1404 and 1406 to form a complete 16-point DFT output. The output of the second stage 1403 of 8 radix-2 butterflies is coupled to a third stage 1404 of 8 radix-2 butterflies. The output of the third stage 1404 of 8 radix-2 butterflies is coupled to a fourth stage 1406 of 8 radix-2 butterflies, the output of which is the final 16-point DFT function.
Higher order butterflies may be used. See FIG. 15, which uses 8 radix-4 butterflies in 2 stages 1502, 1502 to compute a 16-point DFT. In general, a radix-r butterfly is a computing element that has r input points and calculates a partial DFT of r output points.
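The butterfly counts quoted for FIGS. 13, 14 and 15 follow from a simple rule: an N-point FFT built from radix-r butterifies of a single radix has log_r(N) stages, each containing N/r butterflies. The short Python sketch below checks the three cases given in the text. The helper name butterfly_counts is hypothetical, and the sketch assumes N is an exact power of r (mixed-radix arrangements are not covered).

```python
def butterfly_counts(n_points, radix):
    """Return (stages, butterflies per stage, total butterflies).

    Assumes n_points is an exact power of radix (single-radix FFT).
    """
    stages = 0
    remaining = n_points
    while remaining > 1:
        if remaining % radix != 0:
            raise ValueError("n_points must be a power of radix")
        remaining //= radix
        stages += 1
    per_stage = n_points // radix
    return stages, per_stage, stages * per_stage


print(butterfly_counts(8, 2))   # (3, 4, 12): 8-point DFT, radix-2
print(butterfly_counts(16, 2))  # (4, 8, 32): 16-point DFT, radix-2
print(butterfly_counts(16, 4))  # (2, 4, 8): 16-point DFT, radix-4
```

For the same 16-point DFT, moving from radix-2 to radix-4 halves the number of stages, at the cost of a more complex butterfly.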
Communication Burden
A computational problem involving a large number of calculations may be performed one calculation at a time by using a single computing element. While such a solution uses a minimum of hardware, the time required to complete the calculation may be excessive. To speed up the calculation, a number of computing elements may be used in parallel to perform all or some of the calculations simultaneously. A massively parallel computation will tend to require an excessively large number of parallel computing elements. Even so, parallel computation is limited by the communication burden. For example, a large number of data and constants may have to be retrieved from memory over a finite capacity data bus. In addition, intermediate results from one stage may have to be completed before beginning a later stage calculation. The communication burden of an algorithm is a measure of the amount of data that must be moved, and the number of calculations that must be performed in sequence (i.e., that cannot be performed in parallel).
In particular, in an FFT butterfly implementation of the DFT, some of the butterfly calculations cannot be performed simultaneously, i.e., in parallel. Subsequent stages of butterflies cannot begin calculations until earlier stages of butterflies have completed prior calculations. The communication burden between stages of butterfly calculation cannot therefore be reduced through the use of parallel computation. While the FFT has a smaller computational burden as compared to the direct computation of the DFT, the butterfly implementation of the FFT has a greater communication burden.
Within the butterfly-computing element itself (i.e., within the radix-r butterfly), there are similar considerations of computational burden versus communication burden. That is, within the radix-r butterfly-computing element itself, not all the required calculations can be performed simultaneously by parallel computing elements. Intermediate results from one calculation are often required for a later computation. Thus, while the FFT butterfly implementation of the DFT reduces the computational burden, it does not decrease the communication burden.
Higher Radix Butterflies
Using a higher radix butterfly can reduce the communication burden. For example, a 16-point DFT may be computed in two stages of radix-4 butterflies as shown in FIG. 15, as compared to three stages in FIG. 13 or four stages in FIG. 14. Higher radix FFT algorithms are attractive for hardware implementation because of the reduced net number of complex multiplications (including trivial ones) and the reduced number of stages, which reduces the memory access rate requirement. The number of stages corresponds to the amount of global communication and/or memory accesses in an implementation. Thus, reducing the number of stages reduces the communication burden. FIG. 10 shows a mixed radix butterfly implementation of the FFT with two stages of butterfly computation. FIG. 11 shows a mixed radix butterfly implementation of the FFT with three stages of butterfly computation.
Typically, the higher order radix-r butterflies are not used, even though such butterflies will have a smaller net number of complex multiplications and such higher radix butterflies reduce the communication load. The reason higher order radix-r butterflies have not been more commonly used is that the complexity of the radix-r butterfly increases rapidly for higher radices. This increased complexity makes higher order radix-r butterflies difficult to implement. As a result, the vast majority of FFT processor implementations have used the radix-2 or radix-4 versions of the FFT algorithm. Therefore, in spite of the attractiveness of using a higher order radix butterfly in an FFT algorithm, hardware implementations of FFT algorithms with radices higher than radix-4 are rare.
Butterfly-processing Element (BPE) and the Radix-r Butterfly
The present invention is embodied in a butterfly-processing element (BPE), or engine, that can be utilized in an array of butterfly-processing elements (BPEs) each having substantially identical structures, to reduce the complexity in implementing radix-r FFT calculations. The approach is applicable to butterfly implementations in both DIF and DIT FFT algorithms.
In particular, the present invention is embodied in a BPE useful in building a radix-r butterfly with fewer calculations than that required in conventional implementations, thus reducing the computational burden. In addition, the present invention is embodied in a BPE, which results in a radix-r butterfly with a greater degree of parallelism and reduced number of calculation phases internal to the radix-r butterfly, thus increasing the radix-r butterfly-processing speed.
Furthermore, the use of the BPE of the present invention permits the implementation of higher order radix-r butterflies, which are useful in implementing FFT algorithms with a reduced number of stages, and therefore a reduced communication burden. Because the BPE is a basic building block for all radix-r butterflies, an assembly of repeating BPEs can be used to easily create higher order radix-r butterflies. Furthermore, while the advantages of the present approach over the prior art apply to all radix-r butterflies, the advantages are particularly apparent when applied to higher order radix-r butterflies (i.e., radix-8 and above).
Mathematical Basis
A mathematical term that is a function of r input points and provides a single output point is the basis for the design of the present BPE. To provide the insight forming the basis of the present BPE, the basic DFT equation is factored to group the variables used in multiplication (and simultaneously accessed from memory) into one matrix. In particular, starting from the basic DFT equations, the adder matrix is factored and combined with the twiddle matrix to form a single phase of calculation. By grouping all the multiply calculations into one calculation phase and all the addition calculations into the remaining calculation phases, the total number of calculations is reduced and the degree of parallelism is increased.
For a radix-r DIF butterfly, r identical BPEs are arranged in parallel. Each of the r BPEs is substantially identical to the others and operates in parallel using the same instructions, accessing the necessary set of multiplier coefficients from memory at the same time. The outputs of the r identical BPEs form the DFT's r output points.
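The following Python sketch illustrates one possible reading of the DIF arrangement described above; it is not taken from the specification. It assumes that each BPE multiplies the r input points by r precombined coefficients and sums the products, and that the combined coefficient for output l and input m is W_N^(l*m*N/r + e_l), where e_l is the exponent of the twiddle factor that a conventional DIF butterfly would apply to output l. The names bpe, combined_dif_coefficients and radix_r_dif_butterfly, the exponent list, and the convention W_N = exp(-2*pi*j/N) are all assumptions for illustration.

```python
import cmath


def bpe(inputs, coefficients):
    """One butterfly-processing element: r complex multiplies, then a sum."""
    return sum(c * x for c, x in zip(coefficients, inputs))


def combined_dif_coefficients(radix, twiddle_exponents, n_points):
    """Adder matrix merged with the output twiddles (assumed form).

    Entry (l, m) is W_N^(l*m*N/r + twiddle_exponents[l]),
    with W_N = exp(-2j*pi/N) as an assumed convention.
    """
    def w(exponent):
        return cmath.exp(-2j * cmath.pi * (exponent % n_points) / n_points)

    step = n_points // radix
    return [
        [w(l * m * step + twiddle_exponents[l]) for m in range(radix)]
        for l in range(radix)
    ]


def radix_r_dif_butterfly(inputs, twiddle_exponents, n_points):
    """Radix-r DIF butterfly built from r identical BPEs, one per output."""
    radix = len(inputs)
    rows = combined_dif_coefficients(radix, twiddle_exponents, n_points)
    # Each BPE is independent of the others, so the r calls could run in parallel.
    return [bpe(inputs, row) for row in rows]
```

Under these assumptions, every output is produced by the same multiply-then-sum structure, and only the coefficient row differs from one BPE to the next.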
For a radix-r DIT butterfly, (r-1) identical BPEs are arranged in parallel. Each of the (r-1) identical BPEs is substantially identical to the others and operates in parallel using the same instructions and accessing the necessary set of multiplier constants from memory at the same time. The outputs of the (r-1) identical BPEs form the DFT as (r-1) of the r output points of the butterfly. The remaining output point (X0) of the DFT is formed as the sum of the r input points.
Trivial multiplications encountered during the execution of particular butterflies may be avoided by simple checks on the coefficient addresses. Avoiding trivial multiplications reduces the computational load of particular butterflies.
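A check on the coefficient address can be sketched as follows. The sketch assumes that the address of a coefficient equals the exponent of the twiddle factor W_N^address, and that W_N = exp(-2*pi*j/N); under those assumptions, addresses 0, N/4, N/2 and 3N/4 correspond to the trivial factors 1, -j, -1 and j, which need no multiplier. The function name multiply_by_twiddle and the table argument are hypothetical.

```python
import cmath


def multiply_by_twiddle(x, address, n_points, table):
    """Multiply x by W_N^address, skipping the multiplier for trivial factors.

    Assumes table[address] holds W_N^address with W_N = exp(-2j*pi/N).
    """
    address %= n_points
    if address == 0:
        return x                          # W^0 = 1
    if n_points % 2 == 0 and address == n_points // 2:
        return -x                         # W^(N/2) = -1
    if n_points % 4 == 0 and address == n_points // 4:
        return complex(x.imag, -x.real)   # W^(N/4) = -j: swap and negate
    if n_points % 4 == 0 and address == 3 * n_points // 4:
        return complex(-x.imag, x.real)   # W^(3N/4) = +j: swap and negate
    return x * table[address]             # general case: full complex multiply


# Usage with a 16-entry coefficient table.
table = [cmath.exp(-2j * cmath.pi * e / 16) for e in range(16)]
print(multiply_by_twiddle(2 + 3j, 4, 16, table))  # (3-2j), no multiplier used
```

The trivial cases reduce to sign changes and a swap of real and imaginary parts, so only the remaining addresses cost a complex multiplication.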
An FFT implementation is composed of a plurality of radix-r butterflies with identical BPEs and a systematic addressing scheme for accessing the corresponding multiplier coefficients. Each radix-r butterfly utilizes the basic computing unit (BPE), with r (or r-1) complex multipliers in parallel to implement each of the butterfly computation's outputs. There is a simple mapping relationship from the three indices (FFT stage, radix-r butterfly, butterfly-processing element) to the addresses of the needed multiplier coefficients. The simple mapping from the three indices to the addresses of the necessary multiplier coefficients accommodates the complexity of higher order radix and mixed radix butterfly implementations of the DFT.
In a multiprocessor environment, many of the calculations are performed in parallel to greatly increase processing speed. Even for a single-processor environment, the invented architecture results in a reduced time delay for the complete FFT.
By using the BPE of the present invention in the implementation of the radix-r butterfly, an FFT implementation is achieved with a reduced number of calculations and a reduced number of stages of calculations. In addition, the amount of parallelism, both within the butterfly-processing element (BPE) calculation phases and within the overall FFT algorithm butterfly stages permits the use of parallel processing to increase overall FFT calculation speed.
Another aspect of the present invention is the address generator(s), which are used to access or store the twiddle factors and the input and output data. In particular, the value of a specified twiddle factor is stored in a virtual memory location equal to its exponent (power). In other words, the value of twiddle factor w^0 is stored into the twiddle factor memory at the address location 0 (as real or virtual address). In a similar manner, the input and output data are accessed or stored according to their respective indices generated by the reading and writing address generators.
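The exponent-as-address convention for the twiddle factor memory can be sketched as a simple lookup table. The function names build_twiddle_memory and read_twiddle are hypothetical, and the convention w = exp(-2*pi*j/N) is an assumption; the mapping from FFT stage, butterfly and BPE indices to exponents is not reproduced here because it is not given in this section.

```python
import cmath


def build_twiddle_memory(n_points):
    """Store w^e at address e, for e = 0 .. N-1 (w = exp(-2j*pi/N) assumed)."""
    return [cmath.exp(-2j * cmath.pi * e / n_points) for e in range(n_points)]


def read_twiddle(memory, exponent):
    """Fetch w^exponent; exponents wrap modulo N because w^N = 1."""
    return memory[exponent % len(memory)]


memory = build_twiddle_memory(8)
print(read_twiddle(memory, 0))  # value stored at address 0 is w^0 = 1
```

Because the address is the exponent itself, an address generator only needs to produce integer exponents, and the wrap-around handles any exponent that exceeds N-1.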