The present invention relates to the field of Fast Fourier Transform analysis. In particular, the present invention relates to a parallel processing architecture adapted for use in a pipelined Fast Fourier Transform method and apparatus.
Physical parameters such as light, sound, temperature, velocity and the like are converted to electrical signals by sensors. An electrical signal may be represented in the time domain as a variable that changes with time. Alternatively, a signal may be represented in the frequency domain as energy at specific frequencies. In the time domain, a sampled data digital signal is a series of data points corresponding to the original physical parameter. In the frequency domain, a sampled data digital signal is represented in the form of a plurality of discrete frequency components such as sine waves. A sampled data signal is transformed from the time domain to the frequency domain by the use of the Discrete Fourier Transform (DFT). Conversely, a sampled data signal is transformed back from the frequency domain into the time domain by the use of the Inverse Discrete Fourier Transform (IDFT).
The Discrete Fourier Transform is a fundamental digital signal-processing transformation used in many applications. Frequency analysis provides spectral information about signals that are further examined or used in further processing. The DFT and IDFT permit a signal to be processed in the frequency domain. For example, frequency domain processing allows for the efficient computation of the convolution integral useful in linear filtering and for signal correlation analysis. Since the direct computation of the DFT requires a large number of arithmetic operations, the direct computation of the DFT is typically not used in real time applications.
Over the past few decades, a group of algorithms collectively known as the Fast Fourier Transform (FFT) has found use in diverse applications, such as digital filtering, audio processing and spectral analysis for speech recognition. The FFT reduces the computational burden of the DFT so that it may be used for real-time signal processing. In addition, the fields of application for FFT analysis are continually expanding.
Computational Burden
Computational burden is a measure of the number of calculations required by an algorithm. The DFT process starts with a number of input data points and computes a number of output data points. For example, an 8-point DFT may have an 8-point output. See FIG. 1A. The DFT function is a sum of products, i.e., multiplications to form product terms followed by the addition of product terms to accumulate a sum of products (multiply-accumulate, or MAC operations). See equation (1) below. The direct computation of the DFT requires a large number of such multiply-accumulate operations, especially as the number of input points is made larger. Multiplications by the twiddle factors w_N^r dominate the arithmetic workload.
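As an illustration of the sum-of-products structure described above, the following is a minimal Python sketch of a direct DFT that also counts its complex multiply-accumulate operations. The function name direct_dft and the forward-transform sign convention exp(-2*pi*i*m*k/N) are assumptions made for this example and are not taken from the specification.

```python
import cmath


def direct_dft(x):
    """Directly compute the N-point DFT of x.

    Returns (spectrum, macs), where macs is the number of complex
    multiply-accumulate operations performed.
    Assumption: forward transform uses twiddle exp(-2*pi*i*m*k/N).
    """
    n = len(x)
    macs = 0
    spectrum = []
    for k in range(n):          # one output point per k
        acc = 0j
        for m in range(n):      # every output depends on every input
            twiddle = cmath.exp(-2j * cmath.pi * m * k / n)
            acc += x[m] * twiddle   # one multiply-accumulate
            macs += 1
        spectrum.append(acc)
    return spectrum, macs
```

For an 8-point input the sketch performs 8 x 8 = 64 complex MAC operations, and the count grows as N squared, which is the burden the FFT algorithms are intended to reduce.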
To reduce the computational burden imposed by the computationally intensive DFT, previous researchers developed the Fast Fourier Transform (FFT) algorithms, in which the number of required mathematical operations is reduced. In one class of FFT methods, the computational burden is reduced based on the divide-and-conquer approach. The principle of the divide-and-conquer approach is that a large problem is divided into smaller sub-problems that are easier to solve. In the FFT case, the division into sub-problems means that the input data are divided into subsets for which the DFT is computed to form partial DFTs. Then the DFT of the initial data is reconstructed from the partial DFTs. See J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series", Math. Comput., Vol. 19, pp. 297-301, April 1965. There are two approaches to dividing (also called decimating) the larger calculation task into smaller calculation sub-tasks: decimation in frequency (DIF) and decimation in time (DIT).
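The divide-and-conquer principle can be sketched, under stated assumptions, as a recursive radix-2 decimation-in-time FFT. This is the textbook formulation for power-of-two lengths, offered only as an illustration; the function name fft_radix2_dit and the sign convention of the twiddle factor are assumptions of this example.

```python
import cmath


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of 2)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    # Divide: partial DFTs of the even-indexed and odd-indexed subsets.
    even = fft_radix2_dit(x[0::2])
    odd = fft_radix2_dit(x[1::2])
    # Combine: one radix-2 butterfly per pair of outputs.
    out = [0j] * n
    half = n // 2
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle multiply
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

Each level of recursion halves the problem size, so an N-point transform needs log2(N) levels of N/2 butterflies each rather than N squared multiply-accumulate operations.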
Butterfly Implementation of the DFT
For example, an 8-point DFT can be divided into four 2-point partial DFTs. The basic 2-point partial DFT is calculated in a computational element called a radix-2 DIT butterfly (or butterfly-computing element) as represented in FIG. 2A1. Similarly, FIG. 2A2 shows the function of a radix-2 DIF butterfly. A radix-2 butterfly has 2 inputs and 2 outputs, and computes a 2-point DFT. FIG. 2B shows an FFT using 12 radix-2 butterflies to compute an 8-point DFT. Butterfly-computing elements are arranged in stages. There are three stages 1302, 1304 and 1306 of butterfly calculation. Data, x_n, is fed to the input of the butterfly-computing elements in the first stage 1302. After the first stage 1302 of butterfly computation is complete, the result is fed to the input of the next stage(s) of butterfly-computing element(s), and so on.
In particular, four radix-2 butterflies operate in parallel in the first stage 1302 to compute four 2-point partial DFTs, producing 8 partial DFT outputs. The 8 outputs of the first stage 1302 are combined in 2 additional stages 1304, 1306 to form a complete 8-point DFT output, X_n. Specifically, the second stage 1304 of 4 radix-2 butterflies and the third stage 1306 of 4 radix-2 butterflies comprise a two-stage combination phase in which 8 radix-2 butterflies responsive to the 8 partial DFT outputs form the final 8-point DFT function, X_n.
FIG. 2C shows an FFT using 32 radix-2 butterflies to compute a 16-point DFT. There are 4 stages of butterfly calculation. Eight radix-2 butterflies operate in parallel in the first stage 1402, where 2-point partial DFTs are calculated. The outputs of the first stage are combined in 3 additional combination stages 1403, 1404 and 1406 to form a complete 16-point DFT output. The output of the second stage 1403 of 8 radix-2 butterflies is coupled to a third stage 1404 of 8 radix-2 butterflies. The output of the third stage 1404 of 8 radix-2 butterflies is coupled to a fourth stage 1406 of 8 radix-2 butterflies, the output of which is the final 16-point DFT function. The combination stages 1403, 1404 and 1406 together comprise a combination phase in which 24 radix-2 butterflies responsive to the 16 partial DFT outputs (from the first stage 1402) form the final 16-point DFT function, X_n.
Higher-order butterflies may be used. See FIG. 2D, which uses 8 radix-4 butterflies in 2 stages 1502, 1504 to compute a 16-point DFT. In general, a radix-r butterfly is a computing element that has r input points and calculates a partial DFT of r output points. In FIG. 2D, four radix-4 butterflies compute 16 partial DFT outputs in a first stage 1502. The combination phase 1504 comprises four radix-4 butterflies responsive to the 16 partial DFT outputs (from the first stage 1502) to form the final 16-point DFT function, X_n.
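The butterfly counts quoted for FIGS. 2B, 2C and 2D follow from the usual relation that an N-point radix-r FFT has log_r(N) stages of N/r butterflies each. The short Python sketch below reproduces those counts; the helper name butterfly_counts is hypothetical and N is assumed to be an exact power of r.

```python
def butterfly_counts(n, r):
    """Return (stages, butterflies_per_stage, total_butterflies)
    for an n-point FFT built from radix-r butterflies.

    Assumption: n is an exact power of r.
    """
    stages = 0
    m = n
    while m > 1:
        if m % r:
            raise ValueError("n must be a power of r")
        m //= r
        stages += 1
    per_stage = n // r
    return stages, per_stage, stages * per_stage


print(butterfly_counts(8, 2))    # (3, 4, 12)  as in FIG. 2B
print(butterfly_counts(16, 2))   # (4, 8, 32)  as in FIG. 2C
print(butterfly_counts(16, 4))   # (2, 4, 8)   as in FIG. 2D
```

Raising the radix from 2 to 4 halves the number of stages for the 16-point case, at the cost of a more complex butterfly-computing element.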
Communication Burden
A computational problem involving a large number of calculations may be performed one calculation at a time by using a single computing element. While such a solution uses a minimum of hardware, the time required to complete the calculation may be excessive. To speed up the calculation, a number of computing elements may be used in parallel to perform all or some of the calculations simultaneously. A massively parallel computation will tend to require an excessively large number of parallel-computing elements. Even so, parallel computation is limited by the communication burden. For example, a large number of data and constants may have to be retrieved from memory over a finite capacity data bus. In addition, intermediate results in one parallel-computing element may have to be communicated to another parallel-computing element. The communication burden of an algorithm is a measure of the amount of data that must be moved, and the number of calculations that must be performed in sequence (i.e., that cannot be performed in parallel).
In particular, in a butterfly implementation of the DFT, some of the butterfly calculations cannot be performed simultaneously, i.e., in parallel. Subsequent stages of butterflies cannot begin calculations until earlier stages of butterflies have completed prior calculations. Also, the connections between butterflies in each stage to butterflies in the other stages impose a heavy communication burden between the butterfly computation stages. Thus, parallel implementations of the butterfly DFT are hampered by a heavy communication burden between butterflies.
The heavy communication burden between the butterfly stages in the prior art results from structuring the butterfly implementation such that the first butterfly stage computes partial DFTs over the input data, and the latter butterfly stages combine the partial DFTs. In accordance with the present invention, partial DFTs are computed in a plurality of separate parallel processors and then combined in a single stage of combination for the parallel processing algorithm. Also, in accordance with the present invention for the multi-stage parallel processing algorithm, partial DFTs are computed in a plurality of separate parallel circuit boards, each of which contains a plurality of separate parallel chips, each of which in turn contains a plurality of separate parallel processors. The output data is obtained by first combining the outputs of the plurality of separate parallel processors, then combining the outputs of the plurality of separate parallel chips, and finally combining the outputs of the plurality of separate parallel circuit boards.
The present architecture is a reorganization of the butterfly calculation of the DFT so as to reduce the communication burden between butterflies implemented in parallel computing elements. In particular, no communication is required between pluralities of separate parallel processors, chips or circuit boards. A combination stage of butterfly calculation is provided which combines the outputs of all the parallel processors (chips or circuit boards).
In accordance with the present invention, the input data points of an N-point DFT are divided into subsets of data. A plurality of processors, chips or circuit boards operate in parallel and independently of each other, each being responsive to a respective subset of the input data points. The partial DFTs at the output of the plurality of parallel processors in each chip are then combined in a single combination phase to provide the complete partial DFT for a single chip. The partial DFTs at the output of the plurality of parallel chips in each circuit board are then combined in a single combination phase to provide the complete partial DFT for a single board. The partial DFTs at the output of the plurality of parallel circuit boards are then combined in a single combination phase to provide the DFT solution.
In the general case, each output data point of the DFT is a function of all of the input data points. However, by dividing the input data set into subsets, and operating each parallel processor independently of the other parallel processors in accordance with the present invention, the communication between the parallel processors is eliminated, thereby reducing the communication burden. Each parallel processor, or a group of several parallel processors, may then be implemented on a separate semiconductor chip or circuit board, without requiring any communication with any of the other parallel processors.
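The idea of independent partial DFTs followed by a single combination phase can be illustrated with the standard decimation-in-time identity. The Python sketch below is only an illustration under stated assumptions: it is not the specific combination circuit of the invention, the names dft and split_compute_combine are hypothetical, the subsets are assumed to be interleaved (every P-th sample), and the number of processors P is assumed to divide N.

```python
import cmath


def dft(x):
    """Direct DFT, standing in for the work done by one processor."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n)
                for m in range(n))
            for k in range(n)]


def split_compute_combine(x, num_procs):
    """N-point DFT from num_procs independent partial DFTs.

    Assumption: subset p holds samples x[p], x[p + P], x[p + 2P], ...
    """
    n = len(x)
    if n % num_procs:
        raise ValueError("num_procs must divide len(x)")
    m = n // num_procs
    # Independent phase: each "processor" sees only its own subset,
    # so no data moves between processors here.
    partials = [dft(x[p::num_procs]) for p in range(num_procs)]
    # Single combination phase: weight each partial DFT by a twiddle
    # factor and sum across processors.
    return [sum(cmath.exp(-2j * cmath.pi * p * k / n) * partials[p][k % m]
                for p in range(num_procs))
            for k in range(n)]
```

In this sketch the only point at which data from different subsets meets is the final list comprehension, which corresponds to the single combination phase; the same pattern could in principle be nested to model processors within chips and chips within boards.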
In a second embodiment of the present invention, a radix-r butterfly implementation is provided in which the plurality of independent processors are operated in parallel using the same instructions and accessing the same necessary set of multiplier coefficients from memory at the same time. The resulting algorithm, in which a number of parallel processors operate simultaneously by a single instruction sequence, reduces both the computational burden and the communication burden.