Fast Fourier transforms (FFTs) are a set of algorithms for computing discrete Fourier transforms (DFTs). The discrete Fourier transform is the central computation in many spectral analysis problems that are encountered in signal processing. For example, in speech recognition, spectral analysis is usually the first, or at least an early, step in processing an acoustic signal to discern words or word components. In sonar systems, sophisticated spectrum analysis is required for the location of surface vessels and submarines. In radar systems, obtaining target velocity information generally involves the measurement of spectra. Fast Fourier transforms can in some cases reduce the computational time required to compute discrete Fourier transforms by a factor of 100 or more.
In general, FFTs operate by breaking an original N-point sequence into shorter sequences, the DFTs of which can be combined to give the DFT of the original N-point sequence. Because a direct evaluation of an N-point DFT requires (N−1)² complex multiplications and N(N−1) complex additions, breaking down the calculation into smaller units results in a significant time savings, even in light of the further steps required to combine the results of the smaller DFT calculations.
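To make the operation counts concrete, the following is a minimal Python sketch of a direct DFT evaluation together with the multiplication and addition counts quoted above. It is offered only as an illustration; the function names `direct_dft` and `direct_dft_cost` are assumptions introduced here and do not appear in the original text.

```python
import cmath

def direct_dft(x):
    """Evaluate the N-point DFT directly from its definition (no FFT)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def direct_dft_cost(n):
    """Counts quoted in the text: (N-1)^2 complex multiplications
    and N(N-1) complex additions for a direct N-point evaluation."""
    return (n - 1) ** 2, n * (n - 1)

mults, adds = direct_dft_cost(256)
print(mults, adds)  # 65025 65280
```

For a 256-point transform the direct approach therefore needs on the order of 65,000 complex multiplications, which is the cost that the decomposition into shorter DFTs is intended to reduce.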
The basic calculating operation in most FFT algorithms is the so-called “butterfly.” A butterfly is a complex adder-subtractor operation that has the same number of inputs as outputs. The number of inputs, which corresponds to the level of decomposition of the DFT, is sometimes referred to as the “radix.” Thus, a radix-4 calculation will have four inputs to, and four outputs from, each butterfly. Taking the example of a 256-point DFT calculation, the DFT may be broken into 64 4-point DFTs which may be combined using 64 (N/4) butterflies in each of 4 (log₄(N)) stages, making a total of 4 stages in the FFT including the 4-point DFT calculations. Combining multiplications are also applied at each butterfly. The multipliers employed are referred to as “twiddle factors” or sometimes as phase or rotation factors.
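As an illustration of the butterfly and stage counts just described, the following Python sketch shows one possible radix-4 butterfly with the twiddle multiplications omitted, together with a helper that reproduces the 256-point stage arithmetic. The names `radix4_butterfly` and `radix4_stage_plan` are hypothetical, and the sketch assumes the forward-transform sign convention.

```python
def radix4_butterfly(a, b, c, d):
    """Four inputs, four outputs: a 4-point DFT built only from adds,
    subtracts and multiplications by +/-j (twiddle factors omitted)."""
    t0, t1 = a + c, a - c
    t2, t3 = b + d, b - d
    return (t0 + t2, t1 - 1j * t3, t0 - t2, t1 + 1j * t3)

def radix4_stage_plan(n):
    """Return (stages, butterflies_per_stage) for an N-point radix-4 FFT."""
    stages, size = 0, 1
    while size < n:
        size *= 4
        stages += 1
    if size != n:
        raise ValueError("n must be a power of four")
    return stages, n // 4

print(radix4_stage_plan(256))  # (4, 64)
```

The stage count is computed with integer arithmetic rather than a floating-point logarithm so that the power-of-four check is exact.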
One important aspect of the butterfly structure is that, because the operation has the same number of inputs as outputs, intermediate stage FFT calculations require no additional memory allocation (except of course for the twiddle factors) during calculation because the outputs of the butterfly can be written back to memory at the same location that the inputs were taken from. An FFT algorithm that uses the same locations to store both the input and output sequences is called an in-place algorithm.
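The in-place property can be sketched in Python as follows. This example uses a conventional radix-2 decimation-in-frequency FFT rather than the radix-4 arrangement discussed elsewhere in this document, purely because it is the shortest way to show each butterfly writing its outputs back over its own inputs. The function name `fft_dif_in_place` is an assumption made for illustration.

```python
import cmath

def fft_dif_in_place(x):
    """Radix-2 decimation-in-frequency FFT that overwrites x.
    Assumes len(x) is a power of two."""
    n = len(x)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))  # twiddle factor
                a, b = x[start + k], x[start + k + span]
                # Two inputs are read and two outputs are written back
                # to the same two locations; no second buffer is used.
                x[start + k] = a + b
                x[start + k + span] = (a - b) * w
        span //= 2
    return x

data = [1, 2, 3, 4, 0, 0, 0, 0]
fft_dif_in_place(data)
# data[0] now holds the DC term 1 + 2 + 3 + 4 = 10; the remaining
# outputs are present in the list but not in natural order.
```

Apart from the twiddle factor computed inside the loop, the only storage used is the input list itself.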
The order of the inputs to, and outputs from, an FFT calculation is important because typical FFT algorithms result in the outputs being in “bit-reversed” order from the inputs. While algorithms for reordering the outputs to be in natural order (or reordering the inputs to be in bit-reversed order) are known in the art, additional processing steps must be employed to perform the reordering. Because FFTs are often calculated many at a time, processing time could be meaningfully reduced if an improved ordering scheme could be applied.
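The separate reordering pass referred to above can be sketched in Python as follows. This is a generic bit-reversal permutation of the kind known in the art, not the ordering scheme of the invention; the names `bit_reverse` and `bit_reverse_permute` are illustrative assumptions.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def bit_reverse_permute(x):
    """Extra pass over the data: swap each element with the element at
    its bit-reversed index. Assumes len(x) is a power of two."""
    bits = len(x).bit_length() - 1
    for i in range(len(x)):
        j = bit_reverse(i, bits)
        if i < j:  # swap each pair once; self-reversing indices stay put
            x[i], x[j] = x[j], x[i]
    return x

print(bit_reverse_permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Every element is visited once more after the transform itself has finished, which is the additional processing cost that motivates a better ordering scheme.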
In addition, the processing of FFT algorithms can be improved by employing vector processors. Traditionally, many high-performance computational devices for processing spectra have included a combination of a single microprocessor controlling system functions with off-chip devices, such as DSP (digital signal processor) farms or custom ASICs (application specific integrated circuits) to perform specialized computations. More recently, array or vector processors have been employed to address high-bandwidth data processing and algorithmic-intensive computations in a single chip. In one common type of array processor, the SIMD (“single instruction, multiple data”) processor, a single instruction operates on a vector having multiple values rather than on a single value in a register. One exemplary SIMD processor is the PowerPC™ G4™ processor having AltiVec™ technology. FFT algorithms, however, must be improved upon to take full advantage of the promise offered by vector processing systems.
The methods and apparatus of the invention improve on existing FFT calculations by providing a multistage FFT calculation in which the final stage is characterized by two processing loops that store the outputs of butterfly calculations in a shuffled order that results in the FFT outputs being correctly ordered with no need to perform an additional bit-reversal ordering pass. In one embodiment, the invention provides a system for performing a fast Fourier transform on N ordered inputs in n stages. The system includes a non-final stage calculating section that repetitively performs in-place butterfly calculations for n−1 stages, as well as a final stage calculating section that performs a final stage of butterfly calculations.
The final stage calculating section executes a first loop and a second loop. The first loop performs a portion of the final stage butterfly calculations by iterating on a table of first loop index values consisting of values that bit-reverse into themselves. The first loop also executes control logic to select inputs for groups of butterfly calculations based on the first loop index values, to perform the groups of butterfly calculations, and to store the butterfly calculation outputs in shuffled order in place of the selected inputs to result in a correct ordering of transform outputs.
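A table of first loop index values of the kind described above might be built as in the following Python sketch. The 4-bit index width is chosen only for illustration, and the names `bit_reverse` and `self_reversing_indices` are assumptions introduced here.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def self_reversing_indices(bits):
    """Index values whose bit pattern reads the same in both directions,
    so that they bit-reverse into themselves."""
    return [i for i in range(1 << bits) if bit_reverse(i, bits) == i]

print(self_reversing_indices(4))  # [0, 6, 9, 15]
```

In binary these four values are 0000, 0110, 1001 and 1111. Such a table could be computed once and reused for every transform of the same length.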
The second loop performs the remaining portion of the final stage butterfly calculations by iterating on a table of second loop index value pairs, where each pair consists of two values that bit-reverse into each other. The second loop executes control logic to select inputs for two groups of butterfly calculations based on the two second loop index pair values respectively, to perform the two groups of butterfly calculations, and to store the butterfly calculation outputs from one of the two groups of butterfly calculations in shuffled order in place of the inputs selected for the other of the two groups of butterfly calculations and vice versa.
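A table of second loop index value pairs might be generated as in the following Python sketch. As before, the 4-bit index width is only an illustrative assumption, and `bit_reverse` and `reversing_pairs` are hypothetical names.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def reversing_pairs(bits):
    """Pairs (i, j) with i < j in which i and j bit-reverse into each other."""
    pairs = []
    for i in range(1 << bits):
        j = bit_reverse(i, bits)
        if i < j:  # record each pair once and skip self-reversing values
            pairs.append((i, j))
    return pairs

print(reversing_pairs(4))
# [(1, 8), (2, 4), (3, 12), (5, 10), (7, 14), (11, 13)]
```

For 4-bit indices the six pairs account for twelve values, and the four self-reversing values account for the rest, so the two tables together cover all sixteen index values exactly once.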
In more specific embodiments, N is a power of two and in the final calculating stage, calculations are carried out as radix-4 butterflies. The system of the invention can also be a computer system having a four-fold SIMD processor wherein the butterfly calculations are carried out as four simultaneous radix-4 butterflies.
Where radix-4 butterflies are performed in groups of four, a specific shuffling order can be employed by representing the inputs and outputs for the four radix-4 butterflies as 4×4 matrices. In this scenario, the first loop iterates through a list of first loop index values between 0 and N/16−1 that bit-reverse into themselves. Control logic executed by the first loop selects four groups of four consecutive inputs by transforming the first loop index value into four input indices by multiplying the first loop index value by four, and successively adding N/4 to result in four input indices. Each group of four consecutive inputs is then selected beginning with one input index. Four radix-4 butterfly calculations, one calculation for each group of four inputs, are performed and the outputs are stored in place of the inputs in shuffled order. The shuffled order results from a 4×4 matrix transposition and subsequent swapping of two inner columns.
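The index transformation and the shuffled storage order described above can be sketched in Python as follows. The sketch assumes that each row of the 4×4 matrix holds one group of four consecutive values and that the "two inner columns" are columns 1 and 2 counting from zero; the names `input_indices` and `shuffle_4x4` are illustrative assumptions.

```python
def input_indices(index, n):
    """Multiply the loop index by four, then successively add N/4,
    giving the starting position of each group of four inputs."""
    return [4 * index + r * (n // 4) for r in range(4)]

def shuffle_4x4(m):
    """Transpose a 4x4 matrix, then swap its two inner columns."""
    t = [[m[c][r] for c in range(4)] for r in range(4)]
    for row in t:
        row[1], row[2] = row[2], row[1]
    return t

print(input_indices(6, 256))  # [24, 88, 152, 216]

m = [[4 * r + c for c in range(4)] for r in range(4)]
for row in shuffle_4x4(m):
    print(row)
# [0, 8, 4, 12]
# [1, 9, 5, 13]
# [2, 10, 6, 14]
# [3, 11, 7, 15]
```

In the usage shown, the matrix entries are simply numbered 0 to 15 so that the effect of the transposition and column swap on each position can be read directly from the printed rows.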
Following this same scenario, the second loop iterates through a list of second loop index pair values which includes pairs of values between 0 and N/16−1 that bit-reverse into each other. In the second loop, two sets of four groups of four consecutive inputs are selected by transforming each value in the second loop index pair into four input indices and selecting four consecutive inputs at each input index as in the first loop. Two sets of four radix-4 butterfly calculations are then performed and the outputs are stored in place of each other (one set's outputs over the other set's inputs and vice versa) using the same shuffling of order used in the first loop.
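The cross-wise storage performed by the second loop can be sketched in Python as follows. The sketch shows only the data movement: the `butterfly` argument is a hypothetical callable taking four values and returning four values, and an identity function is passed in the usage example so that the movement is visible. All function names, and the reading of "two inner columns" as columns 1 and 2 counting from zero, are assumptions made for illustration.

```python
def input_indices(index, n):
    """Starting positions of the four groups selected by a loop index."""
    return [4 * index + r * (n // 4) for r in range(4)]

def shuffle_4x4(m):
    """Transpose a 4x4 matrix, then swap its two inner columns."""
    t = [[m[c][r] for c in range(4)] for r in range(4)]
    for row in t:
        row[1], row[2] = row[2], row[1]
    return t

def load(data, starts):
    """Read four groups of four consecutive values as a 4x4 matrix."""
    return [data[s:s + 4] for s in starts]

def store(data, starts, m):
    """Write the rows of a 4x4 matrix back as four groups of four."""
    for s, row in zip(starts, m):
        data[s:s + 4] = row

def second_loop(data, pairs, butterfly):
    """For each index pair, store one set's shuffled outputs over the
    other set's inputs and vice versa."""
    n = len(data)
    for a, b in pairs:
        ia, ib = input_indices(a, n), input_indices(b, n)
        # Both sets are computed before either is overwritten.
        out_a = [list(butterfly(*row)) for row in load(data, ia)]
        out_b = [list(butterfly(*row)) for row in load(data, ib)]
        store(data, ib, shuffle_4x4(out_a))
        store(data, ia, shuffle_4x4(out_b))
    return data

identity = lambda a, b, c, d: (a, b, c, d)  # stand-in for a real butterfly
data = second_loop(list(range(64)), [(1, 2)], identity)
print(data[4:8], data[8:12])  # [8, 40, 24, 56] [4, 36, 20, 52]
```

With N = 64 the loop index has two bits, so the only pair of values that bit-reverse into each other is (1, 2), which is why the usage example passes a single pair.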