Signal processing systems typically need to convert signals between the time domain and the frequency domain, and the Fast Fourier Transform (FFT) algorithm performs this conversion. Compared with other transform algorithms, FFT has a regular structure and requires less computation, and it is therefore widely used in signal processing systems.
FFT takes N data points as input and outputs N data points. In general, a transform from the time domain to the frequency domain is called a forward transform, while a transform from the frequency domain to the time domain is called an inverse transform. There are many approaches to implementing FFT, all of which evolved from the Cooley-Tukey algorithm. The radix-2 Cooley-Tukey algorithm has log2(N) computation stages for N data points. Each computation stage takes N data points as input and outputs N data points. The output of the previous stage is reordered in a certain manner and used as input to the next stage. The input to the first stage is the original data, and the output of the last stage is the result of the FFT computation. FIG. 1 shows a computation flow with three computation stages 103 (S0, S1, S2), assuming a data length of 8 points. Each computation stage 103 is formed by N/2 butterfly units 102, whose computation flow is shown in FIG. 2. Each butterfly unit 102 takes two data points A and B and a twiddle factor W as input, and produces two results, A+BW and A−BW. For each butterfly unit, the serial numbers of the input data A and B have a correspondence that is determined by the computation stage in which the butterfly unit is located and by the serial number of the input data A or B. Likewise, the value of the twiddle factor W is determined by the computation stage 103 in which the butterfly unit is located, the serial number of the input data A or B, and the data length of the FFT. In computation stage S0 of FIG. 1, the zeroth data and the first data form a butterfly unit: the zeroth data is the input A, the first data is the input B, and the value of W is 1.
In computation stage S1, the first data and the third data form a butterfly unit: the first data is the input A, the third data is the input B, and the value of W is 1.
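The stage and butterfly structure described above can be illustrated with a minimal Python sketch. It is not taken from the figures: it assumes the common in-place decimation-in-time layout with bit-reversed input, so the index pairing and twiddle values at a given stage may differ from the ordering drawn in FIG. 1. The names butterfly, bit_reverse and fft_radix2 are hypothetical.

```python
import cmath


def butterfly(a, b, w):
    """One butterfly unit: inputs A, B and twiddle factor W give A+BW and A-BW."""
    t = b * w
    return a + t, a - t


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i (assumed input ordering)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Forward radix-2 FFT computed as log2(N) stages of N/2 butterflies each."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    data = [0j] * n
    for i, v in enumerate(x):
        data[bit_reverse(i, bits)] = complex(v)
    half = 1
    while half < n:                      # one pass per computation stage
        span = 2 * half
        nxt = list(data)                 # output buffer of this stage
        for start in range(0, n, span):
            for k in range(half):
                # W depends on the stage (span) and on the position k.
                w = cmath.exp(-2j * cmath.pi * k / span)
                i, j = start + k, start + k + half
                nxt[i], nxt[j] = butterfly(data[i], data[j], w)
        data = nxt                       # becomes the input of the next stage
        half = span
    return data
```

For an 8-point input the outer loop runs three times, and each pass evaluates four butterflies, matching the three stages of N/2 butterfly units described above.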
The computation stages are data-dependent: a stage cannot start its computation until the computation of the previous stage is completed. Accordingly, each stage stores its results in a memory after completing its computation, and the next stage reads those results from the memory as its input. The butterfly units within a computation stage are independent of each other, and the order in which they are computed does not affect the results. However, the data A and B and the twiddle factor W read by each butterfly unit must satisfy a certain internal correspondence.
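The independence of butterfly units within a stage can be checked with a small sketch. It assumes the same standard in-place decimation-in-time index layout as a typical textbook implementation, which may not match the ordering of FIG. 1; stage_butterflies and run_stage are hypothetical names. Each stage reads only from the buffer written by the previous stage and writes to a separate buffer, so the butterflies can be evaluated in any order.

```python
import cmath


def stage_butterflies(n, stage):
    """Return (index of A, index of B, W) for the N/2 butterflies of one stage."""
    half = 1 << stage
    span = 2 * half
    return [(s + k, s + k + half, cmath.exp(-2j * cmath.pi * k / span))
            for s in range(0, n, span) for k in range(half)]


def run_stage(src, stage, order=None):
    """Read the previous stage's results from src and return this stage's results.

    `order` optionally permutes the sequence in which butterflies are evaluated.
    """
    units = stage_butterflies(len(src), stage)
    if order is not None:
        units = [units[p] for p in order]
    dst = [None] * len(src)              # separate output memory for this stage
    for i, j, w in units:
        t = src[j] * w
        dst[i], dst[j] = src[i] + t, src[i] - t
    return dst


# Evaluate the four butterflies of stage 1 in natural and in reversed order.
mem = [complex(v) for v in range(8)]
natural = run_stage(mem, 1)
reordered = run_stage(mem, 1, order=[3, 2, 1, 0])
```

Both calls produce identical buffers, because every butterfly touches its own pair of locations and reads only the previous stage's output. What cannot be reordered is the sequence of stages: run_stage(mem, 2) needs the complete output of stage 1 as its src.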
Parallel FFT computation is currently being studied at home and abroad; see, for example, CN patent 200910054018.9 (“Method for Implementing Parallel-Structure FFT Processors Based on FPGA”), CN patent 201110163600.6 (“FFT Device and Method Based on Parallel Processing”), and U.S. Pat. No. 6,792,441B2 (“Parallel MultiProcessing For Fast Fourier Transform With Pipeline Architecture”). These patent documents focus on how to decompose a long FFT sequence into a plurality of short FFT sequences, compute the respective short sequences in parallel on a plurality of processors, and then interleave the short-sequence FFT results to obtain the final long-sequence FFT result. The FFT of each short sequence still comprises multiple stages of butterfly computation, and each stage requires associated memory access operations, which cause a long delay. Therefore, such parallel butterfly computation methods are limited in speed.
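The general idea of splitting a long transform into short ones and interleaving the results can be sketched as follows. This is not the scheme of any of the cited patents, whose details are not reproduced here; it is one textbook decomposition (a single decimation-in-frequency split into two halves) chosen as an assumption for illustration, and dft and split_fft are hypothetical names.

```python
import cmath


def dft(x):
    """Direct DFT, standing in for the short-sequence FFT run by each processor."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def split_fft(x):
    """Compute a length-N transform from two length-N/2 transforms."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("length must be even")
    m = n // 2
    # Pre-processing: combine the two halves of the long input sequence.
    even_in = [x[t] + x[t + m] for t in range(m)]
    odd_in = [(x[t] - x[t + m]) * cmath.exp(-2j * cmath.pi * t / n)
              for t in range(m)]
    # The two short transforms share no data and could run in parallel.
    even_out = dft(even_in)              # X[0], X[2], X[4], ...
    odd_out = dft(odd_in)                # X[1], X[3], X[5], ...
    # Interleave the short results to form the long result.
    out = [0j] * n
    out[0::2] = even_out
    out[1::2] = odd_out
    return out
```

In this sketch the parallelism sits between the short transforms. Inside each short transform, an FFT implementation would still run its own sequence of butterfly stages with a memory round trip per stage, which is the delay the paragraph above points to.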