Signal processing systems typically need to convert signals between the time domain and the frequency domain. The Fast Fourier Transform (FFT) algorithm performs this conversion. Compared with other transform algorithms, the FFT has a uniform structure and requires less computation, and it is therefore widely used in signal processing systems.
The FFT takes N points of data as input and outputs N points of data. In general, a transform from the time domain to the frequency domain is called a forward transform, while a transform from the frequency domain to the time domain is called an inverse transform. There are many approaches to implementing the FFT, and most of them are derived from the Cooley-Tukey algorithm. The radix-2 Cooley-Tukey algorithm has log2 N computation stages for N data points. Each computation stage takes N data points as input and outputs N data points. The output of one stage is sorted in a certain manner and used as the input to the next stage. The input to the first stage is the original data, and the output of the last stage is the result of the FFT computation. FIG. 1 shows a computation flow with three computation stages 103 (S0, S1, S2), assuming that the number of data points is 8.
Each computation stage 103 consists of N/2 butterflies 102, whose computation flow is shown in FIG. 2. Each butterfly takes two data points A and B and a twiddle factor W as input, and produces two results, A+BW and A−BW. Within each butterfly, the indices of the input data A and B have a correspondence that is determined by the computation stage of the butterfly and by the index of input A or B. Likewise, the value of the twiddle factor W is determined by the computation stage 103 of the butterfly, the index of input A or B, and the FFT data length. For example, in the computation stage S0 of FIG. 1, the zeroth data and the first data form a butterfly: the zeroth data is the input A, the first data is the input B, and the value of W is 1. In the computation stage S1, the first data and the third data form a butterfly: the first data is the input A, the third data is the input B, and the value of W is 1.
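The butterfly operation and the index correspondence described above can be sketched in a few lines of Python. This is a minimal illustration rather than the flow of FIG. 2 itself. The function names `butterfly` and `partner` are hypothetical, and the pairing rule (a stride of 2 to the power of the stage number) is an assumption chosen because it reproduces the two pairings quoted above (0 with 1 in S0, 1 with 3 in S1).

```python
def butterfly(a, b, w):
    """Return the two butterfly outputs (A + B*W, A - B*W)."""
    t = b * w          # the product B*W is shared by both outputs
    return a + t, a - t


def partner(index, stage):
    """Index paired with `index` in the given stage.

    Assumption: inputs A and B of a butterfly in stage s are 2**s apart,
    so the partner index differs from `index` only in bit s.
    """
    return index ^ (1 << stage)
```

For instance, `butterfly(3, 2, 1)` returns `(5, 1)`, `partner(0, 0)` returns `1`, and `partner(1, 1)` returns `3`.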
The computation stages are data-dependent: a stage can start its computation only after the computation of the previous stage has completed. Accordingly, each stage stores its results in a memory after completing its computation, and the next stage reads the results of the previous stage from the memory as its input. The butterflies within a computation stage are independent of each other, and the order in which they are computed does not affect the results. However, the data A and B and the twiddle factor W read by each butterfly must satisfy a certain internal correspondence.
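The stage-by-stage dependence, and the independence of the butterflies inside a stage, can be illustrated with the following sketch. It assumes the textbook decimation-in-time layout (bit-reversed input, stride 2**stage, twiddle factor exp(-2*pi*j*k/(2*stride)) for a forward transform); the pairing and twiddle assignment in FIG. 1 may differ. The names `fft_by_stages`, `bit_reverse`, and `dft` are hypothetical, and the Python list `mem` merely stands in for the memory mentioned above.

```python
import cmath
import random


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_by_stages(x, shuffle=False):
    """Forward radix-2 FFT computed one stage at a time.

    Assumes len(x) is a power of two. With shuffle=True the butterflies
    of each stage are processed in random order.
    """
    n = len(x)
    stages = n.bit_length() - 1                # log2(N) stages
    # "Memory" holding the input of the current stage (bit-reversed order).
    mem = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    for s in range(stages):
        half = 1 << s                          # distance between A and B
        # The N/2 butterflies of this stage, identified by the index of A.
        tops = [i for i in range(n) if not i & half]
        if shuffle:
            random.shuffle(tops)
        out = [0j] * n                         # results go to a separate buffer
        for i in tops:
            j = i + half                       # index of input B
            w = cmath.exp(-2j * cmath.pi * (i % half) / (2 * half))
            t = mem[j] * w
            out[i], out[j] = mem[i] + t, mem[i] - t
        mem = out                              # next stage reads these results
    return mem


def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]
```

Calling `fft_by_stages(x)` and `fft_by_stages(x, shuffle=True)` on the same 8-point input should both agree with `dft(x)` up to rounding, which reflects that only the order of the stages matters, not the order of the butterflies within a stage.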
Most patent documents relating to parallel FFT algorithms focus on how to decompose a long sequence of FFT data into a plurality of short sequences, use a plurality of processors to compute the FFTs of the respective short sequences in parallel, and then interleave the short-sequence FFT results to obtain the final long-sequence FFT result.
An example is U.S. Pat. No. 6,792,441B2 (“Parallel MultiProcessing For Fast Fourier Transform With Pipeline Architecture”). Such algorithms do not consider the conflicts that may arise when the plurality of processors access the memory at the same time, or how the processors interleave the short-sequence FFT results. In practical applications, memory access conflicts, as well as the efficiency of synchronization and communication among the processors, greatly affect the efficiency of FFT computation.
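The decompose, compute, and interleave approach described above can be sketched as follows. This is a generic decimation-in-frequency split offered for illustration only; it is not the method of the cited patent. The names `split_fft` and `dft` are hypothetical, the data length is assumed to be divisible by the number of short sequences `p`, and a direct DFT stands in for the short FFT that each processor would run.

```python
import cmath


def w(num, den):
    """Forward-transform twiddle factor exp(-2*pi*j*num/den)."""
    return cmath.exp(-2j * cmath.pi * num / den)


def dft(x):
    """Direct DFT, standing in for a short FFT on one processor."""
    n = len(x)
    return [sum(x[k] * w(k * m, n) for k in range(n)) for m in range(n)]


def split_fft(x, p):
    """N-point transform via p independent short transforms.

    Assumes len(x) is divisible by p. Each value of q depends only on
    the input x, so the p iterations could run on separate processors.
    """
    n = len(x)
    m = n // p                                  # length of each short sequence
    result = [0j] * n
    for q in range(p):
        # Pre-combine the input into the q-th short sequence.
        short_in = [w(i * q, n) * sum(x[i + r * m] * w(r * q, p)
                                      for r in range(p))
                    for i in range(m)]
        short_out = dft(short_in)               # short transform
        result[q::p] = short_out                # interleave: X[p*k + q]
    return result
```

With an 8-point input, `split_fft(x, 2)` and `split_fft(x, 4)` should both match `dft(x)` up to rounding. Note that the sketch writes all short results into one shared list; how several processors would do so without conflict is exactly the question raised above.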
U.S. Pat. No. 6,304,887B1 (“FFT-Based Parallel System For Array Processing With Low Latency”) discusses parallel read/write (R/W) of data in FFT. According to that patent document, the FFT data are stored in a plurality of memories and are sorted by using multiple data buffers and multiple selectors, in order to guarantee that the data accessed in each R/W operation are located in different memories. In this way, parallel read/write of data can be achieved. However, the scheme requires dedicated memories, data buffers, and selectors, and the calculation of R/W addresses is complex. Thus, it is difficult to implement parallel FFT computation with different data lengths and R/W granularities.
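To illustrate why the placement of data across several memories matters, the sketch below checks in which stages the two inputs of a butterfly would fall into the same memory. It rests on assumptions that are not taken from either cited patent: data point `a` is stored in memory bank `a % banks` (simple low-order interleaving), and the inputs of a butterfly in stage s are 2**s apart. The names `bank_of` and `conflicting_stages` are hypothetical.

```python
def bank_of(address, banks):
    """Bank holding `address` under simple low-order interleaving (assumed)."""
    return address % banks


def conflicting_stages(n, banks):
    """Stages in which some butterfly reads A and B from the same bank.

    Assumes n is a power of two and a stride of 2**s in stage s.
    """
    stages = n.bit_length() - 1
    conflicts = []
    for s in range(stages):
        stride = 1 << s
        # Inputs A are the indices whose bit s is clear; B is A + stride.
        if any(bank_of(a, banks) == bank_of(a + stride, banks)
               for a in range(n) if not a & stride):
            conflicts.append(s)
    return conflicts
```

Under these assumptions, `conflicting_stages(16, 4)` returns `[2, 3]`: once the stride becomes a multiple of the number of banks, A and B can no longer be read in parallel, which is why a scheme must reorder the data between stages or use a different address mapping.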