The present invention relates in general to Fast Fourier Transform (FFT) and related transform computations, and in particular to methods of using arrays of threads that are capable of sharing data with each other for performing parallel computations of FFTs and related transforms.
The Fourier Transform can be used to map a time domain signal to its frequency domain counterpart. Conversely, an Inverse Fourier Transform can be used to map a frequency domain signal to its time domain counterpart. Fourier transforms are commonly used for spectral analysis of time domain signals, for modulating signal streams in communication systems, and for numerous other applications.
Systems that process sampled data (e.g., conventional digital signal processors) generally implement a Discrete Fourier Transform (DFT) in which a processor performs the transform on a predetermined number of discrete samples. However, the DFT is computationally intensive; the number of computations required to perform an N-point DFT is $O(N^2)$. In some processors, the amount of processing power dedicated to performing DFTs may limit the processor's ability to perform other operations. Additionally, systems that are configured to operate in real time may not have sufficient processing power to perform a large DFT within the time allocated for the computation; such systems may be restricted to a smaller number of samples, which can adversely affect the quality of the resulting signals.
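By way of illustration only, the quadratic cost of a direct DFT can be seen in a minimal Python sketch. The function name `naive_dft` is hypothetical and is not part of any implementation described herein; the sketch assumes the forward-transform sign convention $e^{-j2\pi kt/N}$.

```python
import cmath

def naive_dft(samples):
    """Direct N-point DFT (illustrative sketch, forward sign convention assumed)."""
    n = len(samples)
    out = []
    for k in range(n):          # one pass per output point: N passes
        acc = 0j
        for t in range(n):      # N multiply-adds per output point
            acc += samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
        out.append(acc)
    return out                  # N * N complex multiply-adds in total
```

The two nested loops each run N times, so doubling the number of samples roughly quadruples the work.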
The Fast Fourier Transform (FFT) is an implementation of the DFT that allows a DFT to be performed in significantly fewer operations. For example, the radix-2 FFT algorithm recursively breaks down an N-point FFT into two N/2-point FFTs until the computation is reduced to N/2 2-point FFTs. For a decimation-in-time algorithm, each 2-point FFT is computed using an FFT “butterfly” computation of the form:
$$a'_{i_1} = a_{i_1} + a_{i_2}\, e^{-j 2\pi k/N}, \qquad a'_{i_2} = a_{i_1} - a_{i_2}\, e^{-j 2\pi k/N}, \qquad \text{(Eq. 1)}$$
where $a_{i_1}$ and $a_{i_2}$ are two points in the initial data set, $k$ is in the range 0 to $N-1$ (with the value of $k$ depending on $i_1$ and $i_2$), and $j=\sqrt{-1}$. Computed values $a'_{i_1}$ and $a'_{i_2}$ replace the original values $a_{i_1}$ and $a_{i_2}$ in the data set. The computation proceeds in stages (also referred to herein as “levels”), with pairs of output points $a'_{i_1}$, $a'_{i_2}$ generated in one stage being used as input points for the next stage. At each stage, or level, pairs of points are “butterflied” using Eq. 1. In one implementation, indices $i_1$ and $i_2$ identifying pairs of points for each butterfly are separated from each other by a “stride” of $2^L$, where $L$ is a level index that increments from 0 to $\log_2 N - 1$. This algorithm requires $O(N \log_2 N)$ computations to complete.
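The staged butterfly computation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the names `fft_dit` and `bit_reverse` are hypothetical, the input length is assumed to be a power of two, and the input is assumed to be permuted into bit-reversed order before the first stage so that the output appears in natural order. The particular mapping from $i_1$ to the exponent $k$ is one common choice.

```python
import cmath

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (hypothetical helper)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def fft_dit(samples):
    """Radix-2 decimation-in-time FFT using the butterfly of Eq. 1 (sketch)."""
    n = len(samples)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = n.bit_length() - 1                  # log2(N) stages
    # Assumption: permute the input into bit-reversed order up front.
    a = [complex(samples[bit_reverse(i, levels)]) for i in range(n)]
    for level in range(levels):                  # L = 0 .. log2(N) - 1
        stride = 1 << level                      # partners are 2**L apart
        for i1 in range(n):
            if i1 & stride:
                continue                         # i1 is the upper point of a pair
            i2 = i1 + stride
            k = (i1 % stride) * (n // (2 * stride))
            t = a[i2] * cmath.exp(-2j * cmath.pi * k / n)
            a[i1], a[i2] = a[i1] + t, a[i1] - t  # Eq. 1, computed in place
    return a
```

For example, `fft_dit([1, 2, 3, 4])` yields approximately `[10, -2+2j, -2, -2-2j]`, matching a direct 4-point DFT. Each of the $\log_2 N$ stages performs N/2 butterflies, which is the source of the $O(N \log_2 N)$ operation count.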
As is known in the art, some FFT algorithms produce output data points out of the “natural” sequence. For instance, when transforming from time domain to frequency domain using a forward DFT, it is expected that if the input samples are provided in temporal order, the output spectral samples should be in order of ascending frequency; for transforming from frequency domain to time domain (the inverse DFT), the converse is expected. Some FFT algorithms, however, generate the output points in a permuted sequence in which the indices are “bit-reversed,” i.e., the index of an output data point, expressed as a binary number, is a mirror image of the index bits for the output data point in the natural sequence.
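As a small illustration of the bit-reversed sequence, the following Python sketch lists, for each position in the natural sequence, the index obtained by mirroring its bits. The function names are hypothetical, and N is assumed to be a power of two.

```python
def bit_reverse(index, bits):
    """Mirror the lowest `bits` bits of `index`, e.g. 0b001 -> 0b100 for bits=3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the low bit of index into result
        index >>= 1
    return result

def bit_reversed_order(n):
    """Bit-reversed index for each natural index 0..n-1 (n a power of two)."""
    bits = n.bit_length() - 1                  # number of index bits, log2(n)
    return [bit_reverse(i, bits) for i in range(n)]

print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

For N = 8, index 1 (binary 001) maps to 4 (binary 100) and index 3 (binary 011) maps to 6 (binary 110), while palindromic indices such as 0, 2, 5, and 7 map to themselves.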
Accordingly, some FFT implementations perform bit-reversal, permuting the data points such that the output data is presented in its natural sequence. In some decimation-in-time implementations, the data set is permuted prior to the first set of butterfly computations by bit-reversing each index. The output data is then in the natural sequence. In some decimation-in-frequency implementations, by contrast, the output data set is permuted by bit-reversing each index after the last set of butterfly computations.
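The decimation-in-frequency case, in which the permutation follows the last set of butterfly computations, can be sketched in Python as shown below. This is an illustrative sketch only: the names `fft_dif` and `bit_reverse` are hypothetical, the length is assumed to be a power of two, and the butterfly shown is the usual decimation-in-frequency form (sum, then difference multiplied by the twiddle factor) rather than Eq. 1.

```python
import cmath

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (hypothetical helper)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def fft_dif(samples):
    """Radix-2 decimation-in-frequency FFT with output bit-reversal (sketch)."""
    n = len(samples)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = n.bit_length() - 1
    a = [complex(x) for x in samples]            # input stays in natural order
    stride = n // 2
    while stride >= 1:                           # strides N/2, N/4, ..., 1
        for i1 in range(n):
            if i1 & stride:
                continue                         # i1 is the upper point of a pair
            i2 = i1 + stride
            w = cmath.exp(-2j * cmath.pi * (i1 % stride) / (2 * stride))
            u, v = a[i1], a[i2]
            a[i1], a[i2] = u + v, (u - v) * w    # decimation-in-frequency butterfly
        stride //= 2
    # After the last stage the results sit at bit-reversed positions;
    # permute them back into the natural sequence.
    out = [0j] * n
    for i in range(n):
        out[bit_reverse(i, levels)] = a[i]
    return out
```

Here the input is consumed in natural order and only the final loop reorders the data, whereas a decimation-in-time variant would apply the same permutation to the input before the first stage.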
Conventional FFT implementations on processors such as digital signal processors (DSPs), central processing units (CPUs), and parallel processing systems rely heavily on a hardware-managed cache hierarchy to store the intermediate result data. These implementations require multiple accesses to an off-chip memory, which may increase memory bandwidth requirements or slow throughput.
It would therefore be desirable to provide faster implementations of FFT algorithms.