This invention relates generally to a technique for computing a Fast Fourier Transform (FFT), and more particularly to methods and apparatus for computing an FFT in which the number of loop operations is reduced and the output data values from each stage are stored in a memory with a unity stride.
The fast Fourier transform (FFT) is the generic name for a class of computationally efficient algorithms that implement the discrete Fourier transform (DFT) and that are widely used in the field of digital signal processing.
A band-limited time-varying analog signal can be converted into a series of discrete digital signals by sampling the analog signal at or above the Nyquist rate, to avoid aliasing, and digitizing the sampled analog signals. A DFT algorithm may be applied to these digitized samples to calculate the discrete frequency components contained within the analog signal. The DFT algorithm provides, as output data values, the magnitude and phase of the discrete frequency components of the analog signal. These discrete frequency components are evenly spaced between 0 and ½ the sampling frequency, which is typically the Nyquist rate. The number of discrete frequency components is equal to the number of digitized samples that are used as input data. For example, a DFT having 8 input samples will have 8 evenly spaced frequency components as output.
The DFT is given by:

$$X(k) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, e^{j \frac{2\pi n k}{N}}$$

where:
N is the number of input samples;
n is the particular index in the time domain sample from n=0 to n=N−1;
x(n) is the magnitude of the time domain analog signal at the time sample point corresponding to n;
k is the particular frequency domain component from k=0 to k=N−1; and
X(k) is the magnitude of the frequency component corresponding to the frequency index k.
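As an illustration only, the summation above can be evaluated directly. The following minimal sketch assumes Python and its standard `cmath` module; the function name `dft_as_written` is hypothetical and is not part of the described apparatus. It keeps the 1/N scaling and the sign of the exponent exactly as they appear in the formula above.

```python
import cmath

def dft_as_written(x):
    """Directly evaluate X(k) = (1/N) * sum_n x(n) * exp(j*2*pi*n*k/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(2j * cmath.pi * n * k / N) for n in range(N)) / N
        for k in range(N)
    ]

# Example: a constant signal of 8 samples has energy only in the k = 0 component.
spectrum = dft_as_written([1.0] * 8)
```

Each of the N outputs requires a sum over all N inputs, which is the source of the roughly N-squared cost of the direct method.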
The DFT involves a large number of calculations and memory operations and, as such, is not computationally efficient. The FFT algorithm reduces the computational load of calculating the discrete frequency components in a time domain signal from approximately $6N^2$ to approximately $N \log_2 N$. As will be discussed in detail below, this reduction in the number of calculations is achieved by decomposing the standard DFT algorithm into a series of smaller and smaller DFTs. For example, an 8-point DFT can be decomposed into an FFT involving 3 stages of calculations. In this manner, the 8-point DFT is decomposed into two 4-point DFTs, which are in turn decomposed into four 2-point DFTs.
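To make the size of this reduction concrete, the two approximate operation counts quoted above can be tabulated for a few transform sizes. This is a rough sketch in Python; the function names `dft_cost` and `fft_cost` are hypothetical, and the counts are only the approximations stated in the text, not measured instruction counts.

```python
import math

def dft_cost(n):
    """Approximate cost of the direct DFT, using the 6*N^2 figure from the text."""
    return 6 * n * n

def fft_cost(n):
    """Approximate cost of the FFT, using the N*log2(N) figure from the text."""
    return n * round(math.log2(n))

for n in (8, 64, 1024):
    print(n, dft_cost(n), fft_cost(n))
```

For N = 8 the two figures are 384 and 24; for N = 1024 they are 6291456 and 10240.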
At each stage of the FFT algorithm, the canonical mathematical operation performed on each pair of input data values is known as the FFT butterfly operation. FIG. 4 illustrates the canonical FFT butterfly operations, which are

$$X(m+1) = X(m) + W(n,k)\,Y(m)$$
$$Y(m+1) = X(m) - W(n,k)\,Y(m)$$

where X and Y are input signals and are discussed in more detail below. W(n, k) (the “twiddle factor”) is a complex value and is given by the formula:
$$W(n,k) = e^{j \frac{2\pi n k}{N}}$$
This complex function is periodic and, for an FFT of a given size N, provides N/2 constant values. As discussed in more detail below, these values may be pre-calculated and stored in a memory.
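The butterfly operation and the pre-calculated twiddle values can be sketched as follows. This is a minimal illustration in Python, assuming the sign convention of the twiddle formula above; the names `twiddle_table` and `butterfly`, and the indexing of the table by a single integer q, are assumptions made for the example.

```python
import cmath

def twiddle_table(N):
    """Pre-calculate the N/2 values exp(j*2*pi*q/N) for q = 0 .. N/2 - 1."""
    return [cmath.exp(2j * cmath.pi * q / N) for q in range(N // 2)]

def butterfly(x, y, w):
    """One butterfly: returns (x + w*y, x - w*y) for twiddle factor w."""
    t = w * y  # the product is computed once and used for both outputs
    return x + t, x - t

table = twiddle_table(8)  # four values for an 8-point FFT
top, bottom = butterfly(1 + 0j, 1 + 0j, table[0])  # table[0] is exactly 1
```

Because the same product W·Y feeds both outputs, each butterfly needs only one complex multiplication.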
FIG. 1 illustrates a traditional decimation-in-time FFT signal flow graph for an 8-input (8-point) FFT. An FFT algorithm includes $\log_2 N$ stages of calculations. Thus, the 8-point FFT signal flow graph 100 is divided into $\log_2 8$, or three, stages: the first stage 102, the second stage 104, and the third stage 106, where each stage performs N/2 butterfly calculations. Thus, in FIG. 1, every stage of the signal flow graph 100 performs 8/2, or 4, butterfly calculations. An examination of FIG. 1 also shows that the first stage provides four 2-point FFTs, the second stage provides two 4-point FFTs, and the final stage provides one 8-point FFT. Thus, each stage has a number of groups in which the FFTs are calculated. The number of groups per stage is given by:

$$\text{groups} = 2^{\log_2(N) - m}$$

where N is the number of input data points, and m is the number of the stage, from m=1 to m=$\log_2 N$. Thus in FIG. 1, the first stage has $2^{3-1}$, or 4, groups of FFTs, 108, 110, 112, and 114. The second stage has $2^{3-2}$, or 2, groups of FFTs, 116 and 118. The final stage has $2^{3-3}$, or 1, group of an FFT, 120.
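The group count formula can be checked with a short sketch. The Python function names below are hypothetical; the code assumes N is a power of two, as in the 8-point example.

```python
import math

def groups_per_stage(N):
    """Number of groups in each stage m = 1 .. log2(N): 2**(log2(N) - m)."""
    stages = round(math.log2(N))
    return [2 ** (stages - m) for m in range(1, stages + 1)]

def butterflies_per_group(N):
    """Each stage performs N/2 butterflies, shared equally among its groups."""
    return [(N // 2) // g for g in groups_per_stage(N)]

print(groups_per_stage(8))       # [4, 2, 1]
print(butterflies_per_group(8))  # [1, 2, 4]
```

As the number of groups halves from stage to stage, the number of butterflies in each group doubles, so the total per stage stays at N/2.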
To compute an FFT on a computer, the signal flow graph 100 must be translated into a software program. A software program based on the traditional FFT signal flow graph will typically first re-order the data into a bit-reversed order, as shown by the input data 122. Next, three nested loops that calculate the FFT data are executed. The outermost loop, known as the stage loop, is executed once for each stage. Therefore, for an N-point FFT, there are $\log_2 N$ iterations of the outer loop. The middle loop, known as the group loop, is executed a different number of times for each stage. As discussed above, the number of groups per stage varies from $2^{\log_2(N) - m}$ to 1 depending on the position of the stage in the algorithm. Thus, for the early stages of the FFT, the group loop is entered and exited many times in each stage. The innermost loop, known as the butterfly loop, is executed a total of N/2 times for each stage.
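The traditional three-loop structure described above can be sketched as follows. This is a minimal Python illustration, not the program of any particular prior-art system. The function names are hypothetical, N is assumed to be a power of two, twiddle factors are computed inline rather than read from a table for brevity, and the result is scaled by 1/N to match the DFT formula given earlier.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_three_loops(x):
    """Traditional decimation-in-time FFT; len(x) is assumed a power of two."""
    N = len(x)
    stages = N.bit_length() - 1
    data = [0j] * N
    for i in range(N):                      # re-order input into bit-reversed order
        data[bit_reverse(i, stages)] = x[i]
    for m in range(1, stages + 1):          # stage loop: log2(N) iterations
        stride = 2 ** (m - 1)               # distance between butterfly inputs
        groups = 2 ** (stages - m)          # groups in this stage
        for g in range(groups):             # group loop
            base = g * 2 * stride
            for b in range(stride):         # butterfly loop
                w = cmath.exp(2j * cmath.pi * b / (2 * stride))
                i, j = base + b, base + b + stride
                t = w * data[j]
                data[i], data[j] = data[i] + t, data[i] - t
    return [v / N for v in data]            # 1/N scaling, as in the DFT formula

# Example: a unit impulse at n = 1 gives X(k) = exp(j*2*pi*k/8) / 8.
result = fft_three_loops([0, 1, 0, 0, 0, 0, 0, 0])
```

Note that both the trip count of the group loop and the value of `stride` must be recomputed on every pass through the stage loop.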
The FFT signal flow graph 100 also illustrates another aspect of the traditional FFT technique. The data that is provided by each butterfly calculator is stored in a different sequence in each stage of the FFT. For example, in the first stage 102 the input data is stored in a bit-reversed order. Thus, each butterfly calculator receives input data values that are stored in adjacent memory locations. In addition, each butterfly calculator provides output data values that are stored in adjacent memory locations in the sequence in which they are calculated. In the second stage 104, each butterfly calculation receives input data that is separated by 2 storage locations, and the output data values are stored in memory locations that are also 2 storage locations apart. In the third stage 106, each butterfly calculation receives data that is 4 storage locations apart and provides output data values that are also stored 4 storage locations apart. Thus, the distance between the storage locations where the output data values are stored (the stride) varies as a power of two from $2^0$ to $2^{\log_2(N) - 1}$, that is, from 1 to N/2. Thus, in the illustrative embodiment, the stride varies between 1 and 4, as discussed above.
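The changing stride can be made explicit by listing, for each stage, the pairs of storage locations that each butterfly reads and writes. The following Python sketch uses a hypothetical function name and assumes the in-place, bit-reversed-input layout described above.

```python
def butterfly_index_pairs(N):
    """For each stage, return (stride, list of (upper, lower) index pairs)."""
    stages = N.bit_length() - 1
    result = []
    for m in range(1, stages + 1):
        stride = 2 ** (m - 1)
        pairs = [(base + b, base + b + stride)
                 for base in range(0, N, 2 * stride)  # one base per group
                 for b in range(stride)]              # butterflies in the group
        result.append((stride, pairs))
    return result

for stride, pairs in butterfly_index_pairs(8):
    print(stride, pairs)
```

For N = 8 this lists strides 1, 2 and 4, with pairs such as (0, 1) in the first stage, (0, 2) in the second, and (0, 4) in the third.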
In a typical computing system, the most time-consuming operations are reading data from memory and writing data to memory. Since the FFT is a very data-intensive algorithm, many schemes have been developed to optimize memory addressing. Typically, memory systems have been designed to increase the performance of the FFT by changing the pattern in which the data is stored in memory, by using smaller, faster memories for the data, or by dedicating specific hardware to calculate the desired memory locations. However, the very nature of the traditional FFT as shown in FIG. 1 illustrates the limitations of these approaches. For each stage, a new stride must be computed and, for each stage, there are only so many ways to change the pattern of the memory storage. Modern computer languages also allow memory locations to be accessed directly using “pointers”. Pointer arithmetic can be time-consuming as well, and the need to recalculate the pointer arithmetic for each stage is inefficient.
In addition to the data storage problem, the control and overhead processing for a computer program traditionally takes up the bulk of the program memory, but only a small fraction of the actual processing time. Therefore, minimizing the control and overhead portions of a computer program is one method to further optimize the memory usage of the program. As discussed above, in the signal flow graph 100 the number of stage loops and the number of butterfly loops to be executed are set by the system parameters, in particular the number of input data points used. The number of group loops to be executed, however, changes with each stage. In particular, in the early stages of the algorithm, the overhead and control software executes a large number of group loops, each having a small number of butterfly loops, for each stage. This entering and exiting of the group loops results in a complex iteration space in which a large number of overhead and control instructions must be executed, resulting in inefficient program execution.
It would therefore be desirable to be able to compute an FFT in a manner that reduces the number of required iterations and simplifies the calculation of the storage locations of the output data values from each stage in memory.