This disclosure relates to systems and methods for providing power- and bandwidth-efficient Fast Fourier Transform (FFT) architectures in a device, for example, an application-specific standard product (ASSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), a full-custom chip, or a dedicated chip.
There are many well-known FFT algorithms, such as the Constant Geometry FFT algorithm, the Cooley-Tukey FFT algorithm, the Prime-Factor FFT algorithm, Bruun's FFT algorithm, Rader's FFT algorithm, or Bluestein's FFT algorithm, each of which lends itself to hardware implementation. Although the embodiments disclosed herein are primarily discussed within the context of the Constant Geometry FFT algorithm for clarity, other FFT algorithms, or variants thereof, may also be used.
An FFT calculation includes reading an input data sequence with data samples x[n], n=0, . . . , N−1, where N is the length of the input data sequence, and outputting the frequency domain FFT data sequence with data samples X[k], k=0, . . . , N−1. Such a calculation is conventionally called an N-point FFT. FFT algorithms use a divide and conquer approach to reduce the computational complexity of calculating an FFT. For example, the Cooley-Tukey algorithm recursively decomposes the problem of calculating the FFT into two sub-problems of half the size (i.e., N/2) at every intermediate pass. The size of the FFT decomposition is known as the radix. In the above example, the radix is 2. This decomposition approach generally works provided that N is a power of 2. Thus, calculating an FFT typically involves making a number of passes (also referred to as stages) over the input data sequence and intermediate results. In general, each pass can be associated with a different radix.
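As a minimal illustration of the radix-2 divide-and-conquer decomposition described above (a sketch only, not the disclosed hardware architecture; the function name and plain-list representation are assumptions for illustration):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    # Decompose into two sub-problems of half the size (even/odd indices).
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the two half-size sub-transforms.
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out
```

For example, `fft([1, 1, 1, 1])` yields a result whose only nonzero sample is `X[0] = 4`, the sum of the input sequence, as expected for a constant input.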
A number of applications have emerged recently that make use of long-length FFTs. However, programmable devices typically have a relatively limited amount of on-board memory to support such long-length FFT calculations. For example, a 1-million-point double-precision floating-point FFT requires 128 million bits of data memory for the storage of one pass of the FFT. Therefore, external memory may be required for storing data required for calculating these types of FFTs.
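The memory figure above can be checked with a short calculation, assuming one pass stores N complex samples with 64-bit (double-precision) real and imaginary parts:

```python
# Illustrative arithmetic only; assumes one complex sample per point,
# each component stored in IEEE 754 double precision (64 bits).
N = 1 << 20               # a 1-million-point FFT (2**20 points)
bits_per_sample = 2 * 64  # complex double: real part + imaginary part
total_bits = N * bits_per_sample
print(total_bits)         # 134217728 bits, i.e., 128 * 2**20 ("128 million" bits)
```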
Many current FFT implementations utilize external memory, e.g., Synchronous Dynamic Random Access Memory (SDRAM), which is generally inexpensive, for storing the data required for calculating the FFT. Conventional implementations read data from the external memory in an out-of-order fashion. This can be inefficient in terms of power consumption and memory bandwidth when SDRAM is used. Other FFT implementations may utilize Reduced-Latency Dynamic Random Access Memory (RLDRAM) or Quad Data Rate Static Random Access Memory (QDRSRAM) memories. Accessing read data in an out-of-order fashion is not as inefficient when RLDRAM and QDRSRAM are used; however, RLDRAM and QDRSRAM are generally expensive. In addition, regardless of the type of external memory used, the I/O and memory interface bandwidth resources required to utilize the external memory may be expensive. One of the challenges in FFT implementations is the handling of data from memory in a manner that scales well as the FFT length increases.
As an example, consider the calculation of a 64-point FFT using radix R=4. For computing the FFT, an FFT processor conventionally processes the input data sequence with the indices of the data samples arranged in the following order:
00, 16, 32, 48, 01, 17, 33, 49, 02, 18, 34, 50, 03, 19, 35, 51, 04, 20, 36, 52, . . . , 15, 31, 47, 63.
This order of data samples is referred to as a radix-reversed order. In the first pass of the FFT calculation, data samples corresponding to indices 00, 16, 32, and 48 are used to compute a first radix-4 butterfly; data samples corresponding to indices 01, 17, 33, and 49 are used to compute the next radix-4 butterfly; and so on. An FFT butterfly is a portion of the FFT calculation that breaks up the larger FFT calculation into smaller sub-transform calculations. Each radix-R butterfly may itself be an FFT of size R. From the point of view of memory interface utilization, processing the input data sequence in the radix-reversed order is inefficient in terms of both power and bandwidth, because reading the input data sequence in the radix-reversed order requires accessing data from external memory in an out-of-order memory access pattern. In addition, although reading data samples requires out-of-order accesses to external memory, the final FFT calculation results are still written to memory in a sequential pattern.
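The radix-reversed order above can be generated programmatically; a sketch follows, where the function name and the constant-stride grouping (stride N/R for the first pass) are illustrative assumptions consistent with the 64-point, radix-4 example:

```python
def radix_reversed_order(n, radix):
    """First-pass read order for an n-point, radix-R FFT:
    each butterfly reads samples j, j+stride, j+2*stride, ...,
    where stride = n // radix."""
    stride = n // radix
    return [j + stride * i for j in range(stride) for i in range(radix)]

order = radix_reversed_order(64, 4)
print(order[:8])   # [0, 16, 32, 48, 1, 17, 33, 49]
print(order[-4:])  # [15, 31, 47, 63]
```

The printed prefix and suffix match the index sequence given above for the 64-point, radix-4 case.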
Accessing the input data sequence from external memory can be inefficient because row-change overheads are incurred in accessing data from external memory, particularly from SDRAM memory. Specifically, rows in the external memory must be activated and pre-charged before they can be read from or written to, which causes inefficiencies. Moreover, consecutive accesses to two different rows in the same memory bank of the external memory also lead to inefficiencies.
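A simple model can illustrate why radix-reversed reads incur many more row changes than sequential reads. In the sketch below, the row size of 16 samples and the single-open-row policy are illustrative assumptions, not parameters of any particular memory device:

```python
def row_activations(indices, row_size=16):
    """Count row activations for an access pattern, assuming a row holds
    `row_size` consecutive samples, one row is open at a time, and any
    access to a different row requires an activate/pre-charge."""
    activations = 0
    open_row = None
    for idx in indices:
        row = idx // row_size
        if row != open_row:
            activations += 1
            open_row = row
    return activations

sequential = list(range(64))
radix_reversed = [j + 16 * i for j in range(16) for i in range(4)]
print(row_activations(sequential))      # 4  (each row opened once)
print(row_activations(radix_reversed))  # 64 (a row change on every access)
```

Under these assumptions, the radix-reversed pattern changes rows on every access, while the sequential pattern opens each row only once, which is consistent with the row-change inefficiency described above.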