The fast Fourier transform (FFT) implementation of the discrete Fourier transform (DFT) is an important block in many digital signal processing applications, including those that perform spectral analysis or correlation analysis. The purpose of the DFT is to compute the sequence {X(k)} of N complex-valued numbers from a given sequence {x(n)}, also of length N, where {X(k)} is expressed by the formula:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \quad \text{where } W_N = e^{-j2\pi/N}$$
It can be observed from these formulae that, for each value of k, a direct computation of X(k) involves N complex multiplications and N−1 complex additions. Thus, computing the DFT for all N values of k requires N² complex multiplications and N²−N complex additions.
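The operation count above can be checked with a minimal sketch that evaluates the formula directly. The function name `direct_dft` and the operation counters are illustrative assumptions and are not part of any cited design.

```python
import cmath


def direct_dft(x):
    """Evaluate X(k) = sum_n x(n) * W_N^(k*n) directly, with W_N = exp(-j*2*pi/N).

    Returns the output sequence together with the number of complex
    multiplications and complex additions that were performed.
    """
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    X = []
    mults = 0
    adds = 0
    for k in range(N):
        acc = x[0] * W ** 0          # first product, no addition yet
        mults += 1
        for n in range(1, N):
            acc += x[n] * W ** (k * n)
            mults += 1
            adds += 1
        X.append(acc)
    return X, mults, adds


if __name__ == "__main__":
    N = 8
    _, mults, adds = direct_dft([1.0] * N)
    # The counts should equal N*N and N*N - N respectively.
    print(mults, adds)
```

For N = 8 the counters come to 64 multiplications and 56 additions, which is the quadratic growth that motivates the divide-and-conquer decomposition.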
This general form solution can be decomposed using a divide-and-conquer approach, where the most commonly used decimating factors are 2 or 4 (leading to the “radix-2” or “radix-4” FFT implementations of the DFT). An example of a discussion of this implementation may be found in Digital Signal Processing: Principles, Algorithms and Applications, by J. G. Proakis and D. G. Manolakis, Prentice-Hall Publishing Inc., 1996.
In such a divide-and-conquer approach, the computation of the DFT is performed by decomposing the DFT into a sequence of nested DFTs of progressively shorter lengths. This nesting and decomposition is repeated until the DFT has been reduced to its radix. At the radix level, a butterfly operation can be performed to determine a partial result which is provided to the other decompositions. Twiddle factors, which are used to perform complex rotations during the DFT calculation, are generated as the divide-and-conquer algorithm proceeds. For a radix-2 decomposition, a length-2 DFT is performed on the input data sequence {x(n)}. The results of the first stage of length-2 DFTs are combined using a length-2 DFT and then the resulting value is rotated using the appropriate twiddle factors. This process continues until all N values have been processed and the final output sequence {X(k)} is generated. FFT processors performing the above process are commonly implemented as dedicated processors in an integrated circuit.
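The radix-2 decomposition described above can be sketched in software as a recursive decimation-in-frequency routine. This is only an illustrative model under the assumption that N is a power of 2; the function name `fft_dif_radix2` is hypothetical, and a hardware processor would not be organized recursively.

```python
import cmath


def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT of a power-of-2 length sequence."""
    N = len(x)
    if N < 1 or N & (N - 1):
        raise ValueError("length must be a power of 2")
    if N == 1:
        return list(x)
    half = N // 2
    # Length-2 DFT (butterfly) on samples that are N/2 apart.
    sums = [x[n] + x[n + half] for n in range(half)]
    # The difference branch is rotated by the twiddle factor W_N^n.
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]
    # Each half is itself a DFT of length N/2.
    even = fft_dif_radix2(sums)    # yields X(0), X(2), X(4), ...
    odd = fft_dif_radix2(diffs)    # yields X(1), X(3), X(5), ...
    X = [0j] * N
    X[0::2] = even
    X[1::2] = odd
    return X
```

Each level of the recursion performs N/2 butterflies, and there are log2(N) levels, so the number of complex multiplications grows as (N/2)·log2(N) rather than N².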
Many previous approaches have improved the throughput of FFT processors while balancing latency against the area requirements through the use of a pipeline processor-based architecture. In a pipeline processor architecture, the primary concern from the designer's perspective is increasing throughput and decreasing latency while attempting to also minimize the area requirements of the processor architecture when the design is implemented in a manufactured integrated circuit.
A common pipeline FFT architecture achieves these aims by implementing one length-2 DFT (also called a radix-2 butterfly) for each stage in the DFT recombination calculation. It is also possible to implement fewer than or more than one butterfly per recombination stage. However, in a real-time digital system, it is sufficient to match the computing speed of the FFT processor to the input data rate. Thus, if the data acquisition rate is one sample per computation cycle, it is sufficient to have a single butterfly per recombination stage.
A brief review of pipeline FFT architectures in the prior art is provided below, in order to place the FFT processor of this invention into perspective.
In this discussion, designs implementing the radix-2, radix-4 and more complex systems are described. Input and output order is assumed to be the most appropriate form for the particular design. If a different order is required, an appropriate re-ordering buffer (consisting of both on-chip memory and control circuits) can be provided at the input or output of the pipeline FFT, which is noted as a “cost” of implementation as that adds complexity or uses additional area on chip.
FFT implementations that accept in-order input data are most suitable for systems where data arrives at the FFT one sample at a time. This includes systems such as wired and wireless data transmission systems. Out-of-order input handling is most appropriate when the input data is buffered and can be pulled from the buffer in any order, such as in an image analysis system.
All of the discussed architectures are based on the Decimation-in-Frequency (DIF) decomposition of the DFT. Input and output data are complex-valued, as are all arithmetic operations.
For the radix-2 designs, a constraint that N is a power of 2 applies, and for the radix-4 designs, a constraint that N is a power of 4 applies. For simplification of algorithmic analysis, all of the control and twiddle factor hardware has been omitted. Because the control hardware plays a minor role in the overall size of the FFT, this is acceptable for a coarse comparison of the architectures.
FIG. 1 illustrates a conventional Radix-2 Multi-path Delay Commutator (“R2MDC”) pipeline FFT processor. The R2MDC approach breaks the input sequence into two parallel data streams. In each butterfly module, one of which is labeled 100, a commutator 102 receives the data stream as input and delays half of the data stream with memory 104. The delayed data is then processed with the second half of the data stream in a radix-2 butterfly unit 106. Part of the output of the butterfly unit 106 is delayed by buffering memory 108 prior to being sent to the next butterfly module. In each subsequent butterfly module, the sizes of memories 104 and 108 are halved. The processor of FIG. 1 implements a 16-point R2MDC pipeline FFT. In terms of efficiency of design, the multipliers and adders in the R2MDC architecture are 50% utilized. The R2MDC architecture requires 3N/2 - 2 delay registers.
A Radix-4 Multi-path Delay Commutator (“R4MDC”) pipeline FFT is a radix-4 version of the R2MDC, where the input sequence is broken into four parallel data streams. In terms of efficiency of design, the R4MDC architecture's multipliers and adders are 25% utilized, and the R4MDC designs require 5N/2 - 4 delay registers. An exemplary 256-point R4MDC pipeline implementation is shown in FIG. 2. The FFT processor of FIG. 2 is composed of butterfly modules, such as butterfly module 110. Butterfly module 110 includes commutator 112 with an associated memory 114, and butterfly unit 116 with an associated memory 118. The commutator 112 orders samples and stores them in memory 114. When memory 114 is sufficiently full, three samples are provided from memory 114, along with one sample from commutator 112, to the radix-4 butterfly unit 116. A standard radix-4 butterfly operation is performed on the samples, and the results are provided to a subsequent commutator after some of them have been buffered in memory 118. The use of memories 114 and 118 ensures in-order delivery of the samples between butterfly units.
A Radix-2 Single-path Delay Feedback (“R2SDF”) pipeline FFT design uses the memory registers more efficiently than the R2MDC implementation by storing the butterfly output in feedback shift registers. In terms of efficiency, R2SDF designs achieve 50% utilization of multipliers and adders and require N-1 delay registers, which are fully utilized. FIG. 3 shows the basic architecture of a prior art R2SDF for a 16-point FFT. A butterfly module is composed of a radix-2 butterfly unit, such as butterfly unit 120, and its associated feedback memory 122. The size of the memory 122a-122d in a butterfly module varies with the position of the module in the series. Butterfly unit 120 receives an input series of 16 samples and buffers the first 8 samples in feedback memory 122a. Starting with the ninth sample in the series, butterfly unit 120 serially pulls the stored samples from feedback memory 122a and performs butterfly operations on the sample pairs. Outputs that are produced out of order are stored in feedback memory 122a until they can be provided in order, so that the next butterfly module receives an in-order output.
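The buffering behavior of the single-path delay feedback approach can be illustrated with a simplified behavioral model that accepts one sample per cycle. This sketch is an assumption-laden software analogy and not a description of the circuit of FIG. 3: the class `R2SDFStage`, the functions `r2sdf_stream` and `r2sdf_fft`, the zero padding used to flush the pipeline, and the placement of the twiddle multiplication at the stage output are all illustrative choices.

```python
import cmath
from collections import deque


class R2SDFStage:
    """One radix-2 butterfly with a feedback shift register of depth `delay`."""

    def __init__(self, delay):
        self.delay = delay
        self.fifo = deque([0j] * delay)  # feedback memory
        self.count = 0                   # position inside a block of 2*delay samples
        self.seen = 0                    # total samples accepted so far

    def step(self, sample):
        D = self.delay
        stored = self.fifo.popleft()
        if self.count < D:
            # First half of a block: buffer the input and release the stored
            # difference from the previous block, rotated by its twiddle factor.
            self.fifo.append(sample)
            out = stored * cmath.exp(-2j * cmath.pi * self.count / (2 * D))
        else:
            # Second half: butterfly between the stored and incoming samples.
            self.fifo.append(stored - sample)  # difference goes back to memory
            out = stored + sample              # sum goes straight to the output
        self.count = (self.count + 1) % (2 * D)
        self.seen += 1
        return out if self.seen > D else None  # first D outputs are not valid


def r2sdf_stream(x):
    """Stream x through log2(N) stages; results emerge in bit-reversed order."""
    N = len(x)
    if N < 1 or N & (N - 1):
        raise ValueError("length must be a power of 2")
    stages = []
    delay = N // 2
    while delay >= 1:
        stages.append(R2SDFStage(delay))  # depths N/2, N/4, ..., 1 (N-1 in total)
        delay //= 2
    out = []
    for sample in list(x) + [0j] * (N - 1):  # N-1 padding cycles flush the pipeline
        for stage in stages:
            sample = stage.step(sample)
            if sample is None:
                break
        else:
            out.append(sample)
    return out


def r2sdf_fft(x):
    """Reorder the bit-reversed stream into natural order X(0), X(1), ..."""
    stream = r2sdf_stream(x)
    bits = len(stream).bit_length() - 1
    X = [0j] * len(stream)
    for p, value in enumerate(stream):
        X[int(format(p, "0{}b".format(bits))[::-1], 2)] = value
    return X
```

In this model the feedback memories have depths N/2, N/4, ..., 1, which sum to N-1 and agree with the delay register count stated above. The reordering step in `r2sdf_fft` corresponds to the re-ordering buffer that is treated as an implementation cost when natural output order is required.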
A Radix-4 Single-path Delay Feedback (“R4SDF”) pipeline FFT is a radix-4 version of the R2SDF design. The utilization of the multipliers increases to 75% in implementation, but the adders are only 25% utilized, while the design will require N-1 delay registers. The memory storage is fully utilized. A 256-point R4SDF pipeline example from the prior art is shown in FIG. 4. The structure of the processor of FIG. 4 is similar to that of FIG. 3, with butterfly modules being composed of a radix-4 butterfly unit, such as BF4 124, and an associated feedback memory 126. The size of feedback memory 126 decreases from 126a-126d in accordance with the amount of separation required between samples. The butterfly modules of FIG. 4 function in the same fashion as those of FIG. 3, with additional samples being stored in feedback memory 126 in each cycle.
A Radix-4 Single-path Delay Commutator (“R4SDC”) uses a modified radix-4 algorithm to achieve 75% utilization of multipliers, and has a memory requirement of 2N-2. A prior art 256-point R4SDC pipeline FFT is shown in FIG. 5. FIG. 5 has single-input, single-output butterfly modules, such as butterfly module 127. In butterfly module 127, a single input is provided to commutator 128, which stores and reorders samples using an internal memory. Commutator 128 provides the samples four at a time to radix-4 butterfly unit 129. The output of butterfly unit 129 is serially provided to the next butterfly module.
A Radix-2² Single-path Delay Feedback (“R2²SDF”) pipeline FFT design breaks one radix-4 butterfly operation into two radix-2 butterfly operations with trivial multiplications of ±1 and ±j in order to achieve 75% multiplier utilization and 50% adder utilization, with memory requirements of N-1. The architecture of an exemplary 256-point R2²SDF implementation is illustrated in FIG. 6. Butterfly modules are composed of butterfly units such as BF2I 130 and an associated feedback memory such as memory 131. Butterfly unit 130 receives a series of input samples and buffers the first set of samples in memory 131, then performs pairwise butterfly operations using stored samples and the incoming series. The operation of this processor is functionally similar to that of the processor of FIG. 4 with the differences noted above.
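The splitting of one radix-4 butterfly into two radix-2 butterfly operations can be verified arithmetically with a short sketch. The function names are hypothetical, and the code models only the arithmetic of a single four-sample butterfly, not the feedback memories or the inter-stage twiddle multipliers of FIG. 6.

```python
def radix4_butterfly(a, b, c, d):
    """Length-4 DFT computed directly; only factors of +/-1 and +/-j appear."""
    return (
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    )


def radix4_via_two_radix2(a, b, c, d):
    """The same result obtained from two layers of radix-2 butterflies."""
    # First layer: pair the samples that are two positions apart.
    s0, d0 = a + c, a - c
    s1, d1 = b + d, (b - d) * -1j  # trivial rotation by -j, no real multiplier
    # Second layer: combine the partial results.
    return (s0 + s1, d0 + d1, s0 - s1, d0 - d1)


if __name__ == "__main__":
    samples = (1, 2, 3, 4)
    print(radix4_butterfly(*samples))
    print(radix4_via_two_radix2(*samples))
```

Both functions return the same four values for any input, which is why the only non-trivial multiplications that remain are the twiddle rotations applied between radix-4 groups.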
U.S. patent application Publication No. 2002/0178194A1 to Aizenberg et al. teaches the calculation of a single twiddle factor before moving onto a new twiddle factor during computation. It uses a single butterfly which uses both a true adder and an arithmetic logic unit (ALU). The advantage of the resulting circuit is a reduction in the implementation area which comes at the cost of reduced data throughput.
U.S. patent application Publication No. 2002/0083107A1 to Park et al. teaches the use of radix-4 and radix-2 butterfly units together to reduce the number of complex multiplications performed. It uses the multi-path delay commutator architecture or the single-path delay commutator architecture.
U.S. Pat. No. 6,408,319 to Cambonie teaches a memory architecture based upon the radix-4 architecture. The memory allocation in this patent is based on a loopback architecture similar to the single-path delay feedback (SDF) architecture. Furthermore, this patent teaches the use of a single-access memory. SDF architectures, such as this one, have sub-optimal adder requirements in their implementation.
U.S. Pat. No. 5,694,347 to Ireland teaches an architecture based on a decimation in time algorithm for the FFT. The butterfly disclosed is large in comparison to other butterflies and does not offer additional throughput or a reduction in the area of other components in the system.
The prior art involves trade-offs in design implementation among implementation area, power consumption, complexity, and data throughput. Although some innovation has taken place in the area of altered algorithms, including the use of hybrid single butterfly/pipelined throughput and novel addressing schemes, the prior art FFT processors do not provide satisfactory implementation area and power consumption without incurring high degrees of complexity and impairing throughput.
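For a coarse side-by-side view of these trade-offs, the delay register counts and utilization figures quoted in the architecture review above can be collected into a small lookup. The table and the function `delay_registers` are illustrative assumptions. The R4MDC count is read here as 5N/2 - 4, the key "R2^2SDF" stands for the radix-2² single-path delay feedback design, and no adder utilization is stated for the R4SDC design, so it is left as `None`.

```python
# name: (delay registers as a function of N, multiplier utilization, adder utilization)
ARCHITECTURES = {
    "R2MDC": (lambda N: 3 * N // 2 - 2, 0.50, 0.50),
    "R4MDC": (lambda N: 5 * N // 2 - 4, 0.25, 0.25),  # assumed reading of the count
    "R2SDF": (lambda N: N - 1, 0.50, 0.50),
    "R4SDF": (lambda N: N - 1, 0.75, 0.25),
    "R4SDC": (lambda N: 2 * N - 2, 0.75, None),       # adder figure not stated
    "R2^2SDF": (lambda N: N - 1, 0.75, 0.50),
}


def delay_registers(N):
    """Return the delay register count of each architecture for an N-point FFT."""
    return {name: spec[0](N) for name, spec in ARCHITECTURES.items()}


if __name__ == "__main__":
    for name, count in delay_registers(256).items():
        _, mult_util, add_util = ARCHITECTURES[name]
        print(name, count, mult_util, add_util)
```

For a 256-point transform, the feedback designs need 255 delay registers, against 382 for the R2MDC, 510 for the R4SDC and 636 for the R4MDC, which shows how strongly the memory cost depends on the choice of architecture.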