Common orthogonal transforms provide a powerful tool for encoding information transmitted in wireless communication systems, and different transforms are used depending on the protocol used to transmit the information. The FFT (Fast Fourier Transform)/IFFT (Inverse FFT), for example, is a critical computational block in, e.g., OFDM systems and filter banks. See, for example, N. Weste and D. J. Skellern, “VLSI for OFDM,” IEEE Communications Magazine, vol. 36, no. 10, pp. 127-131, October 1998, and R. van Nee and R. Prasad, OFDM for Wireless Multimedia Communications, Artech House Publishers, 2000.
An attractive feature of the FFT/IFFT is that the IFFT can be performed using an FFT block, by conjugating the input and output of the FFT and dividing the output by the size of the processed vectors. Hence the same hardware can be used for both FFT and IFFT. Several standard implementations of the FFT/IFFT are known, some of which provide reconfigurability. One standard FFT/IFFT implementation uses FFT kernel arithmetic.
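As an illustration of the conjugation property described above, the following Python sketch computes an inverse transform using only a forward-transform routine. The helper names `dft` and `ifft_via_fft` are hypothetical, and a direct DFT merely stands in for the FFT block; any forward FFT routine could be substituted.

```python
import cmath

def dft(x):
    """Forward DFT by the direct definition (stand-in for any FFT block)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def ifft_via_fft(spectrum, fft=dft):
    """Inverse transform that reuses the forward block (hypothetical helper)."""
    n_pts = len(spectrum)
    conj_in = [complex(v).conjugate() for v in spectrum]   # conjugate the input
    forward = fft(conj_in)                                 # same block as the FFT
    return [v.conjugate() / n_pts for v in forward]        # conjugate output, divide by N
```

Applying `ifft_via_fft(dft(x))` to a sequence `x` should return `x` again, up to floating-point error.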
FFT Kernel Arithmetic:
The digital computation of the N-point DFT (discrete Fourier transform) (see, for example, A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice Hall, N.J., 1989) is:
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k \in [0, N) \tag{1}$$
where the complex exponential coefficients are:
$$W_b^a = e^{-j 2\pi \frac{a}{b}}.$$
Direct computation of the DFT (for all k) requires N×N multiplications and N×(N−1) additions. FFT algorithms are more efficient implementations that reduce the number of multiplications to the order of N log2N. The basic idea is to divide the FFT of length N into two FFT components of length N/2, each of which is then further divided into two FFT components of length N/4, etc. This process continues until the length of each FFT component is reduced to 2, which can be computed directly by a so-called “butterfly” unit. The trellis of such a butterfly unit is illustrated in FIG. 1.
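The divide-and-conquer idea can be sketched in Python as follows. This is a minimal recursive sketch, assuming N is a power of two and using the decimation-in-frequency split written out in equations (2) and (3); the function name `fft_dif` is hypothetical, and the code does not represent any particular hardware implementation.

```python
import cmath

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n_pts = len(x)
    if n_pts == 1:
        return [complex(x[0])]
    half = n_pts // 2
    # Butterfly upper branch: sums of samples N/2 apart.
    upper = [x[n] + x[n + half] for n in range(half)]
    # Butterfly lower branch: differences, multiplied by the coefficient W_N^n.
    lower = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_pts)
             for n in range(half)]
    out = [0j] * n_pts
    out[0::2] = fft_dif(upper)   # even-indexed outputs X[2r]
    out[1::2] = fft_dif(lower)   # odd-indexed outputs X[2r+1]
    return out
```

Because the even and odd outputs are interleaved on the way back up, this software sketch returns its results in natural order; a hardware DIF trellis typically delivers them in bit-reversed order instead.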
Two commonly used FFT algorithms are the decimation-in-frequency (DIF) and decimation-in-time (DIT) algorithms, which are similar in nature. The DIF algorithm is used here to illustrate the architectural implementations. In it, the FFT results are divided into even-indexed and odd-indexed parts, with:
$$
\begin{aligned}
X[2r] &= \sum_{n=0}^{N/2-1} x[n]\, W_N^{2rn} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{2rn} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{2rn} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{2r(n+N/2)}, \qquad r \in [0, N/2) \\
&= \sum_{n=0}^{N/2-1} \underbrace{\left(x[n] + x[n+N/2]\right)}_{\text{Butterfly upper branch}}\, W_{N/2}^{rn}
\end{aligned}
\tag{2}
$$
and similarly,
$$
X[2r+1] = \sum_{n=0}^{N/2-1} \underbrace{\left(x[n] - x[n+N/2]\right) W_N^{n}}_{\text{Butterfly lower branch}}\, W_{N/2}^{rn}. \tag{3}
$$
Standard Implementation:
In the standard prior art approach, to provide function-specific reconfigurability it is first necessary to analyze the computational structure. The FFT can be viewed as a shuffle-exchange interconnecting network of butterfly blocks, which varies with the size of the FFT, thus making it difficult to support flexibility in the most energy-efficient, fully parallel implementation. In the fully parallel implementation the signal flow graph can be directly mapped onto hardware. For instance, for a 16-point FFT there are a total of 32 butterfly units, interconnected as shown by the trellis in FIG. 2. In general, the N-point FFT requires $\frac{N}{2}\log_2 N$ butterfly units. This maximally parallel architecture has the potential for high performance and low power consumption; however, it bears the high cost of a large silicon area, especially for large FFT sizes.
The outputs generated by the DIF FFT are in bit-reversed order. For example, $X[10] = X[1010_2] = Y[0101_2] = Y[5]$.
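The index mapping in this example can be reproduced with a short Python sketch. The function names `bit_reverse` and `to_natural_order` are hypothetical, and the number of index bits is assumed to be log2N for a power-of-two N.

```python
def bit_reverse(index, n_bits):
    """Reverse the lowest n_bits bits of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)   # move the current LSB into the result
        index >>= 1
    return result

def to_natural_order(y):
    """Reorder a bit-reversed output sequence y so that element k holds X[k]."""
    n_bits = len(y).bit_length() - 1           # log2(N), N assumed a power of two
    return [y[bit_reverse(k, n_bits)] for k in range(len(y))]
```

For a 16-point transform, `bit_reverse(10, 4)` returns 5, matching the example above. Since bit reversal is its own inverse, the same mapping converts in both directions.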
When the implementation is done in fixed-point arithmetic, scaling and overflow handling are crucial for the correct behavior of the transform. The butterfly operation at each stage of the FFT involves both complex addition and complex multiplication. Each complex addition is composed of two real additions, each of which expands the input word-length by 1 bit. Each complex multiplication is composed of four real multiplications and two real additions. A real multiplication doubles the input word-length. Thus, to ensure correct behavior for M-bit inputs, either the output word-length is increased to (M+1) bits, or the output is truncated or rounded to M bits. If truncation is performed, the most significant bit of the output is simply discarded, limiting the values to the maximum values that can be represented in M bits. If rounding is performed, a “1” is first added to the positive outputs, the output is then shifted to the right by 1 bit, and the least significant bit is discarded. Rounding will not cause adder overflow, since the largest and smallest values of (a+b) are even numbers, i.e., their least significant bit after the addition is zero. After rounding, the output will be in the same range as that of a and b, i.e., M bits.
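The word-length growth and the rounding step can be illustrated with a small Python sketch. It assumes signed two's-complement M-bit values and interprets the rounding described above as adding 1 to positive sums before an arithmetic right shift by one bit; this interpretation and the function names are assumptions, not details taken from a specific implementation.

```python
def fits_in_bits(value, m_bits):
    """True if value is representable as a signed two's-complement m_bits number."""
    low = -(1 << (m_bits - 1))
    high = (1 << (m_bits - 1)) - 1
    return low <= value <= high

def add_and_round(a, b):
    """Add two M-bit values and scale the (M+1)-bit sum back to M bits."""
    total = a + b        # the exact sum may need M+1 bits
    if total > 0:
        total += 1       # the "1" added to positive outputs
    return total >> 1    # arithmetic shift right: the least significant bit is discarded
```

For M = 8, the exact sum 127 + 127 = 254 does not fit in 8 bits, whereas `add_and_round(127, 127)` gives 127 and `add_and_round(-128, -128)` gives -128, so the scaled result, which represents (a+b)/2, stays within the input range.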
Column Based Approach:
In a column-based FFT architecture, the computations are rearranged such that the interconnections are kept identical in every stage, as shown by the trellis in FIG. 3. Since the inputs to a butterfly are no longer needed once the outputs are computed, the outputs can be routed to the inputs of the same butterflies, with the same butterflies thus being reused for the next and successive stages in an iterative way (in-place computation). As a result, only a single column of butterflies is needed, the column being reused (time-multiplexed) by the different stages of computation. The FFT coefficients, however, need to be changed from stage to stage. In general, an N-point FFT needs N/2 butterfly units, e.g., 8 butterflies are needed for a 16-point FFT. Its power consumption is very close to that of a fully parallel architecture, but it requires less area. Still, converting it to a reconfigurable design is a complicated task, since the simple iterative structure is optimized for a specific size. The transition from a parallel to a column-based implementation requires more clock cycles for processing an FFT frame. Indeed, the parallel approach allows processing of a full FFT frame in one clock cycle, while the column approach needs log2N clock cycles (when using a radix-2 based butterfly architecture) due to the iterative time-multiplexed structure.
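A software analogue of the reused column can be sketched as follows. The sketch assumes a constant-geometry radix-2 DIF arrangement (butterfly i always reads positions i and i+N/2 and writes positions 2i and 2i+1), which is one way to keep the interconnections identical in every stage; whether FIG. 3 uses exactly this wiring is an assumption, and the function name `column_fft` is hypothetical.

```python
import cmath

def column_fft(x):
    """Radix-2 DIF FFT computed by one reused column of N/2 butterflies.

    len(x) must be a power of two. The result is in bit-reversed order.
    """
    n_pts = len(x)
    stages = n_pts.bit_length() - 1      # log2(N) passes through the column
    half = n_pts // 2
    data = [complex(v) for v in x]
    for s in range(stages):              # one pass corresponds to one stage
        nxt = [0j] * n_pts
        for i in range(half):            # the same N/2 butterflies in every pass
            a, b = data[i], data[i + half]           # identical wiring each stage
            exponent = (i >> s) << s                 # only the coefficient changes
            w = cmath.exp(-2j * cmath.pi * exponent / n_pts)
            nxt[2 * i] = a + b                       # upper branch
            nxt[2 * i + 1] = (a - b) * w             # lower branch
        data = nxt
    return data
```

The outer loop runs log2N times and the inner loop represents the N/2 butterflies that would operate in parallel in hardware, which mirrors the cycle count given above.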
Reconfigurable Design:
By choosing a regular pipelined architecture to run an FFT algorithm, it is possible to implement a reconfigurable design with very low energy overhead, even compared with the standard lower bound on the complexity of an FFT transform.
Pipelined Approach:
In the regular pipelined architecture, only one butterfly unit is used for each stage, yielding a total complexity of log2N butterfly units, compared to N/2×log2N in the fully parallel approach and N/2 in the column-based approach. An example of the pipelined approach is illustrated in FIG. 4 for a 16-point FFT. The multiplier 40 of each stage 42a, 42b and 42c is shown separately from the butterfly unit 44a, 44b and 44c to distinguish between hardware requirements. Each of the butterfly units 44a, 44b, 44c and 44d is time-multiplexed among the N/2 butterfly computations for each stage. For the stage including the butterfly unit 44c, the multiplier 40c is “j”. No multiplier is necessary for the output of the final butterfly unit 44d. The pipelined implementation needs more clock cycles per FFT frame than the column-based approach: the pipelined approach processes a full FFT frame in N clock cycles (when using a radix-2 based butterfly architecture), while the column approach needs log2N clock cycles due to its iterative time-multiplexed structure. In a hardware implementation of all stages, the number of clock cycles for processing an FFT frame is not an obstacle, since the data is inserted in a serial manner, frame by frame, and the number of clock cycles per frame is transformed into a constant initial delay, while the throughput remains high.
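The butterfly counts and clock cycles per frame quoted for the three radix-2 architectures can be tabulated with a short Python sketch. The function name `architecture_costs` and the dictionary keys are hypothetical; the figures are simply the expressions stated in the text, evaluated for a power-of-two N.

```python
def architecture_costs(n_pts):
    """Butterfly units and clock cycles per frame for radix-2 FFT architectures."""
    stages = n_pts.bit_length() - 1          # log2(N), N assumed a power of two
    return {
        "fully parallel": {"butterflies": (n_pts // 2) * stages, "cycles": 1},
        "column-based":   {"butterflies": n_pts // 2,            "cycles": stages},
        "pipelined":      {"butterflies": stages,                "cycles": n_pts},
    }

if __name__ == "__main__":
    for name, cost in architecture_costs(16).items():
        print(name, cost)
```

For N = 16 this gives 32, 8 and 4 butterfly units and 1, 4 and 16 clock cycles per frame, respectively, illustrating the trade-off between area and latency.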
The single-path delay feedback (SDF) implementation (see, for example, E. H. Wold and A. M. Despain, “Pipelined and parallel-pipeline FFT processors for VLSI implementation,” IEEE Trans. Comput., pp. 414-426, May 1984) uses memory more efficiently by storing the butterfly outputs in feedback shift registers or FIFOs 46. Their sizes are given in FIG. 4; in the example, the lengths of the registers are 8, 4, 2, and 1, respectively. A single data stream passes the multiplier at every stage.
Hybrid Approach:
The hybrid approach combines the benefits of the column and feedback approaches. It uses elements of the feedback approach to save memory, and column stages for better hardware utilization. Column-stage butterfly units that are 4 bits wide can be combined with a greater bus width and suitable reconfigurable multipliers. The architecture can also be converted to one with exactly the bus width necessary for high space utilization and algorithmic efficiency.
A popular architecture for running an iterative process is shown in FIG. 5. This FFT implementation utilizes a single butterfly unit 50. The single butterfly unit design is mainly focused on optimizing the scheduling and memory access scheme, i.e., providing a pipelined approach in which each of the stages is implemented by reusing the same butterfly unit, time-multiplexed in an iterative way. The Spiffee processor (see, for example, B. M. Baas, “A Low-power, high-performance, 1024-point FFT processor,” IEEE Journal of Solid-State Circuits, March 1999) is an example of using a cached memory architecture, including RAM 52 and multiplier 56, to exploit the regular memory access pattern of an FFT algorithm in order to achieve low power consumption. The processor, shown as controller 54, can be programmed to perform any length of FFT, but certain features, such as the cache sizes provided by RAM 52, are optimized only for a certain FFT size. This approach also operates at very low speeds, because N clock cycles are needed to compute an FFT frame through the full implementation of the pipeline algorithm, yielding a constant initial delay. That is, due to the iterative time-multiplexing of the stages by the reused butterfly unit 50, the full frame must be computed (which takes N clock cycles when using a radix-2 based butterfly unit) before the processor can begin to handle the next FFT frame.
One can make a more efficient FFT processor by using a larger-radix butterfly unit, e.g., a radix-4 based architecture. This reduces the number of clock cycles needed for processing a full FFT frame to N/2. Most of the FFT accelerators that are implemented in advanced DSPs and chips are based on radix-2 or radix-4 FFT processors. They have limited usage (only for FFT transforms), very low speed utilization, and suffer from the need for a high-clock-rate design.
Filter Implementation Based on Multiplex Pipelined Approach:
Using reconfigurable iterative schemes, such as the one shown in FIG. 6, one can implement any kind of filter or correlation function with high efficiency. This is achieved by using the multiplier of the last stage of an FFT transform for multiplication by a filter coefficient (a frequency-domain multiplication, equivalent to filtering in the time domain), followed by an IFFT, as best seen in FIG. 6 at 60. The scheme is also efficient in implementing any sub-product of an FFT/IFFT, e.g., Discrete Cosine/Sine Transforms (DCT and DST), and any algorithms which are a combination of the above-mentioned algorithms, such as filtering using cascaded FFT and IFFT algorithms (which can also be used for equalization, prediction, interpolation and computing correlations).
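The FFT, coefficient multiplication and IFFT cascade can be sketched in Python as follows. The sketch assumes the filter is specified by an impulse response of the same length as the data block, so that the result is a circular convolution; the names `transform` and `fft_filter` are hypothetical, and a direct DFT stands in for the FFT and IFFT blocks.

```python
import cmath

def transform(x, sign):
    """Direct DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def fft_filter(x, h):
    """Filter block x with impulse response h via FFT, multiply, IFFT."""
    n_pts = len(x)
    spectrum = transform(x, -1)                  # FFT of the data block
    coeffs = transform(h, -1)                    # filter coefficients per frequency bin
    product = [a * b for a, b in zip(spectrum, coeffs)]   # one multiply per bin
    return [v / n_pts for v in transform(product, +1)]    # IFFT, scaled by 1/N
```

With `x = [1, 2, 3, 4]` and `h = [1, 1, 0, 0]`, each output is the sum of a sample and its circular predecessor, i.e. approximately `[5, 3, 5, 7]`. A correlation can be obtained in the same way by conjugating the coefficients.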
FFT with Different Radixes:
The radix-2² algorithm is of particular interest. It has the same multiplicative complexity as the radix-4 and split-radix algorithms, while retaining a regular radix-2 butterfly structure. This spatial regularity provides a great structural advantage over other algorithms for VLSI implementation. The basic idea behind the radix-2² algorithm is to take two stages of the regular DIF FFT algorithm and maximize the number of trivial multiplications by $W_N^{N/4} = -j$, which involves only real-imaginary swapping and sign inversion. In other words, the FFT coefficients are rearranged and the non-trivial multiplications are lumped into one stage, so that only one complex multiplier is needed in every two stages (which reduces the overall logic area). FIG. 7 illustrates a trellis representing such a coefficient rearrangement (in parallel form): for any two butterfly coefficients $W_N^{i}$ and $W_N^{i+N/4}$, $W_N^{i}$ is factored out and forwarded to the next stage, which leaves the coefficients 1 and $-j$ in the corresponding positions. After this coefficient rearrangement has been performed over all the coefficient pairs, one stage is left without non-trivial multiplications.
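The trivial multiplication and the factoring of the coefficient pairs can be checked numerically with a short Python sketch. The function names `mul_minus_j` and `twiddle` are hypothetical; the sketch only illustrates the identity that the coefficient with exponent i + N/4 equals −j times the coefficient with exponent i.

```python
import cmath

def mul_minus_j(z):
    """Multiply by -j using only a real/imaginary swap and one sign inversion."""
    return complex(z.imag, -z.real)      # (a + jb)(-j) = b - ja

def twiddle(n_pts, exponent):
    """Coefficient W_N^exponent = exp(-j*2*pi*exponent/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_pts)

def factored_pair(n_pts, i):
    """Return (W_N^i, W_N^(i+N/4)) with the second obtained without a multiplier."""
    common = twiddle(n_pts, i)           # factor forwarded to the next stage
    return common, mul_minus_j(common)   # remaining coefficients are 1 and -j
```

No general complex multiplier is needed for the second coefficient of each pair, which is why one stage in every two is left with only trivial multiplications.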
Hybrid Pipeline/Multiplex Approach:
A number of pipelined FFT architectures have been proposed over the last decade. Since the spatial regularity of the signal flow graph is preserved in pipelined architectures, they are highly modular and scalable. The shuffle network 80 is implemented through the single-path delay feedback depicted in FIG. 8A, where the data is processed between stages 82 in a single path and feedback FIFO registers 84 are used to store new inputs and intermediate results. The basic idea behind this scheme is to store the data and scramble it so that the next stage can receive data in the correct order. While the FIFO registers 84 are being filled with the first half of the inputs, the last half of the previous results is shifted out to the next stage. During this time, the operational elements are bypassed. When the first half of the inputs is shifted out of the FIFO registers, it is fed into the processing elements along with the arriving second half of the inputs. During this time, the operational elements are working and generating two outputs, one fed directly to the next stage 82 and the other shifted into the corresponding FIFO registers. Multipliers (not shown) are inserted between stages when necessary, according to either the radix-2² or the radix-2 algorithm. A trellis and data packets for use in such an implementation are illustrated in FIGS. 8B and 8C, respectively.
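The fill, bypass and butterfly phases described above can be simulated clock by clock in Python. This is a behavioural sketch of a radix-2 DIF single-path delay feedback pipeline under simplifying assumptions: one sample per clock, FIFOs initialized to zero, and the coefficient multiplier placed at the output of each stage. The class and function names are hypothetical.

```python
import cmath
from collections import deque

class SdfStage:
    """One radix-2 SDF stage with a feedback FIFO of the given length."""

    def __init__(self, delay):
        self.delay = delay
        self.fifo = deque([0j] * delay)
        self.count = 0

    def clock(self, sample):
        phase = self.count % (2 * self.delay)
        self.count += 1
        oldest = self.fifo.popleft()
        if phase < self.delay:
            # Butterfly bypassed: store the new input and shift out the
            # earlier differences through the coefficient multiplier.
            self.fifo.append(sample)
            return oldest * cmath.exp(-2j * cmath.pi * phase / (2 * self.delay))
        # Butterfly active: the sum goes to the next stage, the difference
        # goes back into the FIFO.
        self.fifo.append(oldest - sample)
        return oldest + sample

def sdf_fft(frame):
    """Stream one frame through the pipeline; the result is in bit-reversed order."""
    n_pts = len(frame)                   # assumed to be a power of two
    stages, delay = [], n_pts // 2
    while delay >= 1:                    # FIFO lengths N/2, N/4, ..., 1
        stages.append(SdfStage(delay))
        delay //= 2
    outputs = []
    for sample in [complex(v) for v in frame] + [0j] * (n_pts - 1):
        for stage in stages:             # all stages advance on the same clock
            sample = stage.clock(sample)
        outputs.append(sample)
    return outputs[n_pts - 1:]           # valid after an initial delay of N-1 clocks
```

For a 16-point frame the sketch builds FIFOs of lengths 8, 4, 2 and 1. The trailing zeros only flush the pipeline here; in continuous operation the next frame would take their place, which is why the latency appears as a constant initial delay.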