1. Field of the Invention
The present invention relates to digital signal processing and, in particular, to implementation of a Fast Fourier Transform (FFT) butterfly calculation.
2. Description of the Related Art
Programmable digital signal processor implementations exhibit modest performance relative to dedicated hardware digital signal processor implementations when calculating a Fast Fourier Transform (FFT). Mobile communications systems require fast FFT calculation within a programmable processor platform that also performs a variety of other digital signal processing and control duties. One desirable capability for fast FFT calculation is performing the kernel calculation of the FFT, known as the radix-2 butterfly, in a single processor clock cycle.
The complex radix-2 butterfly requires the calculations of the following equations (1a) and (1b):

$$A_{n+1} = A_n + B_n W_k \tag{1a}$$

$$B_{n+1} = A_n - B_n W_k \tag{1b}$$

where $A_n$ and $B_n$ are complex coefficient values at stage $n$, and $W_k$ is a complex-valued coefficient commonly known in the art as the “twiddle factor”. The twiddle factor refers to the trigonometric constant coefficients $W_k$, $k = 0, 1, 2, \ldots, K$, that are multiplied by the data in the course of the algorithm. The coefficients are root-of-unity complex multiplicative constants in the butterfly operations of the Cooley-Tukey FFT algorithm, well known in the art of signal processing, and are employed to recursively combine smaller discrete Fourier transforms.
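As an illustration only, the recursive combination of smaller discrete Fourier transforms by butterflies of the form (1a) and (1b) might be sketched as follows. The function names `twiddle` and `fft_radix2` are hypothetical, the input length is assumed to be a power of two, and the negative-exponent form of the twiddle factor is an assumed sign convention that the text above does not specify.

```python
import cmath


def twiddle(k, n):
    """Root-of-unity coefficient W_k for an n-point transform.

    Assumption: forward-transform convention W_k = exp(-2*pi*j*k/n).
    """
    return cmath.exp(-2j * cmath.pi * k / n)


def fft_radix2(x):
    """Recursive radix-2 FFT sketch; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    # Two half-size transforms over the even- and odd-indexed samples.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = twiddle(k, n) * odd[k]      # B_n * W_k
        out[k] = even[k] + t            # equation (1a)
        out[k + n // 2] = even[k] - t   # equation (1b)
    return out
```

Each pass through the loop body is one butterfly: one complex multiplication followed by one complex addition and one complex subtraction.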
Each butterfly calculation has two inputs ($A_n$ and $B_n$) and two outputs ($A_{n+1}$ and $B_{n+1}$). The overall butterfly calculation requires one complex multiplication, one complex addition, and one complex subtraction. Defining the real components ($A_R$, $B_R$) and imaginary components ($A_I$, $B_I$) of the coefficients individually, equations (1a) and (1b) expand to the following equations (1a′) and (1b′):

$$(A_R + jA_I)_{n+1} = (A_R + jA_I)_n + (B_R + jB_I)_n \, (W_{R,k} + jW_{I,k}) \tag{1a′}$$

$$(B_R + jB_I)_{n+1} = (A_R + jA_I)_n - (B_R + jB_I)_n \, (W_{R,k} + jW_{I,k}) \tag{1b′}$$
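The real-valued arithmetic implied by equations (1a′) and (1b′) can be made explicit with a short sketch. It multiplies out the complex product into four real multiplications and two real additions, then forms the sum and the difference. The function name `radix2_butterfly` and its argument names are hypothetical, and this straightforward expansion is only one possible way of arranging the arithmetic.

```python
def radix2_butterfly(ar, ai, br, bi, wr, wi):
    """One radix-2 butterfly in real arithmetic.

    Inputs are the real and imaginary parts of A_n, B_n, and W_k.
    Returns ((AR, AI), (BR, BI)) for stage n+1.
    """
    # Complex product B_n * W_k expanded into real operations.
    pr = br * wr - bi * wi   # real part of the product
    pi = br * wi + bi * wr   # imaginary part of the product
    # Complex addition (1a') and complex subtraction (1b').
    a_next = (ar + pr, ai + pi)
    b_next = (ar - pr, ai - pi)
    return a_next, b_next
```

For example, with A = 1 + 2j, B = 3 − 1j, and W = 0.5 + 0.5j, the product B·W is 2 + 1j, so `radix2_butterfly(1, 2, 3, -1, 0.5, 0.5)` yields `((3.0, 3.0), (-1.0, 1.0))`. In this form one butterfly costs four real multiplications and six real additions or subtractions.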
Typically, programmable architectures perform the FFT butterfly in 2 clock cycles. Other solutions accelerate the FFT by using a higher radix implementation of the algorithm on a Very Long Instruction Word (VLIW) machine. In either case, these architectures exhibit one or more of the following weaknesses: inferior performance, an inflexible hard-wired architecture, the need for a large register set and therefore wider instruction words, and/or the use of a higher FFT radix, all of which sacrifice flexibility in the size of the FFT operation performed (herein, “size” of the FFT refers to the value N of the N-point FFT algorithm, where N is the integer number of input/output data points).
FIG. 1 illustrates a data structure associated with an N-point FFT where N is eight. In FIG. 1, each of circles 102(a)-(d), 103(a)-(d), and 104(a)-(d) represents a butterfly calculation, and the input complex data points from memory are numbered 0 through 7. A butterfly calculation is performed, for example, by 102(a) on input data points 0 and 4. In all stages 0, 1, and 2 (i.e., all stages except the last stage 3), each butterfly calculation receives inputs from, and provides outputs to, non-adjacent memory addresses. The input data gathering and result data scattering illustrated by FIG. 1 complicate efficient processing of an FFT under the constraints of practical circuit design.
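The gathering and scattering pattern can be illustrated by listing which memory indices each butterfly touches. The sketch below assumes a conventional in-place radix-2 layout in which the distance between the two inputs of a butterfly halves from one stage to the next, starting at N/2. This matches the pairing of data points 0 and 4 described for 102(a), but the remaining wiring and the stage numbering of FIG. 1 are not given in the text, so the function `butterfly_pairs` and its zero-based stage order are assumptions of the sketch, not a description of the figure.

```python
def butterfly_pairs(n):
    """List the (first, second) input indices of every butterfly, per stage.

    Assumption: in-place radix-2 layout with the index distance (span)
    starting at n // 2 and halving each stage; n is a power of two.
    """
    stages = []
    span = n // 2
    while span >= 1:
        pairs = []
        # Each group of 2 * span consecutive addresses holds span butterflies.
        for start in range(0, n, 2 * span):
            for i in range(start, start + span):
                pairs.append((i, i + span))
        stages.append(pairs)
        span //= 2
    return stages


for stage, pairs in enumerate(butterfly_pairs(8)):
    print(stage, pairs)
```

Under these assumptions, only the stage with a span of one reads two adjacent addresses. Every other stage must gather its two operands from locations that are 2 or 4 addresses apart and scatter its results back to the same locations.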