Fourier transformation is a well-known technique for analyzing time-varying signals. In simple terms, the Fourier transformation converts a signal from a time-varying format to a frequency-varying format. The inverse Fourier transform performs the opposite conversion. When a signal is expressed in discrete form by a series of successive signal samples taken at regular time periods, the corresponding Fourier transformation is referred to as the discrete Fourier transform (DFT).
At a relatively high level, the DFT is a simple algorithm. It consists of stepping through the digitized data points of an input function, multiplying each data point by sine and cosine functions, and summing the resulting products in two accumulators, one for the sine component and one for the cosine component. When every data point has been processed in this manner, the sine and cosine accumulators are divided by the number of data points processed. The resulting quantities are the average values of the sine and cosine components at the frequency currently being investigated. This process is repeated for all integer multiple frequencies up to the frequency equal to twice the Nyquist frequency.
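The accumulate-and-average procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `dft_accumulate` and the test signal are assumptions introduced here, and the loop is deliberately the slow, direct form.

```python
import math

def dft_accumulate(samples):
    """Return (cos_avg, sin_avg): one averaged component pair per frequency.

    Hypothetical helper for illustration; it follows the textual recipe
    literally (multiply by sine and cosine, accumulate, divide by N).
    """
    n = len(samples)
    cos_avg, sin_avg = [], []
    for freq in range(n):                      # every integer multiple frequency
        cos_acc = 0.0                          # cosine accumulator
        sin_acc = 0.0                          # sine accumulator
        for t, value in enumerate(samples):    # step through the data points
            angle = 2.0 * math.pi * freq * t / n
            cos_acc += value * math.cos(angle)
            sin_acc += value * math.sin(angle)
        cos_avg.append(cos_acc / n)            # average over the N data points
        sin_avg.append(sin_acc / n)
    return cos_avg, sin_avg

# Assumed test signal: exactly one cosine cycle across 8 samples.
signal = [math.cos(2.0 * math.pi * t / 8) for t in range(8)]
cos_avg, sin_avg = dft_accumulate(signal)
```

For this signal the cosine average is 0.5 at frequency indices 1 and 7 and essentially zero elsewhere, which shows how the energy of a real sinusoid is split between a frequency and its mirror image.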
In more formal terms, the DFT and inverse DFT are defined as follows:
$$F(f) = \frac{1}{N}\sum_{T=0}^{N-1} f(T)\,W_N^{fT} \qquad (1)$$
$$f(T) = \sum_{f=0}^{N-1} F(f)\,W_N^{-fT} \qquad (2)$$
where
F(f) = frequency components or transform
f(T) = time base data points or inverse transform
N = number of data points
T = discrete times
f = discrete frequencies
$W_N = e^{-j2\pi/N} = \cos(2\pi/N) - j\sin(2\pi/N) \equiv$ "twiddle factor"
Thus, the twiddle factor is a complex number, and in the general case, both the frequency domain and the time domain functions may be complex numbers. Multiplication of two complex quantities yields the following terms:
$$(A+jB)(C+jD) = AC + jAD + jBC - BD = (AC-BD) + j(AD+BC) \qquad (3)$$
The term (A+jB) may be viewed, for example, as the time domain function, and the term (C+jD) may be viewed as $W_N$, i.e., $C = \cos(2\pi/N)$ and $D = -\sin(2\pi/N)$, so that $W_N = \cos(2\pi/N) - j\sin(2\pi/N)$.
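Equation (3) shows that one complex multiplication costs four real multiplications and two real additions. The sketch below spells that out, using the twiddle factor definition $W_N = \cos(2\pi/N) - j\sin(2\pi/N)$ given earlier. The function names are assumptions made for illustration.

```python
import math

def complex_multiply(a, b, c, d):
    """Compute (a + jb)(c + jd) as (real, imag) per equation (3).

    Uses four real multiplications and two real additions/subtractions.
    """
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag

def twiddle(n):
    """Return W_N as (C, D) with C = cos(2*pi/N) and D = -sin(2*pi/N)."""
    angle = 2.0 * math.pi / n
    return math.cos(angle), -math.sin(angle)

# Multiply an assumed time-domain sample (3 + 4j) by W_8.
c, d = twiddle(8)
product = complex_multiply(3.0, 4.0, c, d)
```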
The practical problem with the DFT is that it takes so long to compute. In fact, executing a DFT requires performing on the order of $N^2$ complex operations for N data points. A complex operation includes evaluating sine and cosine functions, multiplying by the data point, and adding those products. This problem is particularly troublesome in applications where there may be tens of thousands of data points to transform in "real time." On the other hand, if the number of data points is reduced, the number of operations is reduced as the square of that reduction. Thus, splitting the data sequence into two equal parts and processing each part separately saves about half the operations, since $2(N/2)^2 = N^2/2$. This is the approach used to develop the fast Fourier transform (FFT). The input data array is divided into smaller and smaller arrays to reduce the amount of computation, and the transform results are then recombined using a characteristic crossover pattern called a "butterfly," which is really a small FFT. The size of the butterflies in an FFT is called the FFT's "radix" (R). Thus, if a large DFT is replaced by multiple small DFTs, e.g., butterflies with a size of 2 or 4, the number of complex operations is substantially reduced. Although the number of operations decreases as the DFT size is reduced, each level of size reduction "costs" on the order of N operations to recombine the results, and there are on the order of log N such levels; hence the familiar N log N computational complexity of the FFT.
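The scaling argument can be made concrete with a rough operation count. The sketch below is an order-of-magnitude model only: the constant factors (one operation per term, N operations per recombination level) are assumptions chosen for simplicity, and the function names are hypothetical.

```python
def direct_dft_operations(n):
    """Order-of-magnitude cost of a direct N-point DFT: N * N."""
    return n * n

def split_once_operations(n):
    """Cost after one split: two half-size DFTs plus about N to recombine."""
    half = n // 2
    return 2 * half * half + n

def fft_operations(n):
    """Cost when splitting all the way down: N per level, log2(N) levels."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    levels = n.bit_length() - 1        # exact log2 for a power of two
    return n * levels

for n in (8, 1024, 65536):
    print(n, direct_dft_operations(n), split_once_operations(n), fft_operations(n))
```

Under these assumptions, a 1024-point transform drops from 1,048,576 operations to 525,312 after a single split and to 10,240 when the split is applied at every level.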
FIG. 1A illustrates an example FFT butterfly signal flow diagram for an N=8 data point array D0-D7. The even data points D0, D2, D4, and D6 are input to a first 4-point DFT (half the size of an 8-point DFT), and the odd data points D1, D3, D5, and D7 are input to a second 4-point DFT. The outputs of the two 4-point DFTs are combined to generate the 8-point sequence corresponding to an 8-point DFT by repeating each set of four frequency components a second time and then summing the even and odd sets together. However, before the summation, the odd DFT frequency components must be phase shifted because the odd terms in the time domain were shifted by one data point. The phase shift is indicated by the various blocks and ranges from zero to $2\pi$ radians in increments of $\pi/4$ radians.
This divide-and-conquer approach can be extended as shown in FIG. 1B, where each of the 4-point DFTs is split into two 2-point DFTs. The four 2-point DFTs must then be combined into two 4-point DFTs, which are in turn combined as described above into a single 8-point DFT. The total processing time is again reduced by almost half.
Accordingly, the 8-point FFT input data is divided into subsets of only two or four data points upon which two or four point discrete Fourier transforms are performed. The transform outputs are multiplied by appropriate "twiddle factors," and then subjected to further two or four point Fourier transformation.
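The even/odd split, phase shift, and recombination described for FIGS. 1A and 1B correspond to a recursive radix-2 decimation-in-time FFT. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the division by N mentioned earlier is omitted (the result is unnormalized), and the recursion is written for clarity rather than speed.

```python
import cmath

def fft_radix2(x):
    """Unnormalized radix-2 decimation-in-time FFT (length a power of two)."""
    n = len(x)
    if n <= 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])             # half-size DFT of D0, D2, D4, ...
    odd = fft_radix2(x[1::2])              # half-size DFT of D1, D3, D5, ...
    out = [0j] * n
    for k in range(n // 2):
        # Phase-shift the odd branch by the twiddle factor W_N^k.
        shifted = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + shifted         # first half of the spectrum
        out[k + n // 2] = even[k] - shifted  # second half reuses both branches
    return out

def dft_direct(x):
    """Direct O(N^2) DFT, used only to check the recursive version."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

data = [1, 2, 3, 4, 0, -1, -2, -3]
fast = fft_radix2(data)
slow = dft_direct(data)
```

The subtraction in the second output line reflects the fact that the twiddle factor for index k + N/2 is the negative of the twiddle factor for index k, which is why each half-size result can be used twice.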
FFT computations of high speed digital signals in real time are important for many signal processing systems and applications. Examples of high speed FFT applications include asymmetrical digital subscriber line (ADSL), digital audio broadcasting (DAB), digital video broadcasting (DVB), multi-carrier modulation (MCM) schemes such as orthogonal frequency division multiplexing (OFDM), sonar, radar, block-based filtering and fast convolution, decimated filter banks, equalizers for magnetic storage, echo cancellers, and multi-path equalization. FFT processors also find application, for example, in digital mobile cellular radio systems, where both power consumption and IC chip size should be minimized. The more power consumed, the more heat produced. There is an upper limit on chip size, and there is also an upper limit on how much power can be used in a specific IC encapsulation. Reduced power consumption makes it possible to use cheaper IC encapsulation of the chip. These are among the most important factors to consider in building a one-chip processing device such as an FFT processor.
While processing speed is of course important, e.g., for real-time applications, power consumption also increases with the number of multiplications, additions, and register operations performed. IC chip area increases with the number of hardware components, such as multipliers, adders, and registers, that are used. The goal of the present invention is to minimize the number of components and the number of operations performed, and thereby to minimize IC chip area and power consumption.
There have been many different approaches to increasing speed and/or minimizing power consumption and IC chip area requirements. One of the most successful approaches is to pipeline the process. A pipelined processor divides the computing load into successive stages, allowing parallel processing. In essence, pipeline operation enables a partial result, obtained from a preceding stage of the processor, to be used immediately in a following stage without delay. For continuous operation, the processing speed of a real-time, pipelined processor must match the input data rate, i.e., the data acquisition speed. This means that a pipelined FFT processor must compute an N-length DFT in N clock cycles, since the data acquisition speed is one sample per cycle.
One proposed pipelined FFT architecture for very large scale integration (VLSI) is disclosed in WO 97/19412, published May 29, 1997 in the name of Shousheng He. The proposed pipelined FFT architecture is a single-path, delay-feedback (SDF), radix-2 FFT in which twiddle factors are decomposed to form a radix-4 structure. A radix-$2^2$ algorithm has the same multiplicative complexity as a radix-4 algorithm, but retains a radix-2 butterfly structure.
The mathematical details of how He decomposes the total multipliers into trivial and non-trivial multipliers are described in WO 97/19412. Architecturally, a real-time, pipeline FFT processor like He's is shown in FIG. 2A for 256 data points, i.e., N=256. More specifically, the input data sequence is passed to the first of a pair of butterfly units 9 and 10. A 128-word feedback register 1 links the output of butterfly 9 to its input. The second butterfly unit 10 has a 64-word feedback register 2. A multiplier 17 links the first stage of the processor, comprising butterfly units 9 and 10, to the second stage of the processor, comprising butterfly units 11 and 12, and multiplies the data stream by a twiddle factor W1(n). The structure of butterfly units 9, 11, 13, and 15 differs from that of butterfly units 10, 12, 14, and 16, as illustrated in FIGS. 2B and 2C, respectively. Butterfly units 11 and 12 are provided with feedback registers 3 and 4 having a 32-word and a 16-word capacity, respectively. A multiplier 17, located between the second and third stages of the processor, multiplies the data stream by a twiddle factor W2(n). The third stage of the processor comprises butterfly units 13 and 14, 8-word feedback register 5, and 4-word feedback register 6. A multiplier 17, located between the third and fourth stages of the processor, multiplies the data stream by a twiddle factor W3(n). The fourth stage of the processor comprises butterfly units 15 and 16, with 2-word feedback register 7 and 1-word feedback register 8. The output sequence X(k) is derived from the output of the fourth stage of the processor. The binary counter 18, clocked by a clock signal 19, acts as a synchronization controller and address counter for the twiddle factors used between each stage of the processor.
The type BF2I butterfly, illustrated in FIG. 2B, includes two adders 21, two subtractors 22, and four multiplexers 23. Operation of the multiplexers is controlled by control signal 27. The type BF2II butterfly, illustrated in FIG. 2C, is similar in construction to the type BF2I butterfly, but includes a 2×2 commutator 26 and a logic gate 24, i.e., an AND gate with one inverted input. Control signal 25 is applied to the inverted input of AND gate 24, and control signal 27, which is also applied to the multiplexers 23, is applied to the non-inverted input of AND gate 24. The output from AND gate 24 drives commutator 26.
The operation of the radix-$2^2$ single delay feedback FFT processor in FIG. 2A is as follows. During the first N/2 cycles, the 2-to-1 multiplexers 23 in the first butterfly module switch to position "0," and the butterfly is idle. The input data from the left is directed to the feedback shift registers until they are filled. During the next N/2 cycles, the multiplexers 23 turn to position "1," and the butterfly unit computes a 2-point DFT with the incoming data and the data stored in the shift registers:
$$Z1(n) = x(n) + x(n+N/2) \qquad (4)$$
$$Z1(n+N/2) = x(n) - x(n+N/2) \qquad (5)$$
where $0 \le n < N/2$.
The butterfly output Z1(n) is sent on for twiddle factor multiplication, and Z1(n+N/2) is sent back to the shift registers to be "multiplexed" out during the next N/2 cycles, when the first half of the next frame of the time sequence is loaded.
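The fill, compute, and drain behavior of a single delay-feedback butterfly stage can be modeled in software. The sketch below is a behavioral model only, not a description of the circuit in FIG. 2B: the function name `sdf_stage`, the use of a `deque` as the shift register, and the zero-initialized register are all assumptions made for illustration.

```python
from collections import deque

def sdf_stage(stream, depth):
    """Behavioral model of one single delay-feedback butterfly stage.

    `depth` is the feedback register length (N/2 for the first stage).
    One input sample is consumed and one output word produced per cycle.
    """
    register = deque([0] * depth)          # feedback shift register
    out = []
    for cycle, sample in enumerate(stream):
        delayed = register.popleft()       # oldest word leaves the register
        if (cycle // depth) % 2 == 0:
            # Multiplexers at "0": butterfly idle. Stored words drain to the
            # output while the incoming data fills the register.
            out.append(delayed)
            register.append(sample)
        else:
            # Multiplexers at "1": 2-point DFT of stored and incoming data.
            out.append(delayed + sample)       # Z1(n), equation (4)
            register.append(delayed - sample)  # Z1(n + N/2), equation (5)
    return out

# One 8-point frame followed by four zeros standing in for the next frame.
frame = [1, 2, 3, 4, 5, 6, 7, 8]
result = sdf_stage(frame + [0, 0, 0, 0], depth=4)
```

In this model the sums of equation (4) appear during cycles 4 to 7, and the differences of equation (5) appear during cycles 8 to 11, while the register is being refilled with the first half of the following frame.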
The operation of the second butterfly is similar to that of the first one, except that the "distance" of the butterfly input sequence is just N/4, and the trivial twiddle factor multiplication is implemented by real-imaginary swapping in commutator 26 and by controlled add/subtract operations. This requires a two-bit control signal, i.e., control signals 25 and 27, from the synchronizing counter 18. The data then passes through a full complex multiplier 17, working at 75% utilization, to produce the results of the first level of the radix-4 FFT word by word. Further processing repeats this pattern, with the distance of the input data decreasing by half at each consecutive butterfly stage. After N-1 clock cycles, the complete DFT transform result X(k) is output in bit-reversed order. The next frame of the transform is then processed without pausing because of the pipelined processing at each stage of the processor.
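The halving butterfly distance and the bit-reversed output order can both be seen in a functional software model. The sketch below is a plain radix-2 decimation-in-frequency FFT, which is a swapped-in simplification: it does not reproduce He's radix-$2^2$ twiddle factor decomposition, and it is not cycle accurate. The function names are assumptions.

```python
import cmath

def dif_fft_bit_reversed(x):
    """Unnormalized radix-2 decimation-in-frequency FFT.

    The butterfly distance starts at N/2 and halves at each stage.
    The result is left in bit-reversed order, with no reordering pass.
    """
    data = [complex(v) for v in x]
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    distance = n // 2
    while distance >= 1:
        block = 2 * distance
        for start in range(0, n, block):
            for i in range(distance):
                a = data[start + i]
                b = data[start + i + distance]
                data[start + i] = a + b                       # sum branch
                twiddle = cmath.exp(-2j * cmath.pi * i / block)
                data[start + i + distance] = (a - b) * twiddle  # difference branch
        distance //= 2
    return data

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

samples = [1, 2, 3, 4, 0, -1, -2, -3]
output = dif_fft_bit_reversed(samples)
# Frequency component X(k) is found at position bit_reverse(k, 3).
natural_order = [output[bit_reverse(k, 3)] for k in range(8)]
```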
The WO 97/19412 application to He contends that this radix-$2^2$ SDF FFT processor architecture is optimal for pipelined FFT computation. However, even greater reductions in FFT processor IC area and power consumption may be achieved using the present invention.
The computation of a large DFT using multiple, small DFTs (i.e., the FFT, divide-and-conquer principle) is a multi-stage process that may be implemented in an iterative or a pipelined architecture. Even though this divide-and-conquer strategy saves computations, there is an increased number of complex twiddle factor multiplications performed between the smaller stages. Two-point or four-point DFTs/butterflies are desirable from the standpoint that the twiddle factor multiplications performed in each butterfly are trivial because the multiplier coefficients are simply $\pm 1$ or $\pm j$. In other words, complex number multiplication circuits are not needed for the individual transformations in the two-point or four-point DFTs; only "trivial" multiplications are required. Trivial multiplications are performed without multipliers simply by passing the data through with no operation, by changing a sign, or by switching real and imaginary components. Nontrivial, computationally expensive multiplications are thereby avoided within the butterflies. However, a substantial number of nontrivial, complex number multiplications are still necessary for twiddle factor multiplications between the 2-point and 4-point DFT butterfly stages in the FFT.
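The three kinds of trivial multiplication (pass through, sign change, real/imaginary swap) can be written without any multiply operation. The sketch below covers the sign variants of 1 and j; the function name and the string encoding of the coefficient are assumptions made for illustration.

```python
def multiply_trivial(real, imag, coefficient):
    """Multiply (real + j*imag) by +1, -1, +j, or -j without a multiplier."""
    if coefficient == "+1":
        return real, imag          # pass through, no operation
    if coefficient == "-1":
        return -real, -imag        # change both signs
    if coefficient == "+j":
        return -imag, real         # swap components, negate the new real part
    if coefficient == "-j":
        return imag, -real         # swap components, negate the new imaginary part
    raise ValueError("coefficient must be one of +1, -1, +j, -j")

rotated = multiply_trivial(3, 4, "-j")
```

Here `rotated` equals `(4, -3)`, matching $(3 + 4j)(-j) = 4 - 3j$.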
Conventional thinking holds that as the radix of the FFT (i.e., the size of the basic DFT computational unit) increases to 8, 16, and beyond (i.e., the transform is divided into 8, 16, or more branches in the divide-and-conquer method), the computational "cost" in terms of nontrivial multiplications that must be performed within each high radix butterfly (the butterfly in a higher radix FFT) increases, effectively canceling the gains obtained from the decreased number of twiddle factor multipliers between the butterflies. The inventor discovered that this is not necessarily true. Contrary to that conventional thinking, the present invention provides a very powerful and IC chip area efficient FFT processor using a relatively small number of low power, fixed coefficient multipliers in FFTs having radixes greater than 4. The use of only a relatively small number of fixed coefficient multipliers is achieved by taking advantage of certain twiddle factor relationships (explained in the detailed description below).
Thus, the present invention pertains to fast Fourier transform (FFT) processors of higher radixes that use only minimal integrated circuit chip area and minimal power to perform fast Fourier transform operations efficiently. Preferably, the present invention is employed in any FFT architecture having a radix greater than 4. The example embodiments use butterfly modules having sizes of 8 or 16 in the context of a real-time, pipeline FFT processor architecture. For a radix-8 implementation, the FFT processor is constructed using radix-$2^3$ butterfly processing modules. For a radix-16 implementation, the FFT processor is constructed using radix-$2^4$ butterfly processing modules.
In the radix-8 example embodiment implemented as a radix-$2^3$, each butterfly module is implemented using three 2-point butterfly units coupled together in pipeline fashion. An input data sequence is applied to an input of one of the three butterfly units and processed through the three pipelined butterfly units to generate a Fourier transformed data sequence. Each butterfly unit includes a single delay feedback register. Of the three nontrivial twiddle factor multiplications required for a radix-8 butterfly module (each of which is performed multiple times in the butterfly), the present invention implements all three using only one fixed coefficient multiplier circuit. In the radix-16 example embodiment implemented as a radix-$2^4$, each butterfly module is implemented using four 2-point butterfly units coupled together in pipeline fashion. Of the nine nontrivial twiddle factor multiplications required for a radix-16 butterfly module (each of which is performed multiple times in the butterfly), the present invention implements all nine using only two fixed coefficient multiplier circuits.
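The specific twiddle factor relationships that the invention exploits are set out in the detailed description, which is outside this section. As a hedged illustration of the general idea only, and not as a description of the claimed circuit, the sketch below uses a well-known property of the 8-point twiddle factors: multiplying by $W_8^1$ or $W_8^3$ needs only real scaling by the single constant $\sqrt{2}/2$ plus additions and subtractions. The function names and the constant name are assumptions.

```python
import math

SQRT_HALF = math.sqrt(0.5)   # the single fixed coefficient, sqrt(2)/2

def times_w8_1(real, imag):
    """Multiply (real + j*imag) by W_8^1 = (1 - j) * sqrt(2)/2."""
    return SQRT_HALF * (real + imag), SQRT_HALF * (imag - real)

def times_w8_3(real, imag):
    """Multiply (real + j*imag) by W_8^3 = (-1 - j) * sqrt(2)/2."""
    return SQRT_HALF * (imag - real), -SQRT_HALF * (real + imag)

first = times_w8_1(3.0, 4.0)
third = times_w8_3(3.0, 4.0)
```

Both functions scale by the same constant, so a hardware realization along these lines could share one fixed coefficient multiplier in place of a general complex multiplier with a coefficient memory.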