Fourier transforms that apply to signals that are discrete and periodic are known as discrete Fourier transforms (DFTs). The DFT plays a key role in digital signal processing, in areas such as spectral analysis, frequency-domain filtering and polyphase transformations.
The DFT of a sequence of length N can be decomposed into successively smaller DFTs. Implementations of this principle fall into two classes. The first is called “decimation in time” and the second “decimation in frequency”. The first derives its name from the fact that, in arranging the computation into smaller transformations, the sequence “x(n)” (the index ‘n’ is often associated with time) is decomposed into successively smaller subsequences. In the second class, the sequence of DFT coefficients “X(k)” is decomposed into smaller subsequences (k denoting frequency). Embodiments of the present invention employ “decimation in time”.
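The decomposition described above can be illustrated with a minimal sketch of a single level of decimation in time. The function names `dft` and `dft_by_time_decimation` are hypothetical, and the sign convention W_N = exp(−2πj/N) is an assumption, since the text does not fix one.

```python
import cmath

def dft(x):
    """Direct DFT from the definition; used here as the reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def dft_by_time_decimation(x):
    """One level of decimation in time for an even-length sequence.

    x(n) is split into its even-indexed and odd-indexed subsequences,
    each half-length DFT is computed, and the two are recombined.
    """
    n_pts = len(x)
    if n_pts % 2 != 0:
        raise ValueError("sequence length must be even")
    half = n_pts // 2
    even = dft(x[0::2])   # DFT of x(0), x(2), x(4), ...
    odd = dft(x[1::2])    # DFT of x(1), x(3), x(5), ...
    out = [0j] * n_pts
    for k in range(half):
        w = cmath.exp(-2j * cmath.pi * k / n_pts)  # assumed twiddle convention
        out[k] = even[k] + w * odd[k]
        out[k + half] = even[k] - w * odd[k]
    return out
```

Applying the same split repeatedly to each half-length DFT yields the successively smaller subsequences referred to above.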
Since the amount of data stored and processed in numerical computation algorithms is proportional to the number of arithmetic operations, it is generally accepted that a meaningful measure of complexity, or of the time required to implement a computational algorithm, is the number of multiplications and additions required. The direct computation of the DFT requires “4N²” real multiplications and “N(4N−2)” real additions. Since the amount of computation, and thus the computation time, is approximately proportional to “N²”, the number of arithmetic operations required to compute the DFT by the direct method becomes very large for large values of “N”. For this reason, computational procedures that reduce the number of multiplications and additions are of considerable interest. The Fast Fourier Transform (FFT) is an efficient algorithm for computing the DFT.
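The operation counts quoted above can be checked with a small sketch that evaluates the DFT directly using only real arithmetic and tallies each real multiplication and addition. The function name `direct_dft_counted` is hypothetical, and the sketch assumes the twiddle values come from a precomputed table, so the cost of generating them is not counted.

```python
import math

def direct_dft_counted(x):
    """Direct DFT in real arithmetic, returning (output, real_mults, real_adds)."""
    n_pts = len(x)
    mults = 0
    adds = 0
    out = []
    for k in range(n_pts):
        acc_re = 0.0
        acc_im = 0.0
        for n in range(n_pts):
            # Twiddle value; assumed to be a table lookup, so not counted.
            angle = -2.0 * math.pi * k * n / n_pts
            w_re, w_im = math.cos(angle), math.sin(angle)
            sample = complex(x[n])
            # One complex multiply: 4 real multiplications, 2 real additions.
            p_re = sample.real * w_re - sample.imag * w_im
            p_im = sample.real * w_im + sample.imag * w_re
            mults += 4
            adds += 2
            if n == 0:
                acc_re, acc_im = p_re, p_im   # first term: nothing to add yet
            else:
                acc_re += p_re                # one complex add: 2 real additions
                acc_im += p_im
                adds += 2
        out.append(complex(acc_re, acc_im))
    return out, mults, adds
```

For N = 8 this tallies 256 real multiplications (4N²) and 240 real additions (N(4N−2)), matching the expressions above.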
The conventional method of implementing an FFT or Inverse Fast Fourier Transform (IFFT) uses a radix-2, radix-4 or mixed-radix approach with either “decimation in time” (DIT) or “decimation in frequency” (DIF).
The basic computational block is called a “butterfly”, a name derived from the appearance of the flow of the computations involved in it. FIG. 1 shows a typical radix-2 butterfly computation. 1.1 represents the 2 inputs (referred to as the “odd” and “even” inputs) of the butterfly and 1.2 refers to the 2 outputs. One of the inputs (in this case the odd input) is multiplied by a complex quantity called the twiddle factor (W_N^k). The general equations describing the relationship between inputs and outputs are as follows:
X[k] = x[n] + x[n + N/2] W_N^k
X[k + N/2] = x[n] − x[n + N/2] W_N^k
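The butterfly relationship can be written as a short sketch. The names `twiddle` and `radix2_butterfly` are hypothetical, and the convention W_N^k = exp(−2πjk/N) is an assumption, since the text does not define the twiddle factor explicitly.

```python
import cmath

def twiddle(k, n_pts):
    """Twiddle factor W_N^k under the assumed convention exp(-2*pi*j*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / n_pts)

def radix2_butterfly(even_in, odd_in, w):
    """Radix-2 butterfly: the odd input is scaled by the twiddle factor,
    then added to and subtracted from the even input."""
    scaled = odd_in * w           # the single complex multiply
    return even_in + scaled, even_in - scaled
```

As a usage example, `radix2_butterfly(1, 2, twiddle(0, 2))` computes the 2-point DFT of the sequence (1, 2), giving the sum 3 and the difference −1.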
An FFT butterfly calculation is implemented by a z-point data operation, wherein “z” is referred to as the “radix”. An “N” point FFT employs “N/z” butterfly units per stage (block) for “log_z N” stages. The result of one butterfly stage is applied as an input to one or more subsequent butterfly stages.
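The stage structure can be made concrete for the radix-2 case (z = 2) with a minimal sketch of an iterative decimation-in-time FFT that records how many butterflies each stage performs. The function names, the bit-reversed input ordering and the twiddle convention exp(−2πjk/N) are assumptions made for illustration, not details taken from the text.

```python
import cmath

def bit_reverse_permute(x):
    """Reorder x so that element i moves to the bit-reversed index of i."""
    n_pts = len(x)
    bits = n_pts.bit_length() - 1
    out = [0j] * n_pts
    for i in range(n_pts):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2)
        out[rev] = x[i]
    return out

def fft_radix2(x):
    """Iterative radix-2 DIT FFT; returns (spectrum, butterflies_per_stage)."""
    n_pts = len(x)
    if n_pts < 2 or (n_pts & (n_pts - 1)) != 0:
        raise ValueError("length must be a power of two, at least 2")
    data = bit_reverse_permute(x)
    butterflies_per_stage = []
    span = 1  # distance between the two inputs of a butterfly in this stage
    while span < n_pts:
        count = 0
        for start in range(0, n_pts, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                a = data[start + j]
                b = data[start + j + span] * w
                data[start + j] = a + b          # output feeds the next stage
                data[start + j + span] = a - b
                count += 1
        butterflies_per_stage.append(count)
        span *= 2
    return data, butterflies_per_stage
```

For N = 16 the returned list has log_2 16 = 4 entries, each equal to N/2 = 8, which is the N/z butterflies per stage over log_z N stages described above.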
The computational complexity of an N-point FFT calculation using the radix-2 approach is O((N/2) log_2 N), where “N” is the length of the transform. There are exactly “(N/2) log_2 N” butterfly computations, each comprising 3 complex loads, 1 complex multiply, 2 complex adds and 2 complex stores. A full radix-4 implementation, on the other hand, requires several complex load/store operations. Since only 1 store operation and 1 load operation are allowed per bundle on a typical VLIW processor of the kind normally used for such implementations, cycles are wasted performing only load/store operations, which reduces instruction-level parallelism (ILP). The conventional nested-loop approach incurs a high looping overhead on the processor and makes standard optimization methods difficult to apply. Because of the data dependencies in conventional FFT/IFFT implementations, multi-cluster processor configurations do not provide much benefit in terms of computational cycles.
Although the FFT reduces the number of complex calculations, the time taken on a normal processor can still be quite large. In many applications requiring high-speed or real-time response it is therefore necessary to resort to multiprocessing in order to reduce the overall computation time. For efficient operation, it is desirable for the computation to be as linearly scalable as possible; in other words, the computation time should fall in inverse proportion to the number of processors in the multiprocessing solution. Current multiprocessing implementations of FFT/IFFT, however, do not provide such linear scalability.
U.S. Pat. No. 6,366,936 describes a multiprocessor approach for efficient FFT. The approach defined is a pipelined process wherein each processor is dependent on the output of the preceding processor in order to perform its share of work. The increase in throughput does not scale proportionately to the number of processors employed in the operation.
U.S. Pat. No. 5,293,330 describes a pipelined processor for mixed size FFT. Here too, the approach does not provide proportional scalability in throughput, as it is pipelined.
A scheme for parallel FFT/IFFT described in “Parallel 1-D FFT Implementation with TMS320C4x DSPs”, by the Texas Instruments Semiconductor Group, uses butterflies that are distributed between two processors. In this implementation, inter-processor communication is required because subsequent computations on one processor depend on intermediate results from other processors. Every processor computes a butterfly operation on each of the butterfly pairs allocated to it, sends half of its computed result to the processor that needs it for the next computation step, and then waits for information of the same length to arrive from another node before continuing its computations. This interdependence of processors for a single butterfly computation does not support a proportionate increase in output as the number of processors increases.
Our co-pending application no. 1208/D/02 describes a linearly scalable FFT/IFFT system. The system incorporates a shared memory from which each processor accesses the correct data samples. The distribution is such that no inter-processor communication is required during the butterfly computation. However, inter-processor communication is still required between stages.
Although a shared memory system is simpler, it is not very economical, because it requires multi-port memories, which are very expensive. A distributed memory system is therefore more economical. A distributed memory architecture requires a medium to communicate data among the processors, so it is desirable to keep data communication among the processors to a minimum. Since the input data is distributed in equal-size segments to each processor, and each processor performs computations only on the data in its local memory, the memory requirement for each individual processor is reduced, resulting in a lower requirement for silicon area and a lower cost.