This invention relates generally to data processing and more particularly to digital signal processing applications employing fast Fourier transforms (FFTs).
In general, digital signal processing (DSP) tasks are performed either in the time domain or in the frequency domain. Time domain refers to the situation where data are continuously sampled and, for subsequent digital processing, quantized. Frequency domain refers to the situation where the original time-ordered sequence of data is transformed into a set of spectra, with specific frequencies and amplitudes. These data are also usually quantized for further processing. For many DSP tasks, dealing with spectra in the frequency domain yields more insight and provides a more tractable way of solving problems than dealing with sequentially ordered data points in the time domain. The issue then becomes how to most efficiently transform time domain data into the frequency domain.
The most common algorithm used to do this transformation is called the Fast Fourier Transform, or FFT.
The FFT is obtained by starting with a more general form called the Discrete Fourier Transform, and then subdividing the calculations until one is left with an irreducible kernel operation, most commonly operating on only two input data points. This operation is called a butterfly operation because of the shape of the signal flow graph used to describe it. In the simplest butterfly operation ("radix-2"), the two input data points have their sum and difference multiplied by a coefficient, or "twiddle" factor. The order of the two data points is then interchanged at the output. This interchange, or crossover, gives rise to the term "butterfly" when the calculation is drawn as a signal flow graph.
Time domain data is transformed into the frequency domain by successively applying the butterfly operation to the entire time domain data set two data points at a time, and repeating this process multiple times, or "passes," using the partially transformed results of the prior pass of calculation. The number of required passes of butterfly operations through the data set is equal to the log base radix of the number of data points in the set. For Radix-2 butterfly calculation, this is the log base 2. For a data set of 1024 points, the log base 2 of 1024 is 10 (2^10 = 1024), so 10 passes through the data and its intermediately calculated results are required in order to calculate a 1024 point ("1K") FFT. Each pass comprises 1024/2 = 512 butterfly operations, so transforming 1024 time domain data points requires 10 × 512 = 5120 Radix-2 butterfly operations.
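The pass and butterfly counts above can be checked with a short sketch. The Python below is illustrative only. It assumes a decimation-in-frequency form of the Radix-2 butterfly (the sum is passed through and the difference is multiplied by the twiddle factor) followed by a bit-reversal reordering of the results. Neither detail is specified in the text, and the function name `fft_radix2_dif` is hypothetical.

```python
import cmath


def fft_radix2_dif(samples):
    """Radix-2 FFT sketch (assumed decimation-in-frequency form).

    Returns (spectrum, passes, butterflies) so the counts can be inspected.
    """
    n = len(samples)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of 2, at least 2")

    data = [complex(v) for v in samples]
    passes = 0
    butterflies = 0
    span = n // 2  # distance between the two inputs of one butterfly

    while span >= 1:
        # One pass: every pair of points separated by `span` gets one butterfly.
        for start in range(0, n, 2 * span):
            for k in range(span):
                twiddle = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a = data[start + k]
                b = data[start + k + span]
                data[start + k] = a + b
                data[start + k + span] = (a - b) * twiddle
                butterflies += 1
        passes += 1
        span //= 2

    # This form leaves results in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    spectrum = [0j] * n
    for i in range(n):
        reversed_i = int(format(i, f"0{bits}b")[::-1], 2)
        spectrum[reversed_i] = data[i]
    return spectrum, passes, butterflies
```

Calling `fft_radix2_dif` on a 1024-point input reports 10 passes and 5120 butterfly operations, in agreement with the figures given above.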
Other radices are possible besides the most basic one, Radix-2. The physical interpretation is as follows. Radix-2 operates on two numbers at a time, Radix-4 is a higher order butterfly calculation that operates on four numbers at a time, Radix-16 is a still higher order butterfly calculation that operates on 16 numbers at a time, and so forth. The advantage of higher radix computation is that fewer passes through the data are required, because the pass-defining term (log base radix) is a smaller number. For example, a Radix-2 FFT of 256 points requires 8 passes through the data (2^8 = 256), while a Radix-16 evaluation of the FFT requires only two passes through the data (16^2 = 256). Therefore, if sufficient computational resource is present to permit evaluation of a Radix-16 butterfly in the equivalent time per clock per data point of a Radix-2 butterfly, the FFT calculation using the Radix-16 butterfly will be faster.
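The relationship between radix and pass count can be expressed in a few lines. This is a minimal sketch, assuming the number of data points is an exact power of the radix. The helper name `passes_required` is hypothetical.

```python
def passes_required(n_points, radix):
    """Return log base `radix` of `n_points`, i.e. the number of passes.

    Assumes n_points is an exact power of radix; raises ValueError otherwise.
    """
    if radix < 2 or n_points < 1:
        raise ValueError("radix must be >= 2 and n_points must be >= 1")
    passes = 0
    remaining = n_points
    while remaining > 1:
        if remaining % radix:
            raise ValueError("n_points is not a power of radix")
        remaining //= radix
        passes += 1
    return passes
```

For 256 points, `passes_required(256, 2)` gives 8 and `passes_required(256, 16)` gives 2, as in the example above.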
A general style of FFT calculation is called bypass form. In this form of calculation, there is a fixed datapath that performs the butterfly calculation, but the data are supplied through external memory. The datapath performs a fixed set of calculations (the butterfly operation) on the data presented to it. It is therefore necessary to properly sequence the data in these external memories for appropriate pass by pass FFT calculation by the datapath.
The Sharp® LH9124 FFT processor chip is an example of a bypass form processor that does FFTs. It has four data ports: Q, A, B, and C. Data are input into the Q or acquisition port. The results of the initial pass of butterfly calculation through the input data are output to port B. On the next pass of butterfly calculation, partially transformed data are read from the memory connected to Port B, processed, and output to the memory connected to Port A. Subsequent passes "ping-pong" the data between ports A and B. Port C is used to input coefficient data.
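The pass by pass port usage just described can be written out as a schedule. The sketch below only models which port is read and which is written on each pass. It does not model the coefficient input on Port C or any actual device interface, and the function name `port_schedule` is hypothetical.

```python
def port_schedule(num_passes):
    """List (source_port, destination_port) for each pass, per the text:
    pass 1 reads Q and writes B; later passes ping-pong between B and A.
    """
    if num_passes < 1:
        raise ValueError("at least one pass is required")
    schedule = [("Q", "B")]
    source = "B"
    for _ in range(num_passes - 1):
        destination = "A" if source == "B" else "B"
        schedule.append((source, destination))
        source = destination
    return schedule
```

For a four-pass calculation this yields Q to B, B to A, A to B, and B to A, so the final results reside in the memory on Port A when the pass count is even and on Port B when it is odd.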
The time taken for any calculation may be described either as latency or as throughput. Latency is the time it takes from an initial condition until valid output data are obtained. Throughput is the time between subsequent data outputs, once a system is up and running.
There are three main ways of improving the FFT speed of bypass based systems.
1. Increase the order of the radix, thus decreasing the number of required passes through the data. This approach improves both latency and throughput, but is costly in terms of computational resource required for each processing element.
2. Cascade (pipeline) the datapath processors, such that each processor is responsible for the calculation of one pass of the FFT only, and calculates that pass repetitively on different blocks of data. Cascading improves throughput, but not latency. The very first FFT will still have to be processed by the appropriate number of passes, or cascaded stages. Every FFT after the first one will be output in 1/N the time of the first, where N is the number of stages.
3. Parallel the datapath processors, such that a single large FFT is divided into N smaller FFTs. Each datapath processor is then dedicated to the calculation of an FFT that is 1/N the size of the original, although the number of passes appropriate for the original, larger data set is still required. Both latency and throughput are improved with this arrangement. Latency is improved because there are 1/N the number of original points on which any individual datapath processor has to operate. Throughput is improved for the same reason--there are fewer points for any individual datapath processor.
Note that cascading and parallel operation can be combined.
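The latency and throughput effects of cascading and parallel operation can be compared with an idealized timing model. The sketch below assumes that one processor takes a fixed `pass_time` to perform one pass over the full data set, that a processor holding 1/N of the points takes 1/N of that time, and that there is no overhead for data movement. These assumptions and all function names are illustrative and are not taken from the text.

```python
def single_processor(passes, pass_time):
    """One datapath performs every pass in turn.

    Returns (latency, interval between successive FFT outputs).
    """
    total = passes * pass_time
    return total, total


def cascaded(passes, pass_time):
    """One datapath per pass, pipelined.

    The first result still takes all passes; later results appear
    once per pass_time, i.e. in 1/N the time of the first for N stages.
    """
    return passes * pass_time, pass_time


def parallel(passes, pass_time, n_units):
    """N datapaths each process 1/N of the points, for every pass."""
    total = passes * pass_time / n_units
    return total, total
```

Under this model, a 10-pass FFT with a unit pass time has a latency of 10 and an output interval of 10 on a single processor, a latency of 10 and an interval of 1 when cascaded across 10 stages, and a latency and interval of 2.5 when divided across 4 parallel datapaths.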
It would be advantageous if, in carrying out FFT computations, lower latency and improved throughput could be realized using parallel datapaths, in which all the addresses used in like data ports of each parallel datapath are supplied from one address sequencer, thus simplifying the system connections.
It would also be advantageous if such advantages could be obtained in a system which permitted use of ordinary single-port SRAMs in a parallel datapath connection.
Accordingly, the present invention is a system for use in digital signal processing applications for processing N data points through Y processing stages using Z execution units, where Z is a power of 2, each execution unit having a plurality of I/O ports including A and B ports. The system for processing the data further includes an addressable memory unit for each A and B port on each execution unit, including memory units A(1) through A(Z) operatively connected to the A ports of execution units P(1) through P(Z), respectively, and including memory units B(1) through B(Z) operatively connected to the B ports of execution units P(1) through P(Z), respectively. The data is processed through the Y processing stages by moving data in selected sequences through each execution unit P(q) between memory units A(q) and B(q), and in which at the start of the processing of the data the N data points are distributed between memory units connected to one of the ports of the execution units, either A(1) through A(Z) if distributed to the memory units connected to the A ports of the execution units or B(1) through B(Z) if distributed to the memory units connected to the B ports of the execution units, each memory unit receiving N/Z data points, and each execution unit P(1) through P(Z) processing a block of N/Z data points through the Y processing stages, the processing occurring in parallel.
The system for processing data comprises a first address generator AG-A operatively connected to memory units A(1) through A(Z) for supplying the address sequences used in the Y processing stages to memory units A(1) through A(Z), and a second address generator AG-B operatively connected to memory units B(1) through B(Z) for supplying the address sequences used in the Y processing stages to memory units B(1) through B(Z). Address generator AG-A supplies the same address sequence to all A(1) through A(Z) memory units, and address generator AG-B supplies the same address sequence to all B(1) through B(Z) memory units, whereby the N data points are processed in Z parallel streams through the Z execution units.
A more specific embodiment of the invention which carries out digital signal processing applications for processing 1024 data points through Y processing stages using 4 execution units P(1) through P(4) is also disclosed. Each of the execution units has a plurality of I/O ports including A and B ports. The system for processing the data further includes an addressable memory unit for each A and B port on each execution unit, including memory units A(1) through A(4) operatively connected to the A ports of execution units P(1) through P(4), respectively, and including memory units B(1) through B(4) operatively connected to the B ports of execution units P(1) through P(4), respectively. The data is processed through the Y processing stages by moving data in selected sequences through each execution unit P(q) between memory units A(q) and B(q), and in which at the start of the processing of the data the 1024 data points are distributed between memory units connected to one of the ports of the execution units, either A(1) through A(4) if distributed to the memory units connected to the A ports of the execution units or B(1) through B(4) if distributed to the memory units connected to the B ports of the execution units, each memory unit receiving 256 data points, and each execution unit P(1) through P(4) processing a block of 256 data points through the Y processing stages, the processing occurring in parallel.
The system further comprises a first address generator AG-A operatively connected to memory units A(1) through A(4) for supplying the address sequences used in the Y processing stages to memory units A(1) through A(4), and a second address generator AG-B operatively connected to memory units B(1) through B(4) for supplying the address sequences used in the Y processing stages to memory units B(1) through B(4). Address generator AG-A supplies the same address sequence to all A(1) through A(4) memory units, and address generator AG-B supplies the same address sequence to all B(1) through B(4) memory units, whereby the 1024 data points are processed in 4 parallel streams of 256 data points each through the 4 execution units.
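The shared-address arrangement can be illustrated with a small software model. In the sketch below, the distribution of the 1024 points into four contiguous blocks of 256 and the particular address sequence used are assumptions made for illustration only. The summary does not state how the points are assigned to the memory units or what sequences the address generators produce. The names `distribute` and `broadcast_read` are hypothetical.

```python
def distribute(points, z):
    """Split the data into z memory units of len(points)/z points each.

    Assumed layout: contiguous blocks (not specified in the text).
    """
    n = len(points)
    if z < 1 or n % z:
        raise ValueError("number of points must be a multiple of z")
    block = n // z
    return [list(points[q * block:(q + 1) * block]) for q in range(z)]


def broadcast_read(memories, address_sequence):
    """Model one address generator driving every memory unit.

    Each address in the single shared sequence is applied to all
    memory units at once, yielding one value per execution unit.
    """
    return [tuple(memory[address] for memory in memories)
            for address in address_sequence]
```

With `memories = distribute(list(range(1024)), 4)`, each of the four memory units holds 256 points, and `broadcast_read(memories, [0, 255])` returns one tuple of four values per address, one value for each execution unit. A single sequence therefore serves all four parallel streams.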
The detailed description which follows also describes a system suitable for processing up to N data points through Y processing stages using 4 execution units.