The present invention relates broadly to discrete Fourier transform (DFT) processors and more particularly, to a multipoint pipeline processor of radix 2 for computing a discrete Fourier transform based on a combination of techniques derived from fast Fourier transform (FFT) and Winograd discrete Fourier transform (WDFT) type algorithms.
The calculation of the discrete Fourier transform (DFT), as denoted generally by the following equation

$$X(k)=\sum_{n=0}^{N-1} x(n)\,\exp[-j(2\pi/N)nk], \qquad k=0,1,\ldots,N-1 \tag{1}$$

is one of the central operations in digital signal processing. In equation (1) above, the term x(n) denotes an input sequence of points sampled from a waveform over a time interval comprising N samplings, wherein the index n is often termed the input index; similarly, the term X(k) denotes an output sequence of frequency harmonics corresponding to the discrete Fourier transformation of the N sampled points of the waveform, wherein the index k is often termed the output index. Because the DFT is such a powerful mathematical tool, it is often applied to systems which deal with signals that are discrete in time, such as Doppler spectrum analysis (Fourier analysis), digital filtering (convolution) and chirp filtering (correlation), for example.
The DFT equation (1) above may be more simply expressed in the form of equation (2) below

$$X(k)=\sum_{n=0}^{N-1} x(n)\,W^{nk} \tag{2}$$

where

$$W=\exp[-j(2\pi/N)]. \tag{3}$$
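Equation (2) is commonly expressed in the matrix representation

$$\begin{bmatrix} X(0)\\ X(1)\\ \vdots\\ X(N-1)\end{bmatrix} = \begin{bmatrix} W^{0} & W^{0} & \cdots & W^{0}\\ W^{0} & W^{1} & \cdots & W^{N-1}\\ \vdots & \vdots & & \vdots\\ W^{0} & W^{N-1} & \cdots & W^{(N-1)(N-1)}\end{bmatrix} \begin{bmatrix} x(0)\\ x(1)\\ \vdots\\ x(N-1)\end{bmatrix} \tag{4}$$

which is a linear transformation of the N-dimensional data vector x(n) into the vector X(k) of frequency samples. Assuming that the term x(n) is complex, the linear transformation of equation (4) requires N.sup.2 complex multiplications and N(N-1) complex additions. Consequently, the direct DFT becomes impractical as the number of points, N, increases, because the number of complex operations required grows as the square of N.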
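The operation counts quoted above can be checked with a small sketch that evaluates equation (2) directly and tallies the complex operations. This is only an illustration: the function name `dft_direct` and the convention of counting the trivial multiplication by W.sup.0 are assumptions made here, not part of the original text.

```python
import cmath


def dft_direct(x):
    """Evaluate X(k) = sum_n x(n) * W**(n*k) directly, with W = exp(-j*2*pi/N).

    Returns the output list together with the number of complex
    multiplications and complex additions performed. Multiplications by
    W**0 are counted too, which is the convention behind the N**2 figure.
    """
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    mults = 0
    adds = 0
    X = []
    for k in range(N):
        acc = x[0] * W ** 0          # first term of the sum: one multiplication
        mults += 1
        for n in range(1, N):
            acc += x[n] * W ** (n * k)  # one multiplication and one addition
            mults += 1
            adds += 1
        X.append(acc)
    return X, mults, adds


if __name__ == "__main__":
    X, mults, adds = dft_direct([1, 0, 0, 0])
    print(mults, adds)  # N**2 and N*(N-1) for N = 4
```

For N=4 the tallies come out to N.sup.2 =16 multiplications and N(N-1)=12 additions, in agreement with the text.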
In 1965, a paper entitled "An Algorithm for the Machine Calculation of Complex Fourier Series" by Cooley and Tukey published in Math. Comput., vol. 19, pp. 297-301, (April, 1965), had a major impact on signal processing by stimulating the development and widespread use of what is commonly termed the fast Fourier transform (FFT). The FFT algorithms use the relationship

$$W^{nk}=W^{nk \bmod N} \tag{5}$$
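where [nk mod (N)] is the remainder of the division of nk by N. For example, if N=4=2.sup.2, then when n=2 and k=3, W.sup.6 =W.sup.2. For convenience, to illustrate the FFT algorithm the number of sample points, N, is chosen as a power of 2, say for example, N=4. The first step in the development of the FFT algorithm for the present example is to rewrite equation (4) as

$$\begin{bmatrix} X(0)\\ X(1)\\ X(2)\\ X(3)\end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & W^{1} & W^{2} & W^{3}\\ 1 & W^{2} & W^{0} & W^{2}\\ 1 & W^{3} & W^{2} & W^{1}\end{bmatrix} \begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\end{bmatrix} \tag{6}$$

where each element in the W square matrix is replaced with its mod (N) equivalent. Thereafter, the second step involves a matrix factorization, peculiar to the theory of the FFT algorithm, which may include at least one row interchange and a permutation of at least one of the output and/or input indexes of the column vectors X(k) and x(n), respectively. In the present example, the resulting matrix equation after the second step may take the form of:

$$\begin{bmatrix} X(0)\\ X(2)\\ X(1)\\ X(3)\end{bmatrix} = \underbrace{\begin{bmatrix} 1 & W^{0} & 0 & 0\\ 1 & W^{2} & 0 & 0\\ 0 & 0 & 1 & W^{1}\\ 0 & 0 & 1 & W^{3}\end{bmatrix}}_{B}\; \underbrace{\begin{bmatrix} 1 & 0 & W^{0} & 0\\ 0 & 1 & 0 & W^{0}\\ 1 & 0 & W^{2} & 0\\ 0 & 1 & 0 & W^{2}\end{bmatrix}}_{A} \begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\end{bmatrix} \tag{7}$$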
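The two steps described above can be checked numerically for N=4. The sketch below is an illustration under stated assumptions: it takes the two factor matrices (named `A` and `B` here, following the labels used in the later discussion of FIG. 1) to be those of the standard radix-2 factorization for N=4, and the helper names `w`, `matvec` and `fft4_factored` are hypothetical.

```python
import cmath

N = 4
W = cmath.exp(-2j * cmath.pi / N)


def w(p):
    """Return W**p with the exponent reduced mod N, as in equation (5)."""
    return W ** (p % N)


def matvec(M, v):
    """Multiply a square matrix (list of rows) by a column vector."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]


# Full 4x4 DFT matrix with every exponent reduced mod N (first step).
W_MOD = [[w(n * k) for n in range(N)] for k in range(N)]

# Assumed factor matrices (second step). A acts on the data vector x(n),
# B acts on the intermediate array x1(n).
A = [[1, 0, w(0), 0],
     [0, 1, 0, w(0)],
     [1, 0, w(2), 0],
     [0, 1, 0, w(2)]]
B = [[1, w(0), 0, 0],
     [1, w(2), 0, 0],
     [0, 0, 1, w(1)],
     [0, 0, 1, w(3)]]


def fft4_factored(x):
    """Apply A then B, then undo the output permutation (0, 2, 1, 3)."""
    x1 = matvec(A, x)
    x2 = matvec(B, x1)
    # x2 holds X(0), X(2), X(1), X(3); swap the middle pair back.
    return [x2[0], x2[2], x2[1], x2[3]]
```

Comparing `matvec(W_MOD, x)` with `fft4_factored(x)` for an arbitrary complex `x` shows that the factored form reproduces the full matrix product once the permuted output order is undone.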
A signal flow graph is generally used to illustrate the operations performed in the matrix equation resulting from the matrix factorization process of the FFT algorithm. For equation (7) above, the signal flow graph may appear as that shown in FIG. 1. Note that, in the signal flow graph of FIG. 1, there are two computational arrays x.sub.1 (n) and x.sub.2 (n) resulting from the matrix computations of matrix A (equation 7) with the data vector x(n) and matrix B (equation 7) with the computational array x.sub.1 (n), respectively. In general, there will be .delta. computational arrays where N=2.sup..delta.. Further scrutiny of the signal flow graph of FIG. 1 reveals that only four complex multiplications and only eight complex additions are required to compute each array x.sub.1 (n) and x.sub.2 (n). Consequently, only eight complex multiplications are required for the entire transformation, reducing the required number of complex multiplications for N=4 from 16 for the DFT to 8 for the FFT.
In general, for N=2.sup..delta., the number of complex multiplications for the FFT algorithm is on the order of N log.sub.2 (N) as compared with N.sup.2 for the DFT. It is readily apparent that as N increases, the FFT algorithm provides even greater savings in complex multiplications than that shown for N=4. For example, if N=2.sup.10, then for the DFT, the number of complex multiplications would be 2.sup.20 ; however, for the FFT, the number would be on the order of 10×2.sup.10. This is a savings of complex multiplications on the order of 100 to 1, which makes the FFT algorithm convincingly superior to the DFT as a computational tool. For a more comprehensive study of the fast Fourier transform (FFT) for background purposes, reference is made herein to the text "The Fast Fourier Transform" by E. Oran Brigham, published in 1974 by Prentice-Hall, Inc. In addition, for a generalized summary of the state of the art improvements to the FFT algorithm, reference is also made to a paper entitled "A Prime Factor FFT Algorithm Using High Speed Convolution" written by Dean P. Kolba and Thomas W. Parks for the IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 4, August, 1977.
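The comparison above is simple arithmetic and can be tabulated with a few lines of code. The function name `multiplication_counts` is an illustrative assumption; the formulas N.sup.2 and N log.sub.2 (N) are those given in the text.

```python
def multiplication_counts(N):
    """Return (direct DFT count, FFT order-of-magnitude count) for N = 2**delta.

    The DFT figure is N**2 and the FFT figure is N * log2(N), as stated in
    the text. N must be a power of two.
    """
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of two, N >= 2")
    delta = N.bit_length() - 1   # exact integer log2 for a power of two
    return N * N, N * delta


if __name__ == "__main__":
    for N in (4, 1024):
        dft, fft = multiplication_counts(N)
        print(N, dft, fft, dft / fft)
```

For N=2.sup.10 the ratio is 2.sup.20 /(10×2.sup.10), or about 102, consistent with the "order of 100 to 1" savings quoted above.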
One known form of implementation of the FFT algorithm is the architecture of a pipeline of computational elements (CE's). The primary advantage of the pipeline FFT architecture is the parallel processing of the computational elements, which substantially improves computational speed with a hardware structure involving virtually no overhead for control functions. A typical example of a basic computational element (commonly termed a butterfly) of the radix-2, decimation in time version of the FFT algorithm is shown in FIG. 2. The butterfly or CE generally consists of one complex multiplication and two complex additions. These CE's are usually cascadedly coupled in the pipeline architecture to carry out the computations of the FFT.
Generally, the number of butterfly computations required for a complete transform for an N-point, radix-2 FFT is (N/2) log.sub.2 N, since the FFT algorithm consists of log.sub.2 N computational stages and each stage requires N/2 butterflies. A glance at FIG. 1 will verify these numbers for the case N=4. There are log.sub.2 4=2 computational stages and 4/2 butterflies for each stage, making a total of (4/2) log.sub.2 4 or 4 butterflies in all. However, realizing that W.sup.2 =-W.sup.0, and W.sup.3 =-W.sup.1, a pipeline structure of just two cascadedly coupled CE's may be arranged to perform the computations of the signal flow graph of FIG. 1. An exemplary schematic of such a pipeline architecture is shown in FIG. 3 wherein CE.sub.1 and CE.sub.2 are in a circuit form similar to that shown in FIG. 2 and wherein the blocks denoted by R and D are merely information holding and delay registers, respectively, enabled for updating data by a conventional clock (not shown). The holding and delay registers permit the parallel processing of the CE's to occur. A set of commutator switches SW1 and SW2 appropriately align the data presented to CE.sub.2 in a timely fashion.
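The butterfly count of (N/2) log.sub.2 N can be checked with a short recursive sketch. It assumes the common decimation-in-time butterfly, which forms a+wb and a-wb from one complex multiplication and two complex additions; the circuit of FIG. 2 is not reproduced, and the names `butterfly` and `fft_radix2` are illustrative only.

```python
import cmath


def butterfly(a, b, w):
    """One computational element: one complex multiply, two complex adds.

    The second output uses a - w*b, which relies on W**(k + N/2) = -W**k
    (for N = 4: W**2 = -W**0 and W**3 = -W**1).
    """
    t = w * b
    return a + t, a - t


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT.

    Returns (X, count) where count is the number of butterflies used.
    len(x) must be a power of two.
    """
    N = len(x)
    if N == 1:
        return list(x), 0
    even, count_even = fft_radix2(x[0::2])
    odd, count_odd = fft_radix2(x[1::2])
    X = [0j] * N
    count = count_even + count_odd
    for k in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * k / N)
        X[k], X[k + N // 2] = butterfly(even[k], odd[k], w)
        count += 1
    return X, count


if __name__ == "__main__":
    X, count = fft_radix2([1, 2, 3, 4])
    print(count)  # (4/2) * log2(4) butterflies
```

For N=4 the sketch uses 4 butterflies, and for N=8 it uses 12, matching (N/2) log.sub.2 N in both cases.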
A typical operation of the simplified pipeline embodiment of FIG. 3 may be described in connection with the signal flow graph of FIG. 1. To start with, the input data array points x(0) and x(2) may be presented at input points A and B, respectively. At the next clock pulse i, the data is captured by registers R1 and R2, respectively. During the interval between clock pulses i and i+1, CE.sub.1 computes x.sub.1 (0) and x.sub.1 (2) and presents this data to registers D1 and D2, respectively. Concurrently, data points x(1) and x(3) are presented to registers R1 and R2. At the clock pulse i+1, all the registers are updated. Between clock pulses i+1 and i+2, CE.sub.1 computes x.sub.1 (1) and x.sub.1 (3) in accordance with the signal flow graph of FIG. 1 and the previously calculated x.sub.1 (0) is present at A'. Note that D1 is a one pulse delay register and D2 is a two pulse delay register. At the next clock pulse i+2, all the registers are again updated and switches SW1 and SW2 are controlled to position 1. Between clock pulses i+2 and i+3, CE.sub.1 computes x.sub.1 '(0) and x.sub.1 '(2) from the second set of input data x'(0) and x'(2) presented thereto. In addition, present at A' and C' are the computed points x.sub.1 (1) and x.sub.1 (0), respectively, which are presented to CE.sub.2 through switches SW1 and SW2 (position 1) wherein x.sub.2 (0) and x.sub.2 (1) are computed in accordance with the signal flow graph in FIG. 1. At the next clock pulse i+3, all the registers are again updated and switches SW1 and SW2 are controlled to position 2. Thereafter, x.sub.1 (3) and x.sub.1 (2) are present at B' and D', respectively.
Between clock pulses i+3 and i+4, CE.sub.1 computes x.sub.1 '(1) and x.sub.1 '(3) from the second set of input data points x'(1) and x'(3) presented thereto, and in a parallel processing manner, CE.sub.2 computes x.sub.2 (2) and x.sub.2 (3) from x.sub.1 (3) and x.sub.1 (2) which are presented thereto from B' and D' through switches SW1 and SW2 (position 2). Note that x.sub.2 (0) and x.sub.2 (1) are present at the outputs C and D, respectively, after clock pulse i+3 and similarly, x.sub.2 (2) and x.sub.2 (3) are present at the outputs C and D, respectively, after clock pulse i+4. The radix-2 pipeline processor of FIG. 3 continues in a similar manner to process serially input pairs of data in a parallel processing fashion and serially output the computed results in predetermined data pairs in accordance with the computational array pattern of the signal flow graph of FIG. 1.
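The order in which data pairs reach the two computational elements can be modeled in a few lines. The sketch below is a deliberately simplified, assumed model: it reproduces only the pairing of operands and the order of the output pairs described above, not the register timing, the delay elements D1 and D2, or the overlap of successive data sets. The names `ce` and `pipeline_n4` are hypothetical, and each CE is assumed to form a+wb and a-wb.

```python
import cmath

W = cmath.exp(-2j * cmath.pi / 4)   # W for N = 4


def ce(a, b, w):
    """Assumed computational element: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t


def pipeline_n4(blocks):
    """Yield the output pairs (C, D) for a stream of 4-point input blocks.

    Each block yields two pairs: first (x2(0), x2(1)), then (x2(2), x2(3)),
    which correspond to (X(0), X(2)) and (X(1), X(3)).
    """
    for x in blocks:
        # CE1, first pair of inputs: x(0), x(2) -> x1(0), x1(2)
        x1_0, x1_2 = ce(x[0], x[2], W ** 0)
        # CE1, second pair of inputs: x(1), x(3) -> x1(1), x1(3)
        x1_1, x1_3 = ce(x[1], x[3], W ** 0)
        # Switches in position 1: CE2 receives x1(0) and x1(1)
        yield ce(x1_0, x1_1, W ** 0)
        # Switches in position 2: CE2 receives x1(2) and x1(3)
        yield ce(x1_2, x1_3, W ** 1)


if __name__ == "__main__":
    for pair in pipeline_n4([[1, 2, 3, 4]]):
        print(pair)
```

In this model each CE is used twice per four-point data set, so two CE's suffice for the four butterflies of the N=4 signal flow graph; the outputs emerge in the permuted order X(0), X(2), X(1), X(3).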
It is understood that the example described above in connection with FIGS. 1, 2 and 3 was used merely to provide a simple understanding of the FFT algorithm and pipeline processor architecture in connection therewith. For a more detailed explanation of FFT pipeline processors, reference is made herein to the text entitled "Applications of Digital Signal Processing" edited by Alan V. Oppenheim, published by Prentice-Hall, Inc. (1978), primarily Chapter 5, pp. 265-279, which was authored by J. H. McClellan and R. J. Purdy, both of MIT, and the text entitled "Theory and Application of Digital Signal Processing" by Rabiner and Gold, published by Prentice-Hall, Inc. (1975), primarily Chapter 10.
As was stated above, the FFT pipeline processor consists of log.sub.2 N stages for an N-point, radix-2 FFT and in general, each stage comprises a complex multiplication which in a hardware mechanization may involve two multipliers to better facilitate the subsequent complex additions of the butterfly computational element with respect to the parallel pipeline processing. It becomes readily evident that for a 32-point FFT pipeline processor, for example, as many as 10 hardware multipliers may be required, for a 64-point, as many as 12, and for a 128 point, as many as 14. It is well known to those skilled in the pertinent art that hardware multipliers, especially those of the digital variety of say 12 to 16 bits accuracy, involve many interconnected medium-scale-integrated (MSI) circuits which are very costly and take up excessive printed circuit (PC) board areas in the fabrication thereof and in addition, the hardware multipliers consume precious computational time in the operation thereof.
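The multiplier counts quoted above follow from two hardware multipliers per stage and log.sub.2 N stages. A minimal sketch of that arithmetic is given below; the function name `pipeline_multipliers` and the `per_stage` parameter are illustrative assumptions.

```python
def pipeline_multipliers(N, per_stage=2):
    """Upper bound on hardware multipliers for an N-point radix-2 pipeline.

    Assumes log2(N) stages and, as in the text, up to two multipliers per
    stage. N must be a power of two.
    """
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of two, N >= 2")
    stages = N.bit_length() - 1   # exact integer log2 for a power of two
    return per_stage * stages


if __name__ == "__main__":
    for N in (32, 64, 128):
        print(N, pipeline_multipliers(N))
```

The values for N=32, 64 and 128 are 10, 12 and 14, respectively, as stated in the text.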
One of the first to successfully reduce the number of multiplication computations of the FFT by making use of the group theoretic properties of the W matrix was S. Winograd. In his concise paper entitled "On Computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. U.S.A., vol. 73, no. 4, pp. 1005-1006, April 1976, Winograd combines the conversion of a DFT to convolution for prime and prime power lengths with new convolution algorithms, which were being developed by Agarwal and Cooley at that time, for deriving short transforms. He proposed that long transforms be computed by nesting these short, high speed transforms and compared the number of operations required with that of the conventional FFT. For a comprehensive summary of the work by Winograd and others involving the conversion of a DFT to circular convolution and convolution with a minimum number of multipliers, reference is again made to the paper by Kolba, et al. cited hereinabove.
In general, Winograd suggests that, by combining the cyclical properties of the W matrix of the FFT with certain number theoretic properties of integers, the number of multiplications required to perform a DFT may be reduced by a factor of 5 to 10 over existing FFT algorithms. However, the Winograd discrete Fourier transform (WDFT) does not minimize the number of adds in all known cases. Nonetheless, this is not a significant disadvantage, because the number of adds of the WDFT stays within about 10% of the number required for an FFT. Accordingly, the overall number of computations is reduced relative to the FFT by using the WDFT (see Kolba, et al. referenced above for more specific details).
One primary drawback of the WDFT algorithm is that it is not directly reducible to practice in a pipeline processor because the structure of the WDFT is far less regular than that of the FFT and is thus inefficient to implement in terms of a hardware pipeline machine as described supra. In addition, the input and output sequences to the computational stages are at times permuted in an unusual manner. The only implemented versions of the WDFT algorithm that are known to exist are general purpose digital computer programs where the structure of its nested signal flow graphs and permutations of its input/output sequences are handled in the programs. The paper, entitled "An Introduction to Programming the Winograd Fourier Transform Algorithm (WFTA)", by H. F. Silverman, published in the IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-25, pp. 152-165, April 1977, provides a more detailed description of examples of programming the Winograd transform and is referenced herein for that purpose. Another paper, entitled "Fixed Point Error Analysis of the Winograd Fourier Transform Algorithm" by Robert W. Patterson, submitted as a Masters Thesis to the Massachusetts Institute of Technology (MIT) in September 1977, sheds light on the irregularities in the signal flow graph and the unusual permutations of the input/output sequence of the Winograd transform, particularly that of the 5 point DFT shown on page 86 therein. The Patterson thesis is additionally referenced herein for a more detailed discussion of the WFTA. A recently issued U.S. Pat. No. 4,156,920, issued May 29, 1979 to S. Winograd, may provide additional background material.
In summary, the main drawbacks of the WDFT that have apparently precluded its implementation in hardware pipeline mechanization are: (1) it is less efficient to mechanize because of its less regular structure, and (2) it requires complicated memory bookkeeping in hardware because of the unusual input and output sequences to various computational stages of the short transforms nested therein. It is understood that as systems are required to perform increasingly complex functions more quickly, there is a need for a DFT, like that taught by Winograd, which is apparently noticeably faster than the FFT. For large transforms this step may make the complex computations feasible and for smaller transforms, this step may allow faster computation or at least computation with less hardware.
Presented herebelow is a multipoint, radix-2, pipeline processor which is believed to solve the aforementioned problems of the WDFT by providing an inherently faster, more cost effective way of implementing the DFT of a relatively large number of data points as compared with a comparable state of the art multipoint FFT pipeline processor. The preferred embodiment described herebelow departs from the FFT and WDFT by combining the reduction in multiplicative computations of the WDFT with the structural regularity of the FFT to provide an improved multipoint, radix-2, pipeline processor.