The present invention relates to the field of digital signal processing, and more specifically, to a system and method for performing forward and inverse discrete Fourier transforms in a microprocessor-based processing environment.
Digital signal processing (DSP) is fundamental to a variety of modern technologies. Improvements in DSP algorithms are rapidly incorporated into new products and devices, and serve as a motive force behind the modem telecommunications industry. This is especially true for improvements in transformation algorithms such as the Fast Fourier Transform.
As with other types of processing, digital signal processing may be accomplished in a batch mode or a real-time mode. Batch-mode processing involves (a) reading input data from memory, (b) performing a number DSP operations, and (c) storing the resultant output data to memory. One disadvantage of batch-mode processing is that the storage-capacity requirement for the memory may be prohibitively high. For example, if the input data represents frames of a video signal, a large amount of memory may be consumed to store a few minutes worth of video frames.
On the other hand, real-time processing involves receiving an input data stream from a data source, operating on the input data stream (i.e. applying a DSP algorithm) to generate an output data stream, and providing the output data stream to a data sink, where the data rate of the input data stream and/or the output data stream are specified. For example, the data source may be a video camera which generates video frames at a fixed rate, and the data sink may be a video monitor. Real-time processing requires that the average rate of throughput of the DSP algorithm must be greater than or equal to the specified data rate. The input and output data streams are generally buffered. The size of the input buffer and the specified data rate determine a maximum time for the DSP algorithm to process the input buffer contents.
Real-time processing generally implies less of a necessity for data storage since the output data stream is consumed (i.e. used) by the data sink shortly after it is generated by the DSP algorithm. However, the computational throughput requirement of real-time processing typically places a severe demand on the efficiency of the DSP algorithm and the speed of the processor used to implement it.
DSP theory provides a host of tools for the analysis and representation of signal data. Among these tools, the Fourier Transform is one of the most important. The Fourier Transform operates on an input function in order to generate a resultant function. FFTs operate upon a block of data extracted from this input function, either sequential blocks or overlapping blocks. There are real and complex-valued input and output versions of the FFT, in-place and not-in-place computational forms. The complex value of the resultant function at a given frequency represents the amplitude and phase of a sinusoidal component of the input function. Thus, the Fourier Transform is said to perform a conversion between the time domain and the frequency domain. The Fourier Transform just described is generally referred to as the forward Fourier transform to distinguish it from the inverse Fourier transform. The inverse Fourier Transform inverts (i.e. reverses) the forward Fourier transform. Thus, the inverse Fourier Transform operates on functions of frequency and generates corresponding functions of time.
By use of the forward and inverse Fourier Transforms, a signal may be converted between a time-domain representation and a frequency-domain representation. The forward Fourier transform is said to perform a spectral analysis of a time-domain signal because it computes the amplitude and phase of all the sinusoidal components of the time-domain signal. The inverse Fourier transform is said to perform a spectral synthesis of a time-domain signal because the formula for the inverse transform describes the time-domain signal as a linear combination (albeit infinite combination) of the spectral components.
Fourier transforms are inherently continuous mathematical transformations. Since computers are constrained to operate on discrete data, a special discretized form of the Fourier transform is used for computer implementations, i.e. the so called Discrete Fourier Transform (DFT). The Discrete Fourier Transform may be used in a wide variety of applications and allows an arbitrary input array size. However, the straightforward DFT algorithm is often prohibitively time-consuming.
In 1965, Cooley and Tukey disclosed an efficient algorithm for performing the Discrete Fourier Transform. This algorithm, known as the Fast Fourier Transform, exploits symmetries inherent in DFTs of order 2K for any integer K. The order of a DFT may be defined as the size of its input array. The structure of the FFT algorithm allows one complex multiplication factor ejw to be used for all the butterflies in a group rather than having to recalculate complex-valued multiplication factors for each butterfly.
In addition, the FFT algorithm uses either (a) a natural ordering to access input operands and a special interleaved order to write output operands, or (b) the special interleaved order to access input operands and natural ordering to write output operands. The special interleaved ordering is then used either at the input or the output but not both. For clarity of discussion, suppose the input operands are accessed in the interleaved order. Furthermore, suppose that the input array comprises eight samples of a signal X. Thus, the eight samples may be represented as X(000), X(001), X(010), X(011), X(100), X(101), X(110) and X(111) in the order they appear in the input array. The FFT algorithm accesses these input values in the order X(000), X(100), X(010), X(110), X(001), X(101), X(011) and X(111). Coley and Tukey realized that the addresses of samples in the access order and the memory order were effectively bit reversed. Another way to interpret the special access order is in terms of a reverse carry bit from an address calculation adder. Carries are propagated from left to right rather than the more typical right-to-left direction.
In many processor architectures used for Digital Signal Processing, such as the popular Harvard architecture, the throughput of a processing algorithm is determined primarily by the total number of data multiplies and additions. Data moves and branching operations (such as, e.g., those which implement a loop construct) may require significantly lower overhead than multiplies and additions in many architectures. Consequently, a reasonable estimate of the computational complexity of a processing algorithm may be obtained by counting the number of data multiplies as a function of the number of data points in the input buffer. An N-point DFT requires N2 multiplies. For large array sizes (which correspond to fine frequency resolution), the number of arithmetic operations then becomes extremely large. In contrast, the number of multiplies for an N-point FFT is Nxc2x7log2(N). Thus, the FFT exhibits a greatly reduced complexity relative to the DFT especially for larger array sizes.
The Fast Fourier Transform algorithm, because of its greatly reduced computational complexity, allowed more sophisticated signal processing algorithms to be implemented in real-time applications than are possible with the DFT. For example, FFTs are one key component of interactive speech-recognition algorithms, so faster FFTs enable better and more complete analysis of speech.
Traditional x86 processors are not well adapted for the types of calculations used in signal processing. Thus, signal processing software applications on traditional x86 processors have lagged behind what was realizable on other processor architectures. There are been various attempts to improve the signal processing performance of x86-based systems. For example, microcontrollers optimized for signal processing computations (DSPs) have been provided on plug-in cards or the motherboard. These microcontrollers operated essentially as coprocessors enabling the system to perform signal processing functions.
Advanced Micro Devices, Inc. (hereinafter referred to as AMD) has proposed and implemented a set of SIMD (single-instruction multiple-data) instructions on an x86 processor known as the AMD-K6(copyright)-2 and subsequent processors. The SIMD instructions are collectively referred to as 3DNow!(trademark). The AMD-K6(copyright)-2 is highly optimized to execute the 3DNow!(trademark) instructions with minimum latency. The 3DNow!(trademark) instructions invoke parallel floating-point operations. Software applications written for execution on the AMD-K6(copyright)-2 may use the 3DNow!(trademark) instructions to accomplish signal processing functions and traditional x86 instructions to accomplish other desired functions. Thus, the AMD-K6(copyright)-2 provides general purpose and signal processing capacity on a single chip, and therefore, at a cost which is significantly less than x86-based systems with DSP coprocessors.
Furthermore, since x86 processors tend to be on a very steep revision curve, the total DSP processing power available through the 3DNow!(trademark) instructions should continue to increase very rapidly with attendant revolutions in the types of applications that are possible on personal computers. Several semiconductor vendors have licensed and intend to implement 3DNow!(trademark) as an industry standard. Thus, it is especially desirable to achieve implementations of the FFT algorithm which optimize the use of the 3DNow!(trademark) instruction set on x86 processors.
A number of fast algorithms for performing the DFT are described by Alan V. Oppenheim et al. in xe2x80x9cDiscrete-Time Signal Processingxe2x80x9d [published by Prentice Hall, ISBN 0-13-216292-X, pages 587-609] which is hereby incorporated by reference. FIG. 1 is a signal flow graph which corresponds to a prior art FFT algorithm (see Oppenheim, xe2x80x9cDiscrete-Time Signal Processingxe2x80x9d, page 587). The flow of signal information is from left to right in the diagram. FIG. 1 illustrates an eight point inverse FFT. The complex input values are denoted x(0) through x(7). The complex output values are denoted y(0) through y(7), and represent the inverse FFT values in bit-reversed order. The flow graph includes three passes labeled with index m running from one to three. In general, the number of points N to be transformed and the number of passes v are related by the equation v=log2(N). The coefficients WNxe2x88x92k are defined by the relation             W      N              -        k              =          exp      ⁢              (                              2            ⁢                          xe2x80x83                        ⁢            π            ⁢                          xe2x80x83                        ⁢                          k              ·              j                                N                )              ,
where j is the complex square-root of xe2x88x921. The coefficients WNxe2x88x92k are often referred to as xe2x80x9ctwiddle factorsxe2x80x9d. In Cartesian form, the twiddle factor expression takes the form       W    N          -      k        =            cos      ⁡              (                              2            ⁢                          xe2x80x83                        ⁢            π            ⁢                          xe2x80x83                        ⁢            k                    N                )              +          j      ·                        sin          ⁡                      (                                          2                ⁢                                  xe2x80x83                                ⁢                π                ⁢                                  xe2x80x83                                ⁢                k                            N                        )                          .            
It is noted that the computations in the flow graph of FIG. 1 naturally occur in symmetric pairs called butterflies. FIG. 2 illustrates a butterfly from the generic pass m. The butterfly receives two complex-valued inputs urm and usm from the previous pass (mxe2x88x921), and generates two complex-valued outputs urm+1 and usm+1 which are supplied to the next pass (m+1). The indices r and s define the vertical position of operands in the flow graph, e.g. the value urm occurs on the rth horizontal line (at the input to the mth pass). The structure of the butterfly coincides with the following equations defining urm+1 and usm+1 in terms of urm and usm:
urm+1=urm+WNxe2x88x92kxc2x7usm,
usm+1=urmxe2x88x92WNxe2x88x92kxc2x7usm.
Each pass of the flow graph of FIG. 1 is organized into one or more groups of similar butterflies. For example, the first pass (m=1) comprises a collection of (N/2) successive butterflies which use the same coefficient WN0, where N is the number of points to be transformed. The second pass (m=2) comprises two distinct groups of butterflies, where each group includes (N/4) successive butterflies involving the same coefficient WNxe2x88x92k: the first group uses coefficient WN0, and the second group uses coefficient WNxe2x88x922. In the generic pass m, there are 2mxe2x88x921 groups of butterflies. Each of these groups has 2vxe2x88x92m butterflies which use the same coefficient WNxe2x88x92k.
Furthermore, the distance between butterfly operands given by difference (sxe2x88x92r) depends on the pass m. In the first pass (m=1), the butterfly operands are separated by a distance of (N/2) points. In the second pass (m=2), the butterfly operands are separated by a distance of (N/4) points. In general, the butterfly operands are separated by distance sxe2x88x92r=2vxe2x88x92m.
The problems outlined above of implementing a rapid (forward or inverse) Fast Fourier Transform on an x86-based microprocessor are in large part solved by the software-based method of present invention. Processors such as the AMD-K6(copyright)-2 which support 3DNow!(trademark) are configured to execute instructions (or parts of instructions) in parallel. According to the present invention, the execution efficiency of an FFT algorithn may be maximized by carefully composing code sequences of the FFT algorithm so that resource conflicts and the attendant processor stalls are avoided.
The present invention contemplates a system and method for performing a Fast (forward or inverse) Fourier transform in a computer system comprising an x86 processor which supports 3DNow!(trademark) instructions, and a memory. The 3DNow!(trademark) instructions are explicitly parallel and operate on two sets of 32-bit floating-point data simultaneously. A special set of internal registers may serve as the source and/or destination registers for the 3DNow!(trademark) instructions. Each of the internal registers includes a lower 32-bit component and an upper 32-bit component. In many implementations, the existing 64-bit floating-point/MMX(copyright) registers may also be utilized as the 3DNow!(trademark) source and/or destination registers with attendant additional logic to implement the 3DNow!(trademark) SIMD usage of these registers.
The method of the present invention preferably comprises: (a) executing an initial processing loop which accesses an input array contained within the memory (or cache) and generates a second-pass output array, wherein the input array stores N complex input values; (b) executing log2(N)xe2x88x923 intermediate-pass processing iterations, where a first one of the intermediate-pass processing iterations operates on the second-pass output array, and a last one of the intermediate-pass processing iterations generates a penultimate-pass output array; and (c) executing a final-pass processing loop which operates on the penultimate-pass output array and generates a final output array, wherein the final output array represents the Discrete Fourier Transform (DFT) or Inverse Discrete Fourier Transform (DFT) of the input array values. (In an in-place computation, the input array, the second-pass output array, the penultimate-pass output array, and the final output array are separate names for a common block of memory which is overwritten repeatedly).
In the preferred embodiment, each of the intermediate-pass processing iterations comprises a series of group iterations, and furthermore, each group iteration includes a series of butterfly-pair iterations. The generic butterfly-pair iteration comprises executing a code sequence which includes a plurality of the parallel floating-point instructions. In addition to the plurality of parallel floating-point instructions, other instructions are necessary in the code sequence. These xe2x80x9cother instructionsxe2x80x9d may be inserted between two of the parallel floating-point instructions which use a common execution resource in the x86 processor such as the 3DNow!(trademark) adder or multiplier. Thus, resource conflicts are advantageously avoided at execution time.
The non-SIMD instructions of the code sequence also include a plurality of load instructions. The load instructions are also separated from each other in the code sequence by non-load instructions. This separation allows each load instruction sufficient time to complete before the next load instruction is issued, and thereby avoids stalls in the load pipeline. The non-load instructions may include some of the plurality of parallel floating-point instructions.
In the evolution of modem x86-architecture microprocessors clock-cycle times have continuously decreased. Accordingly, the time required to access system memory has become an increasingly significant constraint to the processing capacity of x86 microprocessors. Thus, tremendous effort has been targeted toward the development of technologies for decreasing memory access time and increasing memory access band-width. As a result of these technologies, in the AMD-K6(copyright), AMD-K6(copyright)-2, AMD-K6(copyright)-III, and subsequent members of the family, accessing a full width operand from memory is generally as fast as accessing a narrower operand. In addition, the time required to access a 64-bit operand with a single SIMD instruction such as movq or punpckldq is much smaller than the time required to access the same data in two 32-bit slices with two non-SIMD instructions. Thus, by accessing 64-bit operands with SIMD instructions, the FFT algorithm of the present invention may advantageously access pairs of 32-bit data more efficiently than prior art algorithms.
The initial-pass processing loop performs butterfly computations associated with the first two passes of the FFT in an integrated (i.e. unified) fashion. Since the complex coefficients W(xe2x88x92t,N) for the first two passes reduce to either 1 or j (i.e. square-root of xe2x88x921), the equations for the first two passes easily simplify and admit combination. The initial-pass processing loop executes N/2 butterfly-pair iterations. Each of the butterfly-pair iterations generates four complex output values which are written to the second-pass output array. The real and imaginary parts of each of these complex output values are computed by executing a set of parallel operations on the data pairs stored in the internal registers.