1. Field of the Invention
The present invention relates to performing Fast Fourier Transforms (FFTs) using a processor, and more specifically, to a method and an apparatus for performing FFTs using a vector processor with routable operands and independently selectable operations.
2. Description of the Related Art
The pursuit of higher performance has long been a defining feature of the computer and microprocessor industries. In many applications such as computer-aided design and graphics, higher performance is always needed to quickly translate users' commands into actions, thereby enhancing their productivity. Currently, the IBM PC computer architecture, based on Intel Corporation's X-86 family of processors, is an industry-standard architecture for personal computers. Because the IBM PC architecture is an industry standard, the architecture has attracted a broad array of software vendors who develop IBM PC compatible software. Furthermore, competition within the industry standard architecture has resulted in dramatic price performance improvements, thereby leading to a more rapid acceptance of computing technology by end users. Thus, the standardized nature of the IBM PC architecture has catapulted IBM PC compatible machines to a dominant market position.
The standardized nature of the IBM PC architecture is also a double-edged sword, for if the computer is not PC compatible, the sales potential for the computer becomes severely diminished. The reason for the limitation is that much of the existing software that runs on PCs makes explicit assumptions about the nature of the hardware. If the hardware provided by the computer manufacturer does not conform to those standards, these software programs will not be usable. Thus, PC system designers are constrained to evolutionary rather than revolutionary advances in the PC architecture in order to remain compatible with earlier IBM PC computers. However, it is desirable to take advantage of the semiconductor industry's ability to integrate large numbers of transistors per chip to satisfy the pent-up demand for more computing power in communication, multimedia and other consumer products.
The need for higher performance processors is evident in a number of applications such as communication, multimedia, image processing, voice recognition and scientific/engineering analysis which need to convert time domain data into frequency domain data via a mathematical link called a Fourier transform. Historically, time domain analysis is popular because people are analytically comfortable with analyzing events as a function of time, but the senses are more directed to the frequency domain. For instance, when listening to music or speech, humans do not hear individual pressure variations of the sound because they occur so quickly in time. Instead, what is heard is the changing pitch or frequency. Similarly, human eyes do not see individual oscillations of electromagnetic fields or light. Rather, colors are seen. In fact, humans do not directly perceive any fluctuations or oscillations which change faster than approximately 20 times per second. Any faster changes manifest themselves in terms of the frequency or the rate of change, rather than the change itself. Thus, the concept of frequency is as important and fundamental as the concept of time. Furthermore, in many applications, transform analysis is popular because it is often easier to formulate problems in the frequency domain rather than the time domain in designing systems. The central idea of transform theory is that some information about the system, such as the time or spatial domain description, can be transformed into an equivalent description that simplifies design or analysis.
As many natural or man-made waveforms are periodic and can be expressed as a sum of sine waves, discrete data points can be taken and translated into the frequency domain using a Discrete Fourier Transform (DFT) rather than computing the continuous spectra of the signal. In general, the types of Fourier transform applications include: number based, pattern based, and convolution based. Examples of number based applications include spectrum analysis which is used in instrumentation, audio-video processing, velocity estimation and radar signal processing. With respect to pattern based applications, many problems involve the recognition and detection of signals with a specific frequency content, such as a spectral pattern in a speech pattern. In the pattern based application, conversion to frequency domain is often a small step in the overall task and it is important that the conversion process be fast to allow for sufficient time to perform other computationally intensive pattern matching techniques. Finally, in convolution based applications, the Fourier transform is used as a simple mathematical tool to perform general filtering.
The Fourier transform of an analog signal $a(t)$, expressed as

$$A(\omega) = \int_{-\infty}^{\infty} a(t)\, e^{-j\omega t}\, dt,$$

determines the frequency content of the signal $a(t)$. In other words, for every frequency, the Fourier transform $A(\omega)$ determines the contribution of a sinusoid of that frequency in the composition of the signal $a(t)$. For computations on a digital computer, the signal $a(t)$ is sampled at discrete-time instants. If the input signal is digitized, a sequence of numbers $a(n)$ is available instead of the continuous-time signal $a(t)$. Then the Fourier transform takes the form

$$A(e^{j\omega}) = \sum_{n=-\infty}^{\infty} a(n)\, e^{-j\omega n}.$$
The resulting transform $A(e^{j\omega})$ is a periodic function of $\omega$ and needs to be computed for only one period. The actual computation of the Fourier transform of a stream of data presents difficulties because $A(e^{j\omega})$ is a continuous function in $\omega$. Because the transform must be computed at discrete points, the properties of the Fourier transform led to the definition of the Discrete Fourier Transform (DFT), given by

$$A(k) = \sum_{n=0}^{N-1} a(n)\, e^{-j2\pi nk/N}.$$
Where $a(n)$ consists of $N$ points $a(0), a(1), \ldots, a(N-1)$, the frequency-domain representation is given by the set of $N$ points $A(k)$, $k = 0, 1, \ldots, N-1$. The previous equation becomes

$$A(k) = \sum_{n=0}^{N-1} a(n)\, W_N^{nk},$$

where $W_N^{nk} = e^{-j2\pi nk/N}$. The factor $W_N$ is sometimes referred to as the twiddle factor.
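The DFT definition above can be evaluated directly as a double loop over the output index k and the input index n. The following is a minimal Python sketch for illustration only; the function name `dft` and the use of the standard `cmath` module are assumptions of this example and are not part of the described apparatus.

```python
import cmath


def dft(a):
    """Directly evaluate A(k) = sum over n of a(n) * W_N^(n*k),
    with W_N = exp(-j*2*pi/N), for k = 0 .. N-1.

    Illustrative sketch: costs N complex multiplications per output point.
    """
    n_points = len(a)
    result = []
    for k in range(n_points):
        total = 0j
        for n in range(n_points):
            # Twiddle factor W_N^(n*k) = exp(-j*2*pi*n*k/N)
            twiddle = cmath.exp(-2j * cmath.pi * n * k / n_points)
            total += a[n] * twiddle
        result.append(total)
    return result
```

For a constant input such as `[1, 1, 1, 1]`, all of the energy falls in the k = 0 bin, which is a quick sanity check on the sign and index conventions.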
The amount of computation involved in evaluating the convolution integral becomes particularly large when its impulse response $H(t)$ has a long time duration. Thus, DFTs are computationally expensive: for every frequency point, $N-1$ complex summations and $N$ complex multiplications need to be performed. With $N$ frequency points, and counting two real sums for every complex summation and four real multiplications and two real sums for every complex multiplication, the complexity of an $N$-point DFT is $4N^2 - 2N$ real summations and $4N^2$ real multiplications. Thus, for each 1,024-point DFT, 4,194,304 real multiplications are required. Typical applications require a number of these 1,024-point DFTs to be performed per second in real time. Hence, the applications of DFTs had been limited until the advent of the Fast Fourier Transform (FFT).
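The operation counts in the preceding paragraph follow from simple bookkeeping, sketched below in Python. The function name `dft_real_op_counts` is hypothetical; the per-operation costs (four real multiplications and two real sums per complex multiplication, two real sums per complex summation) are the ones stated in the text.

```python
def dft_real_op_counts(n_points):
    """Count real multiplications and real additions for a direct N-point DFT.

    Per frequency point: N complex multiplications and N-1 complex summations.
    """
    complex_mults_per_point = n_points
    complex_sums_per_point = n_points - 1

    # Each complex multiplication costs 4 real multiplications and 2 real sums;
    # each complex summation costs 2 real sums.
    real_mults_per_point = 4 * complex_mults_per_point
    real_adds_per_point = 2 * complex_mults_per_point + 2 * complex_sums_per_point

    real_mults = n_points * real_mults_per_point   # equals 4*N^2
    real_adds = n_points * real_adds_per_point     # equals 4*N^2 - 2*N
    return real_mults, real_adds
```

Calling `dft_real_op_counts(1024)` returns `(4194304, 4192256)`, matching the 4,194,304 real multiplications quoted above.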
Many variations exist in the formulation of the FFT process. Among the basic approaches where $N = 2^r$ and $r$ is an integer, one approach, decimation in time, is based upon separating $a(n)$ into two sequences of length $N/2$ comprised of the even- and odd-indexed samples, respectively, i.e.,

$$A(k) = \sum_{n=0}^{N/2-1} a(2n)\, W^{2nk} + W^{k} \sum_{n=0}^{N/2-1} a(2n+1)\, W^{2nk}.$$
Each of these summations is recognized as being simply an $N/2$-point DFT of the respective sequence because

$$W^2 = e^{-2j(2\pi/N)} = e^{-j2\pi/(N/2)}.$$
Hence, if the DFT $A_e(k)$ is generated for the even-indexed sequence $a(0), a(2), \ldots, a(N-2)$ and the DFT $A_o(k)$ for the odd-indexed sequence $a(1), a(3), \ldots, a(N-1)$, the overall DFT is arrived at by combining the sequences as

$$A(k) = A_e(k) + W^k A_o(k).$$
As discussed earlier, the complex coefficients $W^k$ are known as twiddle factors. The $N/2$-point DFTs $A_e(k)$ and $A_o(k)$ are periodic in $k$ with period $N/2$, and thus their values for $k \geq N/2$ need not be recomputed, given those for $0 \leq k < N/2$. This process is then applied again and again until only 2-point DFTs remain to be computed. That is, each $N/2$-point DFT is computed by combining two $N/4$-point DFTs, each of which is computed by combining two $N/8$-point DFTs, and so on, for $r$ stages since $N = 2^r$. The initial 2-point DFTs require coefficients of only $\pm 1$.
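The even/odd splitting described above maps naturally onto a recursive routine. The Python sketch below is illustrative only and assumes the input length is a power of two; the name `fft_recursive` is hypothetical. For brevity it recurses down to 1-point DFTs rather than stopping at 2-point DFTs, which yields the same result.

```python
import cmath


def fft_recursive(a):
    """Radix-2 decimation-in-time FFT (illustrative sketch).

    Assumes len(a) is a power of two. Returns a new list holding A(k).
    """
    n_points = len(a)
    if n_points == 1:
        # A 1-point DFT is the sample itself.
        return [complex(a[0])]

    even_dft = fft_recursive(a[0::2])   # N/2-point DFT of even-indexed samples
    odd_dft = fft_recursive(a[1::2])    # N/2-point DFT of odd-indexed samples

    half = n_points // 2
    out = [0j] * n_points
    for k in range(half):
        # Twiddle factor W^k = exp(-j*2*pi*k/N)
        t = cmath.exp(-2j * cmath.pi * k / n_points) * odd_dft[k]
        # The half-length DFTs are periodic with period N/2, and
        # W^(k + N/2) = -W^k, so both halves reuse the same product.
        out[k] = even_dft[k] + t
        out[k + half] = even_dft[k] - t
    return out
```

The recursion depth is r for N = 2^r, corresponding to the r combining stages described in the text.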
The FFT routine therefore reduces the complexity from on the order of $N^2$ complex multiplications and additions in the case of a DFT to $\log_2 N$ stages, each of which requires up to $N$ complex multiplications by twiddle factors and $N$ complex additions. An important aspect of the FFT algorithm is that it can be computed in place in memory. That is, if the input array $a(n)$ is not needed in other processing, it can be overwritten with intermediate results of successive stages until it finally contains the DFT $A(k)$. Hence, except for a few working registers, no additional memory is required. Thus, where the outputs of the $i$th stage are denoted as $A_i(k)$, the FFT process consists of pairs of computations of the form

$$A_i(k) = A_{i-1}(k) + W^{m} A_{i-1}(l)$$

$$A_i(l) = A_{i-1}(k) + W^{m+N/2} A_{i-1}(l)$$

where the initial inputs $A_0(k)$ are the $a(n)$ in bit-reversed order. This basic computational pair is known as an FFT butterfly computation.
After completing each butterfly, the input pair $A_{i-1}(k)$ and $A_{i-1}(l)$ can be replaced in memory by the output pair $A_i(k)$ and $A_i(l)$ because the inputs will not be needed any more. Hence, the computation can proceed in place. The factor $W^{m+N/2}$ suggests an additional saving of a factor of two in computation because $W^{N/2} = -1$, so that $W^{m+N/2} = -W^{m}$. Therefore, each butterfly can be computed with only one complex multiplication.
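The in-place butterfly scheme described above can be sketched as follows. This Python example is a minimal illustration, assuming a mutable list whose length is a power of two; the names `bit_reverse_permute` and `fft_in_place` are hypothetical. Each butterfly forms a single complex product and uses it with opposite signs for the two outputs.

```python
import cmath


def bit_reverse_permute(a):
    """Reorder the list a in place so element i moves to the bit-reversed index."""
    n = len(a)
    j = 0
    for i in range(1, n):
        # Increment j as a bit-reversed counter.
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]


def fft_in_place(a):
    """In-place radix-2 decimation-in-time FFT (illustrative sketch).

    Overwrites a with its DFT. Assumes len(a) is a power of two.
    """
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    bit_reverse_permute(a)

    span = 1  # distance between the two inputs of a butterfly
    while span < n:
        for start in range(0, n, 2 * span):
            for offset in range(span):
                k = start + offset
                l = k + span
                # Twiddle factor for a (2*span)-point combining stage.
                w = cmath.exp(-1j * cmath.pi * offset / span)
                t = w * a[l]          # the single complex multiplication
                a[l] = a[k] - t       # uses W^(m+N/2) = -W^m
                a[k] = a[k] + t
        span *= 2
```

Apart from the temporaries `w` and `t`, the routine needs no storage beyond the input array, which mirrors the "few working registers" noted in the text.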
Many different variations of the FFT algorithm are possible depending upon whether the input or output needs to be in bit-reversed order, the need for in-place computation, and the associated bookkeeping complexity. For example, if the input array is in natural order and in-place butterflies are retained, the resulting output array is in bit-reversed order. If both inputs and outputs need to be in natural order, then in-place computation is lost, and the bookkeeping (control code or circuitry) is more complex. Alternatively, a transpose network can be formed for a particular decimation-in-time (DIT) process by reversing the direction of each branch in the network. This produces a new class of FFT process known as the decimation-in-frequency (DIF) process.
As discussed above, the total load for an $N$-point DFT is $4N^2 - 2N$ additions and $4N^2$ multiplications. In contrast, the FFT algorithms require on the order of $N\log_2 N$ computations. Thus, for a 1024-point DFT, this is a reduction by a factor of $N^2$ over $N\log_2 N$, or roughly 100 to 1. Even with the 100 to 1 reduction, at a typical data rate of 20,000 samples per second, computing 200 1024-point FFTs per second corresponds to 8 million add and 8 million multiply operations per second, which is still a significant computational load even when a Pentium processor is utilized. Not surprisingly, signal processing applications with insatiable needs for processing power, such as radar, sonar, image processing and communications, cannot yet run in real time on personal computers.
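The figures in the preceding paragraph can be checked with a few lines of arithmetic. The Python sketch below is illustrative; the function name `fft_vs_dft` is hypothetical, and the factor of four real multiplications per complex operation is an assumption borrowed from the DFT count given earlier, since the text does not state how the 8 million figure was derived.

```python
def fft_vs_dft(n_points, ffts_per_second):
    """Compare N^2 against N*log2(N) and estimate real multiplications per second.

    Assumes n_points is a power of two.
    """
    stages = n_points.bit_length() - 1          # log2(N) for a power of two
    dft_ops = n_points ** 2                     # order of the direct DFT
    fft_ops = n_points * stages                 # order of the FFT
    reduction = dft_ops / fft_ops

    # ASSUMPTION: four real multiplications per complex operation.
    real_mults_per_second = 4 * fft_ops * ffts_per_second
    return reduction, real_mults_per_second
```

Under these assumptions, `fft_vs_dft(1024, 200)` gives a reduction factor of 102.4 and 8,192,000 real multiplications per second, in line with the "roughly 100 to 1" and "8 million" figures quoted in the text.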
Although the number of frequency domain applications is as large as the more conventional time domain applications, the difficulty of implementing frequency domain applications as well as the cost of the frequency domain implementation has limited the viability of solving problems in the frequency domain. Thus, an enhanced process for quickly and efficiently performing FFTs is needed. Furthermore, it is desirable to accelerate the speed of performing FFTs without adversely affecting the compatibility of the personal computer with the installed software base.