The present invention is directed to a method and apparatus for performing fast fourier transforms. More particularly the invention is directed to performing fast fourier transforms in a texturizing subsystem in a graphics processing device.
Computer generated graphics are being used today to perform a wide variety of tasks. Many different areas of business, industry, government, education, and entertainment are tapping into the enormous and rapidly growing list of applications developed for today's increasingly powerful computer devices.
Graphics have also been used for communicating ideas, data, and trends in most areas of commerce, science, and education. Until recently, real time user interaction with three dimensional (3D) model and pseudo-realistic images was feasible on only very high performance workstations. These workstations contain dedicated, special purpose graphics hardware. The progress of semiconductor fabrication technology has made it possible to do real time 3D animation with color shaded images of complex objects, described by thousands of polygons, on powerful dedicated rendering subsystems moving away from the need for a dedicated workstation. The most recent and most powerful workstations are capable of rendering completely life-like, realistically lighted 3D objects and structures. But other graphics processing devices are capable of object rendering which was previously only alterable using high performance work stations.
In addition to the generation of graphics, there has been a continuing development in image processing. Examples of imaging applications include electronic light tables (ELTs) which can be used for defense imaging, two dimensional (2D) and 3D visualization techniques for medical imaging or in other markets where image processing can be critical, e.g., pre-press and post-production to name a few.
In the past, it has been thought that dividing the image processing functionality and the graphics processing functionality into separate dedicated hardware was an advantageous architecture.
Co-pending application entitled A METHOD AND APPARATUS FOR PROVIDING IMAGE AND GRAPHICS PROCESSING USING A GRAPHICS RENDERING ENGINE, U.S. Ser. No. 956,537, filed on Oct. 23, 1997 describes a technique by which the graphics processing and image processing are performed with the same hardware. The description of the graphics rendering engine, including the image processing operations performed thereby, is incorporated herein by reference.
Fast Fourier Transform (FFT) is a key imaging operation in many imaging application domains, for example, medical imaging and image compression, in particular. Due to the special computational requirements of FFT, real-time FFT engines are traditionally implemented as black-box hardwares for special purposes. General purpose FFT sometimes needs the computational power of a supercomputer like a Cray-2.
A review of the mathematical background regarding FFTs will be illuminative of the complexities involved in implementing such an imaging operation.
Basic Formulations
The one dimensional Fourier transform of a continuous function f(x) is ##EQU1##
Intuitively speaking, the Fourier Transform decomposes a function into a set of sinusoidal basis functions at different frequencies. For each frequency u, F(u), a complex number (i.e., a number having real and imaginary components), tells you the magnitude and the phase of the associated sinusoidal function. Therefore, f(x) can be written as a summation of sinusoidal functions as follows ##EQU2##
In fact, we can reverse the role of f(x) and F(u), and call f(x) the Inverse Fourier Transform (IFT) of F(u).
The discrete version of the Fourier Transform of an array f(n) of size N is ##EQU3##
and the Inverse Discrete Fourier Transform (IDFT) of F(k) is ##EQU4##
The complex phase factor e.sup.-2.pi.j/N can be abbreviated as W.sub.N. We can rewrite Equation 4 as ##EQU5##
Computing F(k) directly from the formula above requires:
2N.sup.2 evaluations of trigonometric functions; and PA1 N(N-1) complex additions. PA1 Symmetry property: W.sub.N.sup.k+N/2 =-W.sub.N.sup.k PA1 Periodicity property: W.sub.N.sup.k+N =W.sub.N.sup.k PA1 Multiplicity property: W.sub.N.sup.Lk =W.sub.N/L.sup.k PA1 N(M+L+1) Complex multiplications; and PA1 N(M+L-2) Complex addition
The order of the complexity of the direct computations of DFT is N.sup.2, which is typically prohibitive for real-time applications.
It is known that the phase factor W.sub.N has the following properties:
These properties can be exploited to adopt a divide-and-conquer approach to calculate these Discrete Fourier transform (DFT). The computationally efficient algorithms that exploit these properties are known as Fast Fourier Transforms (FFTs) described as follows.
Suppose that N, the size of the f(n) array, can be factorized as a product of two integers N=L.times.M, then f(n) and F(k) can be arranged in two dimensional (2-D) arrays f(l, m) and F(q,p) as shown in FIG. 1 using the mapping EQU n=Ml+m, and k=Lq+p (6)
where EQU 0.ltoreq.l.ltoreq.L-1 0.ltoreq.m.ltoreq.M-1 EQU 0.ltoreq.p.ltoreq.L-1 0.ltoreq.q.ltoreq.M-1
The top half of the FIG. 1 shows the M (columns).times.L (rows) array of f(n) with an array size N=M.times.L. The bottom half of the figure shows the M.times.L array for the DFT F(k).
The DFT of f(n) can be reformulated as: ##EQU6##
But, EQU W.sub.N.sup.(Ml+m) (Lq+p) =W.sub.N.sup.MLql W.sub.N.sup.Mpl W.sub.N.sup.mqL W.sub.N.sup.mp
According to properties of the phase factor W.sub.N, as described above, EQU W.sub.N.sup.MLql =W.sub.N.sup.Nql =1, EQU W.sub.N.sup.Mpl W.sub.N/M.sup.pl =W.sub.L.sup.pl EQU W.sub.N.sup.(mqL) =W.sub.N/L.sup.mq =W.sub.M.sup.mq
Using these simplifications, ##EQU7##
For Example: N=12 N=4.times.3=12 select L=4, m=3
We would store f(n) as:
m = 0 m = 1 m = 2 col. 0 col. 1 col. 2 L = 0 row 0 f(0,0) = f(0) f(1,0) = f(1) f(2,0) = f(2) L = 1 row 1 f(0,1) = f(3) f(1,1) = f(4) f(2,1) = f(5) L = 2 row 2 f(0,2) = f(6) f(1,2) = f(7) f(2,2) = f(8) L = 3 row 3 f(0,3) = f(9) f(1,3) = f(10) f(2,3) = f(11) ##EQU8##
The computation of the DFT for f(n) can be divided into the following four steps: STEP 1. Compute the L-point DFT of each column m, for all m=0, . . . m-1 that is, we compute ##EQU9##
where, each point F(m, p) can be considered as a partially transformed function which is indexed by p, a frequency domain coordinate, and m, the same spatial domain coordinate F(m, p) storage can be represented as:
 F (0,0) F (1,0) F (2,0) F (0,1) F (1,1) F (2,1) F (0,2) F (1,2) F (2,2) F (0,3) F (1,3) F (2,3)
Step 2: Multiply each F (m, p) by W.sub.N.sup.mp where W.sub.N.sup.mp is referred to as the phase or twiddling factor so that: EQU G (m, p)=F (m,p) W.sub.N.sup.mp (10)
G(m, p) storage can be represented as:
 m = 0 m = 1 m = 2 p = 0 G (0,0) G (1,0) G (2,0) p = 1 G (0,1) G (1,1) G (2,1) p = 2 G (0,2) G (1,2) G (2,2) p = 3 G (0,3) G (1,3) G (2,3)
Step 3: Compute the M-point DFTs of each row p, where p=0, . . . L-1 ##EQU10##
The storage representation for the final DFT results in:
 1 = 0 F (0,0) = F (0) F (1,0) = F (4) F (2,0) = F (8) 1 = 1 F (0,1) = F (1) F (1,1) = F (5) F (2,1) = F (9) 1 = 2 F (0,2) = F (2) F (1,2) = F (6) F (2,2) = F (10) 1 = 3 F (0,3) = F (3) F (1,3) = F (7) F (2,3) = F (11)
STEP 4: Output F(k) by reading the resulting F(q, p) columnwise. The computational complexity for performing these FFT's involves
Thus, the computational requirements for this technique are smaller than those for computing DFTs directly.
A block diagram representation of a top-level view of this four step process is shown in FIG. 4. The data, that is the f(n) army of size N is arranged into an L.times.M matrix and is provided to the L-Point FFT, module 401, which performs step 1. The output of module 401 is provided to multiplier 402 by which the twiddling factor is applied as in step 2. The resulting product is presented to the M-point FFT module, 403 at which step 3 is performed. Finally, the transposer 404 reads out the result of the module 403 in a manner to provide output F(k) as in step 4.
Even more reduction of computations can be achieved by recursively factorizing L or M, or both, just as the N matrix was decomposed.
The first factor L can be selected to be small enough that the L-point FFT can be computed directly. When N is a power of some small number r, we can repeat the decomposition recursively until the computation involves only r-point FFTs. The algorithms are then called radix-r FFT, which can be constructed from r-point FFT building blocks. Radix-2 FFT is by far the most widely used FFT algorithms.
FIG. 4 provides a block diagram of a high level view of a system that performs the four step processing using a 4 point FFT as the L-point FFT module. Again, presume the matrix in question is of size N. When the number of data points N is a power of 4; that is N=4.sup.v we can further reduce the computational requirements by employing a radix -4 DFT algorithm. In this case M=4 and L=N/4 m,q=0, 1,2,3 and l,p=0, 1, . . . ,N/4-1 n=41+m ##EQU11##
So, the storage arrangement becomes EQU f(n)=f(m,l)=f(41+m) ##EQU12## ##EQU13##
Further, we define a "butterfly" to be that computation where, given two complex numbers say a and b, multiply b by W.sub.N.sup.r and then add and subtract the product from a to form two new complex numbers A and B.
As shown in FIG. 2, a butterfly involves one complex multiplication and two complex additions. Using EQN 12 and the butterfly notation, the expression for combining the ##EQU14##
point DFT's defines a radix-4 butterfly that can be expressed in matrix form as: ##EQU15##
or in butterfly notation as shown in FIG. 3.
Since W.sub.N.sup.0 =1, each butterfly involves 3 complex multiplications and 12 additions.
This process can be repeated recursively V times. Hence, the radix-4 algorithm consists of V stages, where each stage contains ##EQU16##
butterflies. Consequently, the computational complexity of this algorithm is reduced to ##EQU17##
log.sub.2 N complex multiplications and ##EQU18##
log.sub.2 N complex additions.
FIG. 5 illustrates a block diagram, similar to that of FIG. 4 but reflecting the use of a 4-point FFT module 501. Also, the M-point FFT 403 is replaced by the N/4-point FFT module 503.
An array of size N can be arranged in a hypercube with log.sub.4 N dimensions, and treat the N-point FFT as log.sub.4 N passes of 4-point FFTs along each of the dimensions of the hypbercube. FIG. 6 illustrates an example of such a hypercube for a 64-point FFT.
For a 64-point FFT, N=64, the computation can be considered as 3 passes of 16 (N/4, N=64) 4-point FFTs along each of the dimensions on a 4.times.4.times.4 cube. The inputs are stored in x-y-z order, and the outputs are read in z-y-x order. The order of the dimensions in which the output are retrieved is reversed from that in which the data are stored. For general N-point FFT using radix-4 algorithm, the hypercube metaphor can be explained mathematically as follows. Let p=log.sub.4 N, then the index n to the array f(n) can be represented in binary form as: EQU n=b.sub.p-1 b.sub.p-2 . . . b.sub.1 b.sub.0 (13)
where b.sub.i =0 or 1 for i=0, . . . , p-1
By grouping every two bits, we have EQU n=d.sub.q-1 d.sub.q-2 . . . d.sub.1 d.sub.0 (14)
where d.sub.i =b.sub.2i+1 b.sub.2i, and q=p/2.
For example, for n=b.sub.7, b.sub.6, . . . , b.sub.1, b.sub.0 then d.sub.3 =(b.sub.7, b.sub.6), d.sub.2 =(b.sub.5, b.sub.4) etc.
From here, one-dimensional f(n) can be represented as q-dimensional EQU f (d.sub.0, d.sub.1 , . . . d.sub.q-2,d.sub.q-1) (15)
The index k of the output F(k) corresponding to f(n) is related to n by EQU k=d.sub.0 d.sub.1 . . . d.sub.q-2 d.sub.q-1 (16)
We call the reversal of the pairs of bits as nibble reversal.
Comparing the complexity of the various DFT algorithms discussed it is seen that:
TABLE 1 Comparing the orders of complexity of different DFT algorithms. Order of Direct Complexity Computation L .times. M Decomp. Radix-r Complex N.sup.2 N (M + L) N(rlog.sub.r N) Multiplication Complex N.sup.2 N (M + L) N(rlog.sub.r N) Addition For example, N = 64, M = 8, L = 8, r = 4
TABLE 2 Order of Direct Complexity Computation L .times. M Decomp. Radix-r Complex 64.sup.2 = 4096 64 .times. 16 = 1024 64(4log.sub.4 64) = 768 Multiplication Complex 64.sup.2 = 4096 64 .times. 16 = 1024 64(4log.sub.4 64) = 768 Addition
Thus, the radix r FFT calculation clearly reduces the number of computations.
It would be beneficial if a technique could be developed to more efficiently perform radix r FFTs, particularly a technique that takes advantage of existing hardware thereby avoiding the need to develop or employ a dedicated digital signal processor, such as the Intel I860, to perform these transforms.