In the last decade, many applications optimized for vector computers have moved to microprocessor-based computers, or MCs. The reason for the move is two-fold: microprocessors are cheap, and the peak performance of the fastest microprocessor is converging to the peak performance of the fastest vector processor. However, obtaining a speed close to the peak speed for large applications on MCs can be difficult. Vector computer memory systems are designed to deliver one or more elements to the vector registers per clock period, while MCs are not designed to deliver as much bandwidth from their memory systems. Instead, MCs rely on data caches to allow fast access to the most recently used data. Applications that have very high reuse of the data in cache run well. For problems that have low data reuse, the performance is typically determined solely by the ability of the processor to load data from memory. Consequently, the goal for most algorithms is to access the memory system as infrequently as possible.
The fast Fourier transform (FFT) is an efficient way to calculate power-of-two one-dimensional discrete Fourier transforms (DFTs). The FFT is described in the article by Cooley, J. W., and Tukey, J. W., entitled "An algorithm for the machine calculation of complex Fourier series", Math. Comp., 19: 297-301, 1965, which is hereby incorporated by reference. Different algorithms are needed depending upon whether the datasets fit in cache. For datasets that are cache resident, the algorithms that have been developed for vector computers frequently work well in MC systems. This is because the cache on many microprocessors can deliver more bandwidth than their floating-point functional units can consume, which is similar to the case for vector computers. For example, the Stockham-based algorithm, as described in the article by Cochran et al., entitled "What is the fast Fourier transform?", IEEE Trans. Audio and Electroacoustics, AU-15: 45-55, 1967, which is hereby incorporated by reference, works well in this situation. Other variants of the Stockham algorithm exist, as do other types of vector computer algorithms. These types of techniques work well for datasets that fit in cache.
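For illustration only, a Stockham-style autosort FFT can be sketched in Python as follows. This is one common radix-2 formulation and is not taken from the cited references. The function name stockham_fft, the forward sign convention (negative exponent), and the use of two ping-pong buffers are assumptions made for this sketch.

```python
import cmath

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (illustrative sketch, power-of-two length).

    Each pass reads one buffer and writes the other, so no separate
    bit-reversal pass is needed; the output is in natural order.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]   # copy so the caller's data is untouched
    dst = [0j] * n
    m = n        # size of the sub-transforms handled in the current pass
    s = 1        # stride: number of interleaved sub-transforms
    while m > 1:
        h = m // 2
        for p in range(h):
            w = cmath.exp(-2j * cmath.pi * p / m)   # twiddle for this butterfly
            for q in range(s):
                u = src[q + s * p]
                v = src[q + s * (p + h)]
                dst[q + s * (2 * p)] = u + v
                dst[q + s * (2 * p + 1)] = (u - v) * w
        src, dst = dst, src         # ping-pong the two buffers
        m = h
        s *= 2
    return src
```

Every pass in this sketch sweeps the entire dataset, which is inexpensive while the data is cache resident.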
However, when a problem exceeds cache size, the performance of Stockham-based or other vector algorithms decreases dramatically on MCs. Consequently, one goal for large FFTs is to minimize the number of cache misses. Prior art techniques have attempted to reduce cache misses, such as the four-step and six-step approaches described in the article by Bailey, D., entitled "FFTs in External or Hierarchical Memory", J. Supercomputing, 4: 23-35, 1990, which is hereby incorporated by reference. Other variants of these approaches exist. The basic four-step approach is composed of row simultaneous FFTs, a transpose, and a twiddle multiply. The basic six-step approach is composed of column simultaneous FFTs, three transposes, and a twiddle multiply. Although the six-step approach requires three data transposes, each column FFT may fit in the data cache.
One formulation of the six-step approach takes an input array X of size n=k×m, a work array Y of size n, and an array of roots of unity (twiddle factors) U of size n, and comprises the following steps:
1. transpose X(k, m) to Y(m, k).
2. k simultaneous FFTs of length m using Y.
3. transpose Y(m, k) to X(k, m).
4. twiddle factor multiply U(k, m) × X(k, m) = Y(k, m).
5. m simultaneous FFTs of length k using Y.
6. transpose Y(k, m) to X(m, k).
Although the individual FFTs may fit in cache, many misses may occur in the other steps. The number of misses is nevertheless lower than for the Stockham-based FFTs. Note that the FFT steps are short, one-dimensional contiguous FFTs. Steps 1 through 6 together are mathematically equivalent to performing one large FFT on the input data in array X of size n.
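The six steps listed above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of any cited reference. It assumes column-major (Fortran-style) storage, so that element (r, c) of a k×m array sits at offset r + k·c and each column is contiguous; with row-major storage the roles of k and m in the two FFT steps are interchanged. It also assumes a forward transform with a negative exponent. The helper names dft, transpose, and column_ffts are hypothetical, and the naive dft merely stands in for a short, cache-resident FFT.

```python
import cmath

def dft(v):
    """Naive O(len^2) DFT; a stand-in for a short cache-resident FFT."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * p / n) for j in range(n))
            for p in range(n)]

def transpose(a, rows, cols):
    """Transpose a column-major rows x cols array held in a flat list."""
    out = [0j] * (rows * cols)
    for c in range(cols):
        for r in range(rows):
            out[c + cols * r] = a[r + rows * c]
    return out

def column_ffts(a, rows, cols):
    """Transform each contiguous column (of length rows) independently."""
    out = []
    for c in range(cols):
        out.extend(dft(a[c * rows:(c + 1) * rows]))
    return out

def six_step_fft(x, k, m):
    """Six-step FFT of length n = k*m, assuming column-major storage."""
    n = k * m
    if len(x) != n:
        raise ValueError("len(x) must equal k*m")
    X = [complex(v) for v in x]        # X(k, m)
    Y = transpose(X, k, m)             # step 1: X(k, m) -> Y(m, k)
    Y = column_ffts(Y, m, k)           # step 2: k FFTs of length m
    X = transpose(Y, m, k)             # step 3: Y(m, k) -> X(k, m)
    # step 4: twiddle multiply, U(r, c) = exp(-2*pi*i*r*c/n)
    Y = [cmath.exp(-2j * cmath.pi * (i % k) * (i // k) / n) * X[i]
         for i in range(n)]
    Y = column_ffts(Y, k, m)           # step 5: m FFTs of length k
    return transpose(Y, k, m)          # step 6: Y(k, m) -> X(m, k)
```

For example, six_step_fft(x, 3, 4) on a 12-element input should agree with dft(x) to within rounding error, which reflects the equivalence of the six steps to one large transform.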
FIG. 1 graphically depicts the data movement in the six-step algorithm. The input is the vector X 11, which is represented as an array of size n, where n equals m×k. The first step 10 transposes the data in array X 11 into a work array Y 12, also of size n. Thus, the data in position i,j of array X 11 becomes the data in position j,i of array Y 12. The transposition is performed so that the data will have better cache characteristics, i.e. the data in the cache is more likely to be reused. The second step 13 performs m short, one-dimensional contiguous FFTs, each of length k, on the data in array Y 12. Further note that short FFTs are usually about the square root in size of the original problem, and thus are more likely to fit into cache. The third step 14 transposes the data back from array Y 12 to array X 11. The fourth step 15 multiplies the X array by the coefficients of array U 16, which has a size n, and stores the results into the work array Y 12. The array U 16 comprises the twiddle factors, which are previously calculated trigonometric coefficients that are used for scaling. Note that in steps 4 through 6, the work array Y 12 is considered to be a k×m array, instead of an m×k array as in steps 1 through 3. This ensures that the step 5 FFTs are contiguous in memory. The fifth step 17 performs k FFTs, each of length m, similar to step 2, on the data stored in array Y 12. To complete the transform, the sixth step 18 transposes the data back into array X 11 from array Y 12. This implies that in step 6, the array X 11 is considered to be an m×k array, instead of a k×m array as in steps 1 through 3.
Since X 11, Y 12, and U 16 are all of size n, and all will be loaded into cache from the memory, cache misses would typically occur only when n is greater than 1/3 of the cache size. FIG. 2 depicts typical size relationships of cache 21 and memory 22, where the memory 22 is shown to be much larger than the cache 21. Each block of data in the memory is the same size as or larger than the cache; thus, as additional blocks of data are loaded into cache, the prior data is overwritten. Each time a block of data is loaded into cache, cache misses occur. These cache misses are unavoidable, as the data must be moved from the memory into cache. FIG. 3 depicts the size relationships between the cache 21 and the memory 22 where the data blocks X 31 and Y 32 are larger than the cache. Therefore, as each portion of the block X 31 is loaded into cache, cache misses are accrued from the data movement. Note that cache misses occur when data is moved from memory to cache, as well as from cache to memory. Typically, when data is stored, the data must first be loaded from memory to cache and then stored back to the memory, thus accruing double the misses that a load accrues, as two data movements have occurred. For example, in storing data to Y 32, the cache first loads 34 the data into cache, and then stores 33 the data back into memory. Therefore, the mechanism of FIG. 1 will incur many cache misses if n exceeds 1/3 the size of the cache, since the mechanism of FIG. 1 uses three arrays, X, Y, and U, all of size n.
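The 1/3 threshold can be illustrated with a short working-set calculation in Python. This is a rough sketch only: the function names are hypothetical, the 16-byte element size (a double-precision complex value) is an assumption, and effects such as cache line size and associativity are ignored.

```python
def working_set_fits(n, cache_bytes, element_bytes=16, arrays=3):
    """True if the X, Y, and U arrays of n elements each fit in cache together."""
    return arrays * n * element_bytes <= cache_bytes

def largest_resident_n(cache_bytes, element_bytes=16, arrays=3):
    """Largest n for which all three arrays are simultaneously cache resident."""
    return cache_bytes // (arrays * element_bytes)
```

Under these assumptions, a 1 MiB cache (1,048,576 bytes) holds the three arrays for n up to 21,845 elements, so a power-of-two transform of length 16,384 stays resident while one of length 32,768 does not.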
Therefore, there is a need in the art for an FFT mechanism that reduces the number of cache misses that occur when the data block is larger than 1/3 of the size of the cache.