The multidimensional Fast Fourier Transform (FFT) is a widely-used computational tool. For example, the FFT has been applied to solve problems in signal processing, applied mechanics, sonics, acoustics, biomedical engineering, radar, communications, and the analysis of stock market data.
However, when a multidimensional FFT algorithm is implemented in a program code that is executed in a data processing system, the FFT computation typically accounts for a substantial percentage of the run-time of the program code. A conventional approach to reduce the computation time is to parallelize the FFT computation to execute concurrently on more than one processor.
Given a matrix F that has M rows and N columns, a typical approach to compute the FFT of matrix F is shown in the following steps:
1.) perform N one-dimensional FFTs of length M on the columns of the original matrix;
2.) transpose the resultant matrix;
3.) perform M one-dimensional FFTs of length N on the columns of the transposed matrix; and
4.) transpose the new resultant matrix.
Another conventional approach to compute the FFT of matrix F is shown in the steps below:
1.) perform M one-dimensional FFTs of length N on the rows of the original matrix;
2.) transpose the resultant matrix;
3.) perform N one-dimensional FFTs of length M on the rows of the transposed matrix; and
4.) transpose the new resultant matrix.
Both of these conventional FFT algorithms suffer from a performance problem on parallel-processing systems. Namely, when a processor completes one of the steps, the processor typically stops and waits for the other processors to complete that step before proceeding to the next step. For example, a processor does not begin transposing the matrix in step 2 until all processors have completed performing the FFTs in step 1; step 3 does not begin until all processors have completed step 2; and step 4 does not begin until all processors have completed step 3. Therefore, in a highly-parallel system, hundreds of processors may go idle waiting for the last processor to finish one of the steps. This, in turn, leads to a substantial loss of computational efficiency.