1. Technical Field of the Invention
The present invention generally relates to the field of distributed-memory message-passing parallel multi-node computers and associated system software, as applied for example to computations in the fields of science, mathematics, engineering and the like. More particularly, the present invention is directed to a system and method for efficient implementation of a multidimensional Fast Fourier Transform (i.e., “FFT”) on a distributed-memory parallel supercomputer.
2. Description of the Prior Art
Linear transforms, such as the Fourier Transform (i.e., “FT”), have been widely used for solving a range of problems in the fields of science, mathematics, engineering and the like. The FT transforms a given problem into one that may be more easily solved, and the FT is used in many different applications. For example, for a system of N variables, the FT essentially represents a change of the N variables from coordinate space to momentum space, where the new value of each variable depends on the values of all the old variables. Such a system of N variables is usually stored on a computer as an array of N elements. The FT is commonly computed using the Fast Fourier Transform (i.e., “FFT”). The FFT is described in many standard texts, such as Numerical Recipes by Press, et al. (“Numerical Recipes in Fortran”, pages 490-529, by W. H. Press, S. A. Teukolsky, W. A. Vetterling and B. P. Flannery, Cambridge University Press, 1986, 1992, ISBN: 0-521-43064-X). Most computer manufacturers provide library function calls that optimize the FFT for their specific processor. For example, the FFT is fully optimized on IBM's RS/6000 processor in the Engineering and Scientific Subroutine Library. These library routines require that the data (i.e., the foregoing elements) necessary to perform the FFT be resident in a memory local to a node.
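The statement that each new value depends on all of the old values can be illustrated with a minimal sketch. The function names `dft` and `fft`, the sample `signal`, and the negative-exponent sign convention are assumptions made for illustration only; they do not come from the cited text or from any vendor library. The direct transform sums over every input element for every output element, while the recursive radix-2 version (which assumes the length is a power of two) produces the same values with fewer operations.

```python
import cmath

def dft(x):
    """Direct O(N^2) transform: every output element uses every input element."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 FFT. Assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of the even-indexed elements
    odd = fft(x[1::2])    # transform of the odd-indexed elements
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# Hypothetical array of N = 8 elements.
signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 3.0]
max_diff = max(abs(a - b) for a, b in zip(dft(signal), fft(signal)))
```

Here `max_diff` measures the agreement between the two methods; it should differ from zero only by floating-point rounding.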
In a multidimensional FFT, N elements of a multidimensional array are distributed in a plurality of dimensions across nodes of a distributed-memory parallel multi-node computer. Many applications that execute on distributed-memory parallel multi-node computers spend a large fraction of their execution time on calculating the multidimensional FFT. Since a motivation for distributed-memory parallel multi-node computers is faster execution, fast calculation of the multidimensional FFT for the distributed array is of critical importance. The N elements of the array are initially distributed across the nodes in some arbitrary fashion particular to an application. To calculate the multidimensional FFT, the array of elements is then re-distributed such that a portion of the array on each node consists of a complete row of elements in the x-dimension. A one-dimensional FFT on each row in the x-dimension on each node is then performed. Since the row is local to a node and since each one-dimensional FFT on each row is independent of the others, the one-dimensional FFT performed on each node requires no communication with any other node and may be performed using the above-mentioned library routines. After the one-dimensional FFT, the array elements are re-distributed such that a portion of the array on each node consists of a complete row in the y-dimension. Thereafter, a one-dimensional FFT on each row in the y-dimension on each node is performed. If there are more than two dimensions for the array, then the re-distribution and a one-dimensional FFT are repeated for each successive dimension of the array beyond the x-dimension and the y-dimension. The resulting array may be re-distributed in some arbitrary fashion particular to the application.
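The sequence of steps just described can be sketched for a two-dimensional array on a single machine, with plain Python lists standing in for the memories of the nodes. All names here (`scatter_rows`, `gather_rows`, `distributed_fft2`, `direct_dft2`) are hypothetical, the one-dimensional transform is a simple direct transform rather than an optimized library routine, and the re-distribution is imitated by gathering and re-scattering the array rather than by real message passing. The sketch also assumes that both array dimensions are divisible by the number of nodes.

```python
import cmath

def dft(row):
    """One-dimensional transform of a row that is entirely local to a node."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def scatter_rows(matrix, num_nodes):
    """Give each simulated node an equal, contiguous set of complete rows."""
    per_node = len(matrix) // num_nodes
    return [matrix[p * per_node:(p + 1) * per_node] for p in range(num_nodes)]

def gather_rows(nodes):
    """Reassemble the global array from the rows held by the nodes."""
    return [row for node in nodes for row in node]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def distributed_fft2(matrix, num_nodes):
    # Step 1: each node holds complete rows in the x-dimension and transforms them.
    nodes = scatter_rows(matrix, num_nodes)
    nodes = [[dft(row) for row in node] for node in nodes]
    # Step 2: re-distribute so each node holds complete rows in the y-dimension.
    nodes = scatter_rows(transpose(gather_rows(nodes)), num_nodes)
    nodes = [[dft(row) for row in node] for node in nodes]
    # Step 3: return to the original layout (rows indexed by y, elements by x).
    return transpose(gather_rows(nodes))

def direct_dft2(matrix):
    """Reference two-dimensional transform computed straight from the definition."""
    ny, nx = len(matrix), len(matrix[0])
    return [[sum(matrix[y][x] * cmath.exp(-2j * cmath.pi * (ky * y / ny + kx * x / nx))
                 for y in range(ny) for x in range(nx))
             for kx in range(nx)]
            for ky in range(ny)]

# Hypothetical 4 x 4 array spread over 2 simulated nodes.
data = [[float((3 * y + 5 * x) % 7) for x in range(4)] for y in range(4)]
result = distributed_fft2(data, 2)
reference = direct_dft2(data)
max_error = max(abs(result[ky][kx] - reference[ky][kx])
                for ky in range(4) for kx in range(4))
```

The two list comprehensions that call `dft` touch only rows held by one node, which mirrors the observation that the one-dimensional transforms need no communication; only the step between them moves data between nodes.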
The treatment of the x-dimension and the y-dimension in sequence is not fundamental to the multidimensional FFT. Instead, the dimensions of the array may be treated in any order. For some applications or some computers, some orders may be more efficient and thus execute faster than other orders. For example, the initial distribution of the array across the nodes, which is in some arbitrary fashion particular to the application, may coincide with the distribution necessary for the one-dimensional FFTs in the y-dimension. In this case, it may be fastest for the multidimensional FFT to treat the y-dimension first, before treating the x-dimension and any other remaining dimensions.
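That the dimensions may be treated in any order can be checked numerically with a small sketch. The helper names `transform_x` and `transform_y` and the sample `data` are assumptions for illustration, and a direct one-dimensional transform is used in place of an optimized FFT routine.

```python
import cmath

def dft(row):
    """Direct one-dimensional transform of a single row."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def transform_x(matrix):
    """Transform every row along the x-dimension."""
    return [dft(row) for row in matrix]

def transform_y(matrix):
    """Transform every column along the y-dimension."""
    return transpose([dft(col) for col in transpose(matrix)])

# Hypothetical 4 x 8 array (4 rows in y, 8 elements in x).
data = [[float((2 * y + 3 * x) % 5) for x in range(8)] for y in range(4)]

x_then_y = transform_y(transform_x(data))
y_then_x = transform_x(transform_y(data))
max_diff = max(abs(x_then_y[r][c] - y_then_x[r][c])
               for r in range(4) for c in range(8))
```

Both orders should agree up to floating-point rounding, so the choice between them can be made on the basis of which initial distribution is already in place.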
In the implementation of the multidimensional FFT described above, each re-distribution of the array between the one-dimensional FFTs is an example of an “all-to-all” communication or re-distribution. In the all-to-all re-distribution, each node of the distributed-memory parallel multi-node computer sends unique data (i.e., elements of the array) to all other nodes utilizing a plurality of packets. As mentioned above, fast calculation of the multidimensional FFT on the distributed-memory parallel multi-node computer is of critical importance. In the implementation described above, a large fraction of the execution time is typically spent re-distributing the array across the nodes of the distributed-memory parallel multi-node computer. More particularly, that time is spent on the “all-to-all” re-distribution of elements of the array across the nodes.
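The all-to-all re-distribution can be sketched as a simulation in which nested lists stand in for node memories and for the data sent between nodes. The names `all_to_all` and `redistribute_rows_to_columns` are hypothetical, each node sends a single block per destination where a real machine would use a plurality of packets, and the number of columns is assumed to be divisible by the number of nodes.

```python
def all_to_all(send_buffers):
    """send_buffers[src][dst] is the unique block that node src sends to node dst.

    Returns recv_buffers, where recv_buffers[dst][src] is that same block.
    """
    p = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(p)] for dst in range(p)]

def redistribute_rows_to_columns(nodes):
    """nodes[p] holds complete rows; the result holds complete columns per node."""
    p = len(nodes)
    n_cols = len(nodes[0][0])
    cols_per_node = n_cols // p
    # Each source node cuts its local rows into one block per destination node.
    send = [[[row[dst * cols_per_node:(dst + 1) * cols_per_node]
              for row in nodes[src]]
             for dst in range(p)]
            for src in range(p)]
    recv = all_to_all(send)
    result = []
    for dst in range(p):
        # Stack the blocks in source order: all rows, restricted to local columns.
        stacked = [row for src in range(p) for row in recv[dst][src]]
        # Transpose so that each local row is one complete column of the array.
        result.append([list(col) for col in zip(*stacked)])
    return result

# Hypothetical 4 x 4 array with element value 10 * row + column, on 2 nodes.
matrix = [[10 * r + c for c in range(4)] for r in range(4)]
nodes_before = [matrix[0:2], matrix[2:4]]   # each node holds two complete rows
nodes_after = redistribute_rows_to_columns(nodes_before)
# nodes_after[0] == [[0, 10, 20, 30], [1, 11, 21, 31]]
# nodes_after[1] == [[2, 12, 22, 32], [3, 13, 23, 33]]
```

In this sketch every node contributes a block to every other node, so the volume of data exchanged grows with both the array size and the number of nodes, which is consistent with the re-distribution accounting for a large share of the execution time.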
Therefore, there is a need in the art for a system and method for efficiently implementing the multidimensional FFT on the distributed-memory parallel supercomputer. In particular, there is a need in the art for a system and method for efficiently implementing the “all-to-all” re-distribution on the distributed-memory parallel supercomputer, so that the multidimensional FFT may in turn be implemented efficiently.