Discrete Fourier transforms (DFTs) are used extensively in a range of computing applications that include, but are not limited to, audio processing, image processing, medical imaging and molecular dynamics simulations. The dimensionality of the transform typically reflects the dimensionality of the underlying computing problem: DFTs of audio samples are performed in one dimension (e.g. time), DFTs of planar images are performed in two dimensions and DFTs of three-dimensional input data such as volumes are performed in three dimensions.
The Cooley-Tukey Fast Fourier Transform (FFT) algorithm is an algorithm for evaluating one-dimensional DFTs using a computer. In its radix-2 form, it divides an input data sequence of length N into two smaller sequences of length N/2, transforms them independently and recombines the results, applying the same division recursively to each smaller sequence. The algorithm reduces asymptotic complexity from O(N^2) for explicit evaluation of the DFT to O(N·log(N)) for specific values of N, such as N being a power of two. A common property of FFT algorithms performed in-place is the production of data in bit-reversed order, where the bits of the input index expressed as a binary number appear in reverse order in the output index, so that the least-significant bit of the input index becomes the most-significant bit of the output index, for example as described in Handbook of Real-Time Fast Fourier Transforms, by W. W. Smith and J. M. Smith (New York, USA: IEEE Press, 1995).
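The bit-reversed output ordering described above can be observed with a minimal sketch of an in-place radix-2 decimation-in-frequency FFT. This is an illustration only, not an implementation taken from the cited handbook; the function names naive_dft, bit_reverse and fft_dif_in_place are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath

def naive_dft(x):
    """Explicit O(N^2) evaluation of the DFT definition (reference only)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (LSB becomes MSB)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def fft_dif_in_place(x):
    """In-place radix-2 decimation-in-frequency FFT.

    Input is in natural order; the output is left in bit-reversed order.
    Assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    span = n // 2
    while span >= 1:
        # Each block of 2*span elements is split into two halves of length span.
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))  # twiddle factor
                a = x[start + j]
                b = x[start + j + span]
                x[start + j] = a + b
                x[start + j + span] = (a - b) * w
        span //= 2
    return x

if __name__ == "__main__":
    samples = [complex(v) for v in range(8)]
    scrambled = fft_dif_in_place(list(samples))
    reference = naive_dft(samples)
    # Position p of the in-place result holds frequency bin bit_reverse(p).
    for p in range(8):
        assert abs(scrambled[p] - reference[bit_reverse(p, 3)]) < 1e-9
```

A routine that must deliver natural-order output, as the pencil technique described below requires, would follow this with a reordering pass that swaps each position p with bit_reverse(p).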
DFTs of multiple dimensions are often evaluated by performing DFTs of smaller dimensionality along independent dimensional axes, a process referred to herein as “decomposition.” Due to their often large size and stringent computing-time requirements, the computation of multidimensional DFTs is frequently distributed across a plurality of processors.
The “pencil” decomposition technique evaluates multidimensional DFTs using a series of one-dimensional transforms applied in-place to “pencils” of data along each dimension successively (for example as described in “A Volumetric FFT for BlueGene/L” by M. Eleftheriou, J. E. Moreira, B. G. Fitch, and R. S. Germain, in HiPC, 2003, pp. 194-203). The term “pencil” refers to a one-dimensional N-element array of contiguous elements taken along a specific dimensional axis of the multidimensional transform data, where N is the length of the transform data in the dimension under consideration. This technique typically utilizes packaged one-dimensional FFT routines and requires that the output of each transform be produced in natural order.
The transformation of pencils along a given dimensional axis can occur independently of each other and can be distributed across multiple processors and evaluated in parallel to reduce execution time.
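As a minimal single-process sketch of the pencil technique, the following applies a one-dimensional transform to every pencil along each axis of a three-dimensional nested list in turn. The names dft_1d, pencil_coords, transform_axis and dft_3d_pencil are hypothetical, and dft_1d is an explicit DFT used as a stand-in for a packaged FFT routine that returns natural-order output. No distribution across processors is shown; each iteration of the two outer loops in transform_axis is independent and is the unit of work that could be assigned to a separate processor.

```python
import cmath

def dft_1d(pencil):
    """Stand-in for a packaged 1-D FFT routine; output is in natural order."""
    n = len(pencil)
    return [sum(pencil[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def pencil_coords(axis, u, v, t):
    """Map pencil identifier (u, v) and position t along `axis` to (i, j, k)."""
    idx = [u, v]
    idx.insert(axis, t)
    return tuple(idx)

def transform_axis(data, axis):
    """Transform every pencil of a 3-D nested list along `axis`, in place."""
    dims = (len(data), len(data[0]), len(data[0][0]))
    other = [d for a, d in enumerate(dims) if a != axis]
    for u in range(other[0]):
        for v in range(other[1]):
            # Gather one pencil, transform it, and write the result back.
            coords = [pencil_coords(axis, u, v, t) for t in range(dims[axis])]
            pencil = [data[i][j][k] for i, j, k in coords]
            for (i, j, k), value in zip(coords, dft_1d(pencil)):
                data[i][j][k] = value

def dft_3d_pencil(data):
    """Evaluate a 3-D DFT as successive 1-D transforms along each axis."""
    for axis in (2, 1, 0):
        transform_axis(data, axis)
    return data

if __name__ == "__main__":
    volume = [[[1.0 for _ in range(2)] for _ in range(2)] for _ in range(2)]
    result = dft_3d_pencil(volume)
    # A constant input concentrates all energy in the zero-frequency bin.
    assert abs(result[0][0][0] - 8) < 1e-9
```

Because the multidimensional DFT is separable, the order in which the axes are processed does not change the result; the order (2, 1, 0) here is an arbitrary choice.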
A critical operation that limits the scalability of this approach is the transpose of multidimensional data that must be performed between the evaluations of one-dimensional FFTs along each dimension. The transpose requires either communication of multidimensional data between one or more central nodes and a plurality of FFT processors during each transpose step or communication of multidimensional data directly between FFT processing nodes in an all-to-all fashion. Both cases require the transpose operation to re-order the multidimensional data such that each FFT processing node has local access to all of the data required for a given pencil prior to initiating the transformation for that pencil. The one-dimensional FFT algorithm may also impose additional requirements on the spatial locality and ordering of input data in memory to achieve peak processing throughput. To satisfy these requirements, the transpose operation must perform multiple non-contiguous memory accesses to reorganize multidimensional data into localized, contiguous one-dimensional pencils. Memory accesses of this nature can overwhelm the caching structures of modern processors, especially for multidimensional transforms of large dimensions. Processors tasked with evaluating FFTs are stalled waiting to receive new data while the transpose operations are in progress.
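The non-contiguous access pattern described above can be made concrete with a small sketch. It assumes, purely for illustration, that the three-dimensional data is stored as a flat row-major buffer with the z index varying fastest; the function names x_pencil_offsets and transpose_xyz_to_yzx are hypothetical. Under that assumption, the elements of a pencil along x are separated by a stride of ny·nz elements, and a local transpose is needed to make them contiguous.

```python
def x_pencil_offsets(j, k, dims):
    """Flat offsets of the x-pencil at (j, k) in a row-major (x, y, z) buffer."""
    nx, ny, nz = dims
    return [(i * ny + j) * nz + k for i in range(nx)]

def transpose_xyz_to_yzx(buf, dims):
    """Reorder a flat (x, y, z) buffer into (y, z, x) order.

    After the reorder, every former x-pencil occupies nx consecutive slots.
    Each write below reads from a location ny*nz elements away from the
    previous one, which is the strided access pattern that stresses caches.
    """
    nx, ny, nz = dims
    out = [None] * len(buf)
    for j in range(ny):
        for k in range(nz):
            for i in range(nx):
                out[(j * nz + k) * nx + i] = buf[(i * ny + j) * nz + k]
    return out

if __name__ == "__main__":
    dims = (2, 3, 4)
    buf = list(range(2 * 3 * 4))          # element value equals its flat offset
    offsets = x_pencil_offsets(1, 2, dims)
    transposed = transpose_xyz_to_yzx(buf, dims)
    start = (1 * 4 + 2) * 2               # first slot of pencil (j=1, k=2)
    assert transposed[start:start + 2] == [buf[o] for o in offsets]
```

In a distributed setting the same reordering also involves communication between nodes, which this single-buffer sketch does not model.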
Other decomposition strategies have been formulated to reduce the communication and memory access overhead of transpose operations. The slab decomposition technique for three-dimensional DFTs eliminates one of the two transpose operations required by the pencil decomposition technique by treating data from two of the three dimensions of the transformation input data elements as a series of rectangular planes, and distributing the computation of the planes (or “slabs”) across multiple processors. These slabs are transformed locally by the processors to which they are assigned by using two-dimensional FFT libraries, by using one-dimensional FFT libraries capable of in-place transformations using strided memory accesses, or by using one-dimensional FFT libraries and performing the two-dimensional transpose operation locally. This approach assumes that the data for each plane, or “slab,” is sufficiently localized and small enough to fit into a cache so that the transform of the slab along two dimensions can be computed efficiently by a processor. The drawback of this approach is that its scalability is bounded by the number of elements along the dimension over which the slabs are distributed, since no more processors can be kept busy than there are slabs.
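The scalability bound of slab decomposition can be illustrated by counting how many processors receive work. The sketch below assumes a simple contiguous block partition of work items; the function names partition and busy_processors, and the 64×64×64 problem size, are hypothetical choices for illustration.

```python
def partition(count, num_procs):
    """Split `count` work items into contiguous blocks, one per processor."""
    base, extra = divmod(count, num_procs)
    blocks, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def busy_processors(count, num_procs):
    """Number of processors that are assigned at least one work item."""
    return sum(1 for block in partition(count, num_procs) if len(block) > 0)

if __name__ == "__main__":
    nx = ny = nz = 64
    procs = 1024
    # Slab decomposition: one work item per plane along the distributed axis.
    print("slabs:  ", busy_processors(nx, procs))
    # Pencil decomposition: one work item per pencil along a given axis.
    print("pencils:", busy_processors(ny * nz, procs))
```

With these assumed sizes, only 64 of the 1024 processors receive a slab, whereas all 1024 receive pencils.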
The volumetric decomposition strategy improves upon the pencil decomposition strategy by subdividing a three-dimensional input dataset over a three-dimensional mesh of processors interconnected using a toroidal network topology (for example as described in “A Volumetric FFT for BlueGene/L” by M. Eleftheriou, J. E. Moreira, B. G. Fitch, and R. S. Germain, in HiPC, 2003, pp. 194-203). Each processor is initially assigned a contiguous sub-cube of the input data. The computation begins with a transpose operation along one of the three axes that redistributes the data such that each processor contains an equal number of pencils to transform locally. This approach is superior to pencil decomposition because the transpose operation can occur efficiently across the links of the toroidal network connections. Data locality is also improved over the original pencil decomposition strategy as smaller amounts of data are being handled by each processor. Every processor is responsible for arranging data received from other processors into its local memory and performing the one-dimensional FFTs. Once this process has completed, another transpose is performed to restore the transformed data to its initial distribution. The same sequence is carried out along the other two axes. This strategy reduces the cost of the transpose operations of the pencil decomposition technique while retaining its scalability. However, twice as many transpose operations must be performed with volumetric decomposition and a specific network topology is required.
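The initial data distribution of the volumetric strategy can be sketched as a mapping from element indices to processor coordinates on a three-dimensional mesh, together with the wraparound neighbor relation of a toroidal network. This is an illustrative assumption rather than the layout used in the cited work: the function names owner and torus_neighbors are hypothetical, and each data dimension is assumed to be evenly divisible by the corresponding mesh dimension.

```python
def owner(index, dims, mesh):
    """Mesh coordinate of the processor holding element `index`.

    Each processor is assigned a contiguous sub-cube; `dims` is assumed to be
    evenly divisible by `mesh` along every axis.
    """
    return tuple(i // (d // m) for i, d, m in zip(index, dims, mesh))

def torus_neighbors(coord, mesh):
    """Six nearest neighbors of `coord` on a 3-D torus (wraparound links)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (shifted[axis] + step) % mesh[axis]
            neighbors.append(tuple(shifted))
    return neighbors

if __name__ == "__main__":
    dims, mesh = (8, 8, 8), (2, 2, 2)
    # Element (5, 2, 7) lies in the sub-cube owned by processor (1, 0, 1).
    assert owner((5, 2, 7), dims, mesh) == (1, 0, 1)
    # On a 4x4x4 torus, processor (0, 0, 0) links directly to (3, 0, 0).
    assert (3, 0, 0) in torus_neighbors((0, 0, 0), (4, 4, 4))
```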