1. Field of the Invention
Embodiments of the invention generally relate to performing efficient fast Fourier transforms (FFTs) on multi-core processor architectures. More specifically, embodiments of the invention relate to converting data into a format tailored for efficient FFTs on SIMD multi-core processor architectures.
2. Description of the Related Art
Some currently available processors support “single instruction, multiple data” (SIMD) extensions. SIMD means that a single instruction operates on multiple data items in parallel. For example, an “add” SIMD instruction may add eight 16-bit values in parallel. That is, the add operation (a single operation) is performed for eight distinct sets of data values (multiple data) in a single clock cycle. Typically, the data values are supplied as elements of a vector. Accordingly, SIMD processing is also referred to as vector processing. SIMD instructions can substantially increase execution speed by performing multiple operations as part of a single instruction. Well-known examples of SIMD extensions include multimedia extension (“MMX”) instructions, streaming SIMD extension (“SSE”) instructions, and vectored multimedia extension (“VMX”) instructions.
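By way of illustration only, the lane-wise behavior of such an “add” instruction may be modeled in software. The following is a minimal sketch, not an actual SIMD instruction: the function name simd_add_u16x8 is hypothetical, the lanes are assumed to be unsigned 16-bit values, and overflow is assumed to wrap within each lane (some real instructions saturate instead).

```python
LANES = 8        # eight elements per vector, matching the example above
MASK16 = 0xFFFF  # assumption: each lane holds an unsigned 16-bit value


def simd_add_u16x8(a, b):
    """Software model of a lane-wise add: one call, eight independent additions."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each operand must hold exactly eight lanes")
    # Each lane is computed independently of the others.
    # Assumption: overflow wraps around inside the 16-bit lane.
    return [(x + y) & MASK16 for x, y in zip(a, b)]


result = simd_add_u16x8(
    [1, 2, 3, 4, 5, 6, 7, 65535],
    [10, 20, 30, 40, 50, 60, 70, 1],
)
print(result)  # [11, 22, 33, 44, 55, 66, 77, 0]
```

In hardware, the eight additions are carried out by one instruction rather than by a loop; the list comprehension here only mirrors the per-lane result.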
Calculating FFTs efficiently on SIMD multi-core processors is difficult. For large, one-dimensional FFTs (1D FFTs), a greater amount of parallelism may be obtained because the transform contains larger groups of independent blocks of data to process. However, the 1D FFT is a fundamentally recursive algorithm with complexity O(N log N). Thus, for smaller-sized 1D FFTs, the amount of single-row parallelism is very small. Moreover, current libraries for performing FFTs are not tailored towards an FFT performed on a relatively small array of data (e.g., an FFT performed on an image of size 256×256 pixels, 512×512 pixels, or 1024×1024 pixels). Although a degree of SIMD parallelism is extracted from the 1D FFT at larger sizes, only a small amount of intra-row algorithmic parallelism is extracted at smaller sizes. Furthermore, current libraries for multi-core FFTs are standalone and do not allow the functional pipelining of work required for compute-operation-to-input/output (IO) optimization.
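The recursive structure referred to above may be sketched, for illustration only, as a textbook radix-2 decimation-in-time FFT. This is a generic formulation assumed for the purpose of the example and is not the method of any particular library; the function name fft_recursive is hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def fft_recursive(x):
    """Textbook radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n < 1:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    # Split into two half-size transforms; this is the recursive step.
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    half = n // 2
    out = [0j] * n
    # The n/2 butterflies at this stage do not depend on one another,
    # so they are the work that could be spread across vector lanes.
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


spectrum = fft_recursive([1, 2, 3, 4, 0, 0, 0, 0])
```

Each stage of a length-N transform contains only N/2 mutually independent butterflies, and there are log2(N) dependent stages, so a single short row offers little independent work per stage.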