1. Field of the Invention
The present invention relates generally to multi-dimensional data processing applications and, in particular, to transposing three-dimensional (3D) arrays for multi-core processors.
2. Background Information
Transposing three-dimensional (3D) arrays is a fundamental primitive operation used in many multi-dimensional data processing applications. Examples include seismic, medical imaging, biomedical, media industry (3D TV), and 3D Fast Fourier Transform (FFT) applications. 3D FFT, in turn, is used in solving many mathematical problems, including Poisson's equation in cylindrical coordinates, partial differential equations, and x-ray diffraction data processing. Conceptually, a 3D transpose simply changes the order of the axes; for example, given 3D data in XYZ axis order, one 3D transpose operation would change the order to ZXY. However, with large data sets, as is typical in the above applications, such an operation is challenging even for a massively parallel computing system. The operation is memory bound rather than computation bound: it involves far more data communication and movement than processing.
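As a minimal illustration of the XYZ-to-ZXY example above, the following Python sketch reorders a flat buffer out of place. It assumes that axes are listed slowest-varying first, so that Z is the contiguous axis in XYZ order; the function name and the index formulas are illustrative only and are not taken from the invention.

```python
def transpose_xyz_to_zxy(src, nx, ny, nz):
    """Return a new flat list holding src reordered from XYZ to ZXY order.

    Assumption: axes are named slowest-varying first, so in XYZ order the
    element (x, y, z) is stored at flat index (x * ny + y) * nz + z.
    """
    dst = [None] * (nx * ny * nz)
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                src_index = (x * ny + y) * nz + z   # position in XYZ order
                dst_index = (z * nx + x) * ny + y   # position in ZXY order
                dst[dst_index] = src[src_index]
    return dst
```

Every element is read once and written once with no arithmetic on the values themselves, which is consistent with the operation being bound by memory traffic rather than computation.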
Conventional approaches to 3D transpose operations fall into two groups: the first physically reorders the data, while the second performs the reordering logically, without moving any data. The latter approach requires no data movement; however, it is not necessarily as efficient as the first, especially when memory is organized in a hierarchical structure. A memory hierarchy favors accessing data in blocks, which decreases communication latencies. Moreover, the transposed data are usually “stream” processed later, which again requires accessing data in blocks. A logical transpose accesses data at fine granularity (element by element), which does not interface well with the underlying memory and processing architecture. Further, there is an associated mapping overhead. Therefore, physical transpose is usually preferred.
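The contrast can be made concrete with a small sketch of a logical transpose. The class and helper below are hypothetical and assume the same convention as the XYZ example (axes listed slowest-varying first, data stored in a flat buffer); they are meant only to show where the mapping overhead and the element-level access pattern come from.

```python
class LogicalZXYView:
    """Hypothetical read-only ZXY view over a flat buffer stored in XYZ order."""

    def __init__(self, src, nx, ny, nz):
        self.src, self.nx, self.ny, self.nz = src, nx, ny, nz

    def source_index(self, z, x, y):
        # Mapping overhead: this arithmetic runs on every element access.
        return (x * self.ny + y) * self.nz + z

    def __getitem__(self, zxy):
        z, x, y = zxy
        return self.src[self.source_index(z, x, y)]


def source_offsets_in_view_order(view):
    """List the source offsets touched when the view is walked in ZXY order."""
    return [view.source_index(z, x, y)
            for z in range(view.nz)
            for x in range(view.nx)
            for y in range(view.ny)]
```

For a 2 x 3 x 4 array, walking the view in ZXY order touches source offsets 0, 4, 8, 12, 16, 20, 1, and so on: consecutive accesses are nz elements apart rather than contiguous, which is the access pattern a block-oriented memory hierarchy handles poorly.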
Performing a physical transpose, however, has several shortcomings. First, it is usually desirable to transpose the data in place to conserve memory, given the large data size. This complicates the order in which the transpose is carried out and may limit the effective memory bandwidth, especially on shared-memory parallel systems. Second, all of the data is transposed even if only a small subset is required, as is the case when later data access is sparse.
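To show why transposing in place complicates the ordering, the sketch below uses cycle following, a well-known general technique for in-place permutation; it is offered as an illustration and not as the method of the invention. It assumes a flat buffer in XYZ order with axes listed slowest-varying first, and the function names are hypothetical. Note that the marker array costs one extra byte per element, so this simple variant is not strictly free of auxiliary storage.

```python
def zxy_destination(i, nx, ny, nz):
    """Map a flat XYZ-order index to its flat ZXY-order index."""
    x = i // (ny * nz)
    y = (i // nz) % ny
    z = i % nz
    return (z * nx + x) * ny + y


def transpose_in_place(buf, nx, ny, nz):
    """Reorder buf from XYZ to ZXY order by following permutation cycles."""
    n = nx * ny * nz
    moved = bytearray(n)            # 1 once a slot holds its final value
    for start in range(n):
        if moved[start]:
            continue
        carried = buf[start]        # value waiting to be placed
        i = start
        while True:
            j = zxy_destination(i, nx, ny, nz)
            # Drop the carried value at its destination and pick up
            # the value that was stored there.
            buf[j], carried = carried, buf[j]
            moved[j] = 1
            i = j
            if i == start:          # cycle closed
                break
```

Each cycle jumps through the buffer in an order dictated by the permutation rather than by memory layout, which suggests why in-place operation can reduce effective memory bandwidth and is harder to divide among cores.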