Computer systems often use a storage hierarchy consisting of several types of memory, for example cache, RAM, and disk storage. Ideally, every memory access would incur the same unit cost, regardless of the memory location being accessed. In such a system, operations that reorder the elements of an array (e.g., a permutation of its axes) can be performed by manipulating address computations alone, leaving the array's memory layout unaltered, so that no additional memory costs are incurred.
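By way of illustration only, the following minimal sketch shows how a permutation of axes can be expressed purely through address computation. The class name `StridedView` and its fields are hypothetical and chosen for this example; the sketch assumes a 2-dimensional array held in a flat Python list.

```python
class StridedView:
    """Hypothetical 2-D view over a flat buffer (illustrative names only)."""

    def __init__(self, data, shape, strides):
        self.data = data        # flat storage, never copied or reordered
        self.shape = shape      # (rows, columns) as seen through this view
        self.strides = strides  # elements to skip per step along each axis

    def __getitem__(self, index):
        i, j = index
        # The address computation is the only thing a transpose changes.
        return self.data[i * self.strides[0] + j * self.strides[1]]

    def transpose(self):
        # Swap shape and strides; the underlying buffer is left untouched.
        return StridedView(self.data, self.shape[::-1], self.strides[::-1])


flat = list(range(16))                  # stands in for elements a0-a15
a = StridedView(flat, (4, 4), (4, 1))   # row-major 4x4 view
t = a.transpose()                       # same storage, axes exchanged
```

Here `t[i, j]` returns the same element as `a[j, i]`, and both views share the one buffer `flat`. Under the uniform-cost assumption, nothing further is needed to reorder the array.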
In reality, storage hierarchies tend to have non-uniform access costs. Usually, a small, fast, internal memory that allows uniform, element-wise access is paired with a large, slow, external memory that requires (or performs best with) multi-element block accesses. For example, a microprocessor cache may be paired with RAM. Such a hierarchical memory architecture induces a dependence between array layout and traversal order: axes whose strides are smaller than the optimal block size can be traversed more efficiently (i.e., with fewer block reads) than those with larger strides. Thus, a memory layout that is optimal for one permutation of an array's axes may be highly inefficient for another.
FIG. 1A shows a 4×4 array 100, containing elements a0-a15, stored in row-major order in an external memory having a block size of four elements (thus each row 110(a-d) requires a single block read). Although this is a 2-dimensional array, the problem also arises with arrays of higher dimensionality. As shown in FIG. 1B, traversing a row of the array (e.g., elements a0, a1, a2 and a3, shown in bold type) requires only a single block read. However, as shown in FIG. 1C, traversing a column (e.g., elements a0, a4, a8 and a12, shown in bold type) requires four block reads. Thus, a row-major traversal of the entire array requires four block reads, while a column-major traversal requires between 4 and 16 block reads, depending on the size of the internal memory. If the internal memory holds fewer than four blocks at a time, a column-major traversal of the array will have to load some blocks multiple times, potentially causing delays in processor execution and generating unnecessary memory traffic.
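The block-read counts for the 4×4 example can be reproduced with a small simulation. The sketch below is illustrative only: the function name `count_block_reads` is hypothetical, and it assumes that the internal memory holds whole blocks and evicts the least recently used block when full, a replacement policy the text above does not specify.

```python
from collections import OrderedDict


def count_block_reads(addresses, block_size, capacity_blocks):
    """Count block reads for a sequence of element addresses.

    Assumes an internal memory of `capacity_blocks` blocks with
    least-recently-used eviction (an assumption made for this sketch).
    """
    resident = OrderedDict()  # blocks currently held in internal memory
    reads = 0
    for addr in addresses:
        block = addr // block_size
        if block in resident:
            resident.move_to_end(block)       # mark as most recently used
        else:
            reads += 1                        # block must be fetched
            resident[block] = True
            if len(resident) > capacity_blocks:
                resident.popitem(last=False)  # evict least recently used
    return reads


N = 4  # 4x4 array stored in row-major order, block size of four elements
row_major = [r * N + c for r in range(N) for c in range(N)]
col_major = [r * N + c for c in range(N) for r in range(N)]
```

With room for a single block, `count_block_reads(row_major, 4, 1)` gives 4 and `count_block_reads(col_major, 4, 1)` gives 16. With room for four blocks, the column-major traversal also needs only 4 reads. Under the assumed eviction policy, any capacity below four blocks yields 16 reads for the column-major case; other policies may fall between the two extremes.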
In some cases, the inefficiency resulting from this situation can be so large that the cost of physically rearranging the array to permit more efficient access is less than the cost of the block reads saved in subsequent traversals of the rearranged array. In such cases, it is important to perform the physical rearrangement efficiently (i.e., with a minimal number of block transfer operations). Donald Fraser, “Array Permutation by Index-Digit Permutation,” Journal of the ACM, vol. 23, no. 2, April 1976, pp. 298-309, teaches that an array may be reordered according to a permutation of the digits forming its indices. Fraser describes efficient implementations of some specialized bit permutations, but does not explain how to generate a schedule for general array transposition. Thomas H. Cormen, “Fast Permuting on Disk Arrays,” Journal of Parallel and Distributed Computing, vol. 17, 1993, pp. 41-57, teaches a method of generating a schedule for a set of bit operations that includes transposition, but the method is relatively complex.
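The relationship between transposition and a permutation of index digits can be illustrated for an array whose dimensions are powers of two. The sketch below is not the method of either cited reference and does not address the scheduling of block transfers; it only shows, in memory, that transposing such an array exchanges the row bits and column bits of each element's linear index. The function names and the power-of-two restriction are assumptions made for this example.

```python
def transpose_index(k, row_bits, col_bits):
    """Map a row-major index in a (2**row_bits x 2**col_bits) array to
    the row-major index of the same element in the transposed array."""
    row = k >> col_bits                # high-order bits select the row
    col = k & ((1 << col_bits) - 1)    # low-order bits select the column
    return (col << row_bits) | row     # exchange the two bit fields


def transpose_flat(flat, row_bits, col_bits):
    """Physically reorder a flat row-major array into its transpose."""
    out = [None] * len(flat)
    for k, value in enumerate(flat):
        out[transpose_index(k, row_bits, col_bits)] = value
    return out


# 4x4 example: two row bits and two column bits.
transposed = transpose_flat(list(range(16)), 2, 2)
```

For the 4×4 array of FIG. 1A, the first row of `transposed` holds the elements originally at indices 0, 4, 8 and 12, i.e., the first column of the original array.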