Source code written in a high-level computer language is translated by a compiler into executable instructions. The source code may include an array having one or more indices. A one-dimensional array corresponds to a vector, while a multi-dimensional array corresponds to a matrix. A one-dimensional array is a sequence of elements stored consecutively in memory. The type of an array element can be any of the basic data types, such as integer or logical. A multi-dimensional array includes elements whose location in the array is identified by two or more indices. For instance, a two-dimensional array has two indices, a row index and a column index, whereas a one-dimensional array has only one index.
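The relationship between a pair of indices and a location in memory can be sketched as follows. This is only an illustration under stated assumptions: it uses the row-major layout of the C language (Fortran, for example, stores arrays in column-major order), and the names ROWS, COLS, flat_index, and layout_matches are hypothetical, chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 3, COLS = 4 };  /* hypothetical dimensions, for illustration only */

/* Offset, in elements, of a[row][col] from the start of a row-major
   ROWS x COLS array: all of the earlier rows come first, then the column. */
size_t flat_index(size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return row * COLS + col;
}

/* Returns 1 when the compiler places a[row][col] at the byte offset
   predicted by flat_index, and 0 otherwise. */
int layout_matches(size_t row, size_t col)
{
    static int a[ROWS][COLS];
    const char *base = (const char *)a;
    const char *elem = (const char *)&a[row][col];
    return (size_t)(elem - base) == flat_index(row, col) * sizeof(int);
}
```

Under these assumptions, the element at row 1, column 0 sits COLS elements after the start of the array, which is why the choice of which index varies fastest matters for memory access order.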
Processor speed has been increasing much faster than memory speed over the past several generations of processor families. As a result, the compiler must be very aggressive in memory optimizations in order to bridge the gap between processor speed and memory speed.
When subscripts are accessed in memory order, the elements of an array are accessed in the code in the same order in which they are stored in memory, which provides good memory locality. This situation of good memory locality is referred to as unit stride. The stride of an array refers to the number of locations in memory between successive array elements, measured in units of the size of the array elements. An array with stride one has elements that are contiguous in memory and is said to have unit stride: its elements are accessed sequentially in the code in the order in which they are contiguously stored in memory.
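The difference between unit and non-unit stride can be made concrete with a small sketch. It assumes C's row-major layout, and the sizes and function names (NROWS, NCOLS, stride_in_elements, sum_unit_stride, sum_non_unit_stride, check_strides) are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { NROWS = 4, NCOLS = 5 };  /* hypothetical sizes, for illustration only */

/* Distance between two elements, measured in units of the element size. */
ptrdiff_t stride_in_elements(const int *from, const int *to)
{
    return ((const char *)to - (const char *)from) / (ptrdiff_t)sizeof(int);
}

/* The inner loop varies the last subscript, so successive accesses touch
   adjacent memory locations (unit stride in a row-major layout). */
long sum_unit_stride(int a[NROWS][NCOLS])
{
    long sum = 0;
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            sum += a[i][j];
    return sum;
}

/* The inner loop varies the first subscript, so successive accesses are
   NCOLS elements apart (non-unit stride in a row-major layout). */
long sum_non_unit_stride(int a[NROWS][NCOLS])
{
    long sum = 0;
    for (int j = 0; j < NCOLS; j++)
        for (int i = 0; i < NROWS; i++)
            sum += a[i][j];
    return sum;
}

/* Confirms the two strides and that both traversals compute the same sum. */
void check_strides(void)
{
    int a[NROWS][NCOLS];
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            a[i][j] = i * NCOLS + j;

    assert(stride_in_elements(&a[0][0], &a[0][1]) == 1);
    assert(stride_in_elements(&a[0][0], &a[1][0]) == NCOLS);
    assert(sum_unit_stride(a) == sum_non_unit_stride(a));
}
```

Both traversals visit every element exactly once and return the same result; they differ only in the order of the memory accesses, which is the property that stride describes.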
In many cases, source code is written in a manner that results in poor memory locality because the array element subscripts are not accessed in memory order, which is referred to as non-unit stride. In such cases, the traditional optimization methods of loop interchange and loop distribution cannot be applied. Memory dependencies may also prevent the outermost loop from being parallelized. In the current art, loop interchange is performed using loop distribution and loop fusion. Loop fusion is enabled by peeling off some loop iterations so that the loops are conformable. Loop peeling, however, creates loops that usually prevent loop interchange because the resulting loop nests are not perfect.
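The transformations named above can be illustrated with a minimal sketch. It shows the general textbook forms of loop interchange and loop peeling on a trivial loop nest, assuming C's row-major layout; it is not intended to represent any particular compiler's implementation, and all names (NROWS, NCOLS, add_one_non_unit, add_one_interchanged, add_one_peeled, arrays_equal, check_transforms) are hypothetical.

```c
#include <assert.h>

enum { NROWS = 4, NCOLS = 6 };  /* hypothetical sizes, for illustration only */

/* Original nest: the inner loop varies the first subscript, so successive
   accesses are NCOLS elements apart (non-unit stride). */
void add_one_non_unit(int a[NROWS][NCOLS])
{
    for (int j = 0; j < NCOLS; j++)
        for (int i = 0; i < NROWS; i++)
            a[i][j] += 1;
}

/* After loop interchange: the inner loop now varies the last subscript,
   giving unit stride. The nest is perfect, so the swap is straightforward. */
void add_one_interchanged(int a[NROWS][NCOLS])
{
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            a[i][j] += 1;
}

/* After peeling iteration i == 0 out of the inner loop: a statement now sits
   between the two loop headers, so the nest is no longer perfect and the two
   loops cannot simply be swapped. */
void add_one_peeled(int a[NROWS][NCOLS])
{
    for (int j = 0; j < NCOLS; j++) {
        a[0][j] += 1;                    /* peeled iteration i == 0 */
        for (int i = 1; i < NROWS; i++)
            a[i][j] += 1;
    }
}

/* Returns 1 when both arrays hold identical values, and 0 otherwise. */
int arrays_equal(int a[NROWS][NCOLS], int b[NROWS][NCOLS])
{
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}

/* Confirms that all three forms compute the same result. */
void check_transforms(void)
{
    int a[NROWS][NCOLS] = {{0}};
    int b[NROWS][NCOLS] = {{0}};
    int c[NROWS][NCOLS] = {{0}};

    add_one_non_unit(a);
    add_one_interchanged(b);
    add_one_peeled(c);

    assert(arrays_equal(a, b));
    assert(arrays_equal(a, c));
}
```

All three functions produce the same array contents. The peeled version shows, on a small scale, why peeling tends to get in the way of a later interchange: the loop nest it leaves behind is imperfect.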