Optimizing compilers are software systems for translation of programs from higher level languages into equivalent object or machine language code for execution on a computer. Optimization generally requires finding computationally efficient translations that reduce program run time. Such optimizations may include improved loop handling, dead code elimination, software-pipelining, improved register allocation, instruction prefetching, and/or reduction in communication cost associated with bringing data to the processor from memory.
Certain programs would be more useful if appropriate compiler optimizations were performed to decrease program run time. One such program element is a sparse matrix routine, appropriate for matrices consisting mostly of zero elements. Instead of storing every element value in computer memory, whether zero or non-zero, only the integer indices of the non-zero elements, along with the element values themselves, are stored. This greatly decreases the required computer memory, at the cost of increased computational complexity. One such complexity is that array elements must now be accessed indirectly rather than directly; i.e., the address of an array element cannot simply be determined as an offset from the base by the size of the array type. Instead, the index of the array element must first be accessed, and that index provides the needed offset from the base.
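The indexed storage scheme described above can be sketched in C as follows. This is a minimal illustration only: the `SparseVec` type, its field names, and the `sparse_dot` routine are hypothetical and are not taken from any particular library. The expression `dense[s->index[k]]` is the indirect access, since the offset into `dense` is known only after `index[k]` has been loaded.

```c
#include <stddef.h>

/* Hypothetical sparse vector: only the non-zero elements are stored. */
typedef struct {
    size_t nnz;           /* number of stored non-zero elements */
    const int *index;     /* index[k] is the position of the k-th non-zero */
    const double *value;  /* value[k] is the k-th non-zero element value */
} SparseVec;

/* Dot product of a sparse vector with a dense array.
 * value[k] is a direct access: its address is base + k * sizeof(double).
 * dense[index[k]] is an indirect access: index[k] must be loaded first,
 * and only then is the offset into dense known. */
double sparse_dot(const SparseVec *s, const double *dense)
{
    double sum = 0.0;
    for (size_t k = 0; k < s->nnz; k++) {
        sum += s->value[k] * dense[s->index[k]];
    }
    return sum;
}
```

For a vector of length eight with three non-zero elements, this representation keeps three indices and three values instead of eight values, at the price of one extra load per element accessed.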
Common compiler optimizations for decreasing run time do not normally apply to such indirectly accessed sparse matrix arrays, or even straight line/loop code with indirect pointer references, making suitable optimization strategies for such types of code problematic. For example, statically disambiguating references to indirectly accessed arrays is difficult. A compiler's ability to exploit a loop's parallelism is significantly limited when there is a lack of static information to disambiguate stores and loads of indirectly accessed arrays.
A high level language loop specifies a computation to be performed iteratively on different elements of some organized data structures, such as arrays, structures, records and so on. Computations in each iteration typically translate into loads (to access the data), computations (to compute on the data loaded), and stores (to update the data structures in memory). Achieving higher performance often entails performing these actions, related to different iterations, concurrently. To do so, loads from many successive iterations often have to be performed before stores from current iterations. When the data structures are accessed indirectly (either through pointers or via indirectly obtained indices), the dependence between stores and loads depends on the values of pointers or indices produced at run time. Therefore, at compile time there exists only a “probable” dependence. A probable store-to-load dependence between iterations in a loop creates the ambiguity that prevents the compiler from hoisting the next iteration's loads, and the computations that depend on them, above the stores from the prior iteration(s). The compiler cannot assume the absence of such a dependence, since ignoring a probable dependence (and hoisting the load) can lead to compiled code that produces incorrect results.
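The hazard can be made concrete with a small C sketch. Both function names are hypothetical, and the second function is deliberately incorrect: it shows what a compiler would produce if it hoisted the next iteration's load above the prior iteration's store without proving that the two index values differ.

```c
#include <stddef.h>

/* Reference loop: each iteration loads a[idx[i]], adds b[i], and stores
 * the result back before the next iteration begins. */
void scatter_add(double *a, const int *idx, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[idx[i]] = a[idx[i]] + b[i];
    }
}

/* UNSAFE illustration: the load for iteration i+1 is hoisted above the
 * store for iteration i. The result is correct only when
 * idx[i] != idx[i+1], which is not known at compile time. */
void scatter_add_hoisted_unsafe(double *a, const int *idx,
                                const double *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double t0 = a[idx[i]];
        double t1 = a[idx[i + 1]];      /* hoisted load */
        a[idx[i]] = t0 + b[i];          /* store from iteration i */
        a[idx[i + 1]] = t1 + b[i + 1];  /* uses a stale value on collision */
    }
    if (i < n) {
        a[idx[i]] = a[idx[i]] + b[i];   /* leftover iteration when n is odd */
    }
}
```

With `idx = {0, 1, 2, 3}` the two functions agree. With `idx = {2, 2, 1, 1}` and `b = {1, 2, 3, 4}` on a zero-initialized array, the reference loop leaves `a[2]` equal to 3, while the hoisted version leaves it equal to 2, because the second load read the element before the first store updated it.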
Accordingly, conventional optimizing compilers must conservatively assume the existence of a store-to-load (or vice versa) dependence even when there is no dependence. This is generally referred to as “memory dependence.” Compilers are often unable to statically disambiguate pointers in languages such as “C” to determine whether they may point to the same data structures; such overlapping references are generally referred to as “collisions.” This prevents the most efficient use of speculation mechanisms that allow instructions from a sequential instruction stream to be reordered. Conventional out-of-order uniprocessors cannot reorder memory access instructions until the addresses have been calculated for all preceding stores. Only at this point is it possible for out-of-order hardware to guarantee that a load will not be dependent upon any preceding stores.
A number of compilation techniques have been developed to improve the efficiency of loop computations by increasing instruction-level parallelism (ILP). One such method is software-pipelining, which improves the performance of a loop by overlapping the execution of several independent iterations. The number of cycles between the start of successive iterations in software-pipelining is called the initiation interval, which is the greater of the resource initiation interval and the recurrence initiation interval. The resource initiation interval is based on the resource usage of the loop and the available processor resources. The recurrence initiation interval of the loop is based on the number of cycles in the dependence graph for the loop and the latencies of a processor. A higher instruction-level parallelism for the loop can be realized if the recurrence initiation interval of the loop is less than or equal to its resource initiation interval. Typically, this condition is not satisfied for loops whose computations involve sparse arrays/matrices. The body of such a loop typically starts with a load whose address is itself given by an element of another array (called the index array) and ends with a store whose address is given by an element of the index array. In the absence of static information, the compiler does not know the contents of the elements of the index array. Hence, it must assume that there is a loop-carried dependence edge from the store in one iteration to the load in the next iteration. This makes the recurrence initiation interval much higher than the resource initiation interval.
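The relationship between the two bounds can be sketched in C as follows. The sketch assumes the formulation commonly used in modulo scheduling: the resource bound is the largest per-resource ratio of operations to available units, and the recurrence bound is the largest ratio of total latency to total iteration distance over the dependence cycles. All type names, function names, and the example latencies in the closing note are hypothetical and do not describe any particular processor.

```c
/* Ceiling of a / b for positive integers. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Hypothetical description of one processor resource class. */
typedef struct {
    int uses;   /* operations per iteration that need this resource */
    int units;  /* number of such units available each cycle */
} Resource;

/* Hypothetical description of one cycle in the dependence graph. */
typedef struct {
    int latency;   /* sum of operation latencies around the cycle */
    int distance;  /* sum of iteration distances around the cycle */
} Recurrence;

/* Resource initiation interval: the most heavily used resource sets it. */
int resource_ii(const Resource *r, int n)
{
    int ii = 0;
    for (int k = 0; k < n; k++) {
        int bound = ceil_div(r[k].uses, r[k].units);
        if (bound > ii) ii = bound;
    }
    return ii;
}

/* Recurrence initiation interval: the slowest dependence cycle sets it. */
int recurrence_ii(const Recurrence *c, int n)
{
    int ii = 0;
    for (int k = 0; k < n; k++) {
        int bound = ceil_div(c[k].latency, c[k].distance);
        if (bound > ii) ii = bound;
    }
    return ii;
}

/* Initiation interval: the greater of the two bounds. */
int initiation_interval(int res_ii, int rec_ii)
{
    return res_ii > rec_ii ? res_ii : rec_ii;
}
```

As an assumed example, consider a loop body with three memory operations on two memory units and a load, add, store chain of total latency 6. If the compiler must assume a store-to-load edge of distance 1, the recurrence bound is 6 and the resource bound is 2, so the initiation interval is 6. If the edge could be ruled out, the initiation interval would drop to the resource bound of 2.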
Techniques to reduce the recurrence initiation interval and improve instruction-level parallelism use methods such as loop unrolling and data speculation to exploit parallelism within a loop body. However, loop unrolling does not exploit parallelism across the loop-closing back edge; i.e., there is parallelism only within the unrolled loop body and not across successive executions of it. This generally results in poor instruction-level parallelism at the beginning and end of the loop body. In the case of data speculation, the technique assumes that almost no collisions exist in the index array in order to achieve maximal performance within the loop body. The performance gain from data speculation diminishes significantly if there are collisions in the index array, because each collision requires recovery work.
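The combination of unrolling and data speculation can be approximated in plain C as shown below. This is only a software sketch of the idea under stated assumptions: the function names are hypothetical, the loop is unrolled by two, and the collision check and recovery are written as an explicit comparison, whereas actual data speculation often relies on hardware support for detecting the conflict.

```c
#include <stddef.h>

/* Reference loop: loads and stores occur strictly in program order. */
void update_in_order(double *a, const int *idx, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[idx[i]] += b[i];
    }
}

/* Unrolled by two. The load for the second iteration is issued early,
 * on the assumption that idx[i] != idx[i+1], and is repaired when the
 * assumption turns out to be false. */
void update_speculative(double *a, const int *idx, const double *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int j0 = idx[i];
        int j1 = idx[i + 1];
        double v0 = a[j0];
        double v1 = a[j1];      /* speculative load, above the store */
        v0 += b[i];
        a[j0] = v0;             /* store from the first iteration */
        if (j1 == j0) {
            v1 = v0;            /* collision: recover with the stored value */
        }
        a[j1] = v1 + b[i + 1];
    }
    if (i < n) {
        a[idx[i]] += b[i];      /* leftover iteration when n is odd */
    }
}
```

In this sketch the two loads overlap only inside one unrolled body; nothing is overlapped across the back edge, and every collision takes the recovery path, which is where the speculative version loses its advantage.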