Compute-bound workloads generally have, at their core, one or more loops that iterate over a set of data and perform computations. Optimizing these core loops is key to improving application performance and has been studied extensively in program optimization. One well-known technique for improving the execution performance of so-called tight loops is to replicate the code of the loop body several times in succession. This replication, or unrolling, serves two purposes. First, it reduces the number of backward branches executed at runtime; such branches are expensive instructions on most processors. Second, it provides an opportunity for optimization across iterations, because several consecutive iterations are visible to the optimizer as a single sequence of code.
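As a minimal sketch of the technique described above, the following C code shows a tight loop and a manually unrolled version of it. The function names and the unroll factor of 4 are illustrative assumptions and are not taken from the text; a compiler or runtime system would normally perform this transformation automatically.

```c
#include <stddef.h>

/* Baseline tight loop: one backward branch is executed per element. */
long sum_rolled(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* The same loop with its body replicated four times in succession.
   The unroll factor of 4 is an arbitrary choice for illustration. */
long sum_unrolled4(const int *data, size_t n) {
    long total = 0;
    size_t i = 0;

    /* Main unrolled loop: one backward branch per four elements. */
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Both functions return the same result for any input. The unrolled version executes roughly a quarter of the backward branches, and the four additions in its body are visible together, which is what allows an optimizer to work across iterations.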
In one example, traditional loop unrolling, as performed in a runtime dynamic system, unrolls a loop by exactly replicating the loop body and connecting the replicated loop bodies. Another example is directed toward reducing the amount of code that must be duplicated when unrolling a loop by unrolling only ‘hot’ traces. In this approach, all loop iterations branch to the same control code, which merges into the control flow of the unrolled loop at the bottom of the unrolled loop. This technique, while reducing code size, does not expose many optimization opportunities because the control flow is merged at the bottom of the loop. A typical problem with traditional unrolling techniques is that, during unrolling, all loop back-edges become inter-iteration edges connecting the unrolled loop sequences. This limitation can reduce optimization opportunities because information associated with rarely executed parts of each iteration must be considered when optimizing the next unrolled loop iteration.
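The limitation of exact replication can be illustrated with a small C sketch. Everything here is a hypothetical example rather than code from any particular system: `rare_fixup` stands in for a rarely executed part of the iteration, and the unroll factor of 2 is arbitrary. Because the rare path of the first body copy flows into the second copy, a value such as `*scale` cannot be assumed unchanged between the two copies.

```c
#include <stddef.h>

/* Hypothetical rarely executed path: resets the scale factor. */
static void rare_fixup(int *scale) {
    *scale = 1;
}

/* Original loop, shown for comparison. */
long scaled_sum_rolled(const int *data, size_t n, int *scale) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0) {
            rare_fixup(scale);          /* rare path */
        }
        total += (long)data[i] * *scale;
    }
    return total;
}

/* Traditional unrolling by 2: the whole body, rare path included,
   is replicated exactly and the copies are connected in sequence. */
long scaled_sum_unrolled2(const int *data, size_t n, int *scale) {
    long total = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        /* Body copy 1. */
        if (data[i] < 0) {
            rare_fixup(scale);          /* rare path */
        }
        total += (long)data[i] * *scale;

        /* The hot and rare paths of copy 1 both reach this point, so
           copy 2 must reload *scale instead of reusing the value read
           in copy 1. */

        /* Body copy 2. */
        if (data[i + 1] < 0) {
            rare_fixup(scale);          /* rare path */
        }
        total += (long)data[i + 1] * *scale;
    }

    /* Remainder loop for an odd element count. */
    for (; i < n; i++) {
        if (data[i] < 0) {
            rare_fixup(scale);
        }
        total += (long)data[i] * *scale;
    }
    return total;
}
```

If only the hot path connected copy 1 to copy 2, the value of `*scale` read in copy 1 could be reused in copy 2. With exact replication, the rare call to `rare_fixup` forces the more conservative reload in every unrolled iteration.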