The presence of loops in program code is a source of a significant amount of instruction-level parallelism (ILP). In a superscalar architecture, loop parallelization is achieved through the combination of wide out-of-order execution and dynamic register renaming. For each iteration, the same instructions of the loop body are allocated into the scheduling window, and hardware renaming logic dynamically assigns new physical registers to the logical register addresses encoded in the instructions. This allows the execution of multiple loop iterations to overlap through out-of-order scheduling of instructions from different iterations, thereby exploiting the inter-iteration parallelism inherent in loops.
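The renaming mechanism described above can be illustrated with a minimal sketch. The following model is an assumption for illustration only (the register names, the unbounded physical register supply, and the two-instruction loop body are hypothetical, not any specific microarchitecture): each iteration writes the same logical destination registers, and the renamer assigns them fresh physical registers, removing the write-after-write and write-after-read hazards between iterations so that only true data dependencies remain.

```python
# Sketch of dynamic register renaming across loop iterations.
# Assumed model: logical registers r1..r3, an unbounded free list of
# physical registers p0, p1, ... (real hardware has a finite file).

from itertools import count

class Renamer:
    def __init__(self):
        self.free = count()   # endless supply of physical registers
        self.map = {}         # logical register -> current physical register

    def read(self, logical):
        # a source operand reads the most recent mapping
        return self.map[logical]

    def write(self, logical):
        # a destination operand receives a brand-new physical register
        phys = f"p{next(self.free)}"
        self.map[logical] = phys
        return phys

renamer = Renamer()
renamer.map["r1"] = "p_base"  # loop-invariant base address, live-in value

renamed_trace = []
# loop body per iteration i:  ld r2 <- [r1 + i] ; add r3 <- r2, #1
for i in range(3):
    base = renamer.read("r1")
    ld_dst = renamer.write("r2")
    add_src = renamer.read("r2")
    add_dst = renamer.write("r3")
    renamed_trace.append((f"ld {ld_dst} <- [{base}+{i}]",
                          f"add {add_dst} <- {add_src}, #1"))

for ld, add in renamed_trace:
    print(ld, ";", add)
```

Because every iteration's `r2` and `r3` map to distinct physical registers (`p0`/`p1`, then `p2`/`p3`, and so on), the three iterations carry no false dependencies on each other and their instructions can be scheduled out of order.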
A highly parallel strand-based architecture can exploit inter-iteration parallelism more efficiently than a superscalar architecture, extracting more ILP from loops. In this approach to loop parallelization, multiple loop iterations are processed simultaneously by multiple strands, resulting in out-of-order fetch, allocation, and execution of instructions from different iterations. Thus, instructions of a particular iteration can be executed even if instructions of previous iterations have not yet been fetched, which is impossible in a superscalar architecture due to its in-order fetch and allocation. The dynamic register renaming technique used in superscalar architectures is neither applicable nor efficient in a highly parallel strand-based architecture, both because of the out-of-order instruction fetch and allocation and because the execution width made possible by processing multiple loop iterations in parallel is much larger than practical renaming hardware can support.
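The strand-based alternative can be sketched in the same spirit. The model below is an assumption for illustration (the `strand` function, the private per-strand register dictionary, and the example loop body are hypothetical): each iteration runs as an independent strand with its own private copy of the architectural registers, so no shared renaming map is needed, and the strands may be fetched and executed in any order, as long as iterations are independent.

```python
# Sketch of strand-based loop parallelization. Assumed model: one strand
# per iteration, each with a private register context; the shuffled order
# stands in for out-of-order fetch/allocation of strands.

import random

a = [3, 1, 4, 1, 5]
result = [None] * len(a)

def strand(i, regs):
    # the same logical register names exist in every strand, but each
    # strand owns a private copy, so no inter-strand renaming is needed
    regs["r2"] = a[i]            # ld  r2 <- a[i]
    regs["r3"] = regs["r2"] + 1  # add r3 <- r2, #1
    result[i] = regs["r3"]       # st  r3 -> result[i]

# out-of-order fetch/allocation: iteration k may start (and finish)
# before iteration k-1 has even been fetched
order = list(range(len(a)))
random.shuffle(order)
for i in order:
    strand(i, regs={})

print(result)  # same final result regardless of the strand order
```

The per-strand register context is what replaces dynamic renaming here: isolation by construction removes the false dependencies that renaming hardware would otherwise have to break, at any execution width.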