1. Field
The described embodiments relate to techniques for improving the performance of computer systems. More specifically, the described embodiments relate to running-AND, running-OR, running-XOR, and running-multiply instructions for processing vectors in program code.
2. Related Art
Recent advances in processor design have led to the development of a number of different processor architectures. For example, processor designers have created superscalar processors that exploit instruction-level parallelism (ILP), multi-core processors that exploit thread-level parallelism (TLP), and vector processors that exploit data-level parallelism (DLP). Each of these processor architectures has unique advantages and disadvantages which have either encouraged or hampered the widespread adoption of the architecture. For example, because ILP processors can often operate on existing program code that has undergone only minor modifications, these processors have achieved widespread adoption. However, TLP and DLP processors typically require applications to be manually re-coded to gain the benefit of the parallelism that they offer, a process that requires extensive effort. Consequently, TLP and DLP processors have not gained widespread adoption for general-purpose applications.
One significant issue affecting the adoption of DLP processors is the vectorization of loops in program code. In a typical program, a large portion of execution time is spent in loops. Unfortunately, many of these loops have characteristics that render them unvectorizable in existing DLP processors. Thus, the performance benefits gained from attempting to vectorize program code can be limited.
One significant obstacle to vectorizing loops in program code in existing systems is dependencies between iterations of the loop. For example, loop-carried data dependencies and memory-address aliasing are two such dependencies. These dependencies can be identified by a compiler during the compiler's static analysis of program code, but they cannot be completely resolved until runtime data is available. Thus, because the compiler cannot conclusively determine that runtime dependencies will not be encountered, the compiler cannot vectorize the loop. Hence, because existing systems require that the compiler determine the extent of available parallelism during compilation, relatively little code can be vectorized.