1. Field of the Invention
The present invention relates generally to a data processing system and more specifically to a method, computer program product and system for generating versioned code with runtime alignment for single instruction multiple data units.
2. Description of the Related Art
Single Instruction Multiple Data (SIMD) units operating on packed, fixed-length vectors, such as AltiVec for IBM®, have become a popular addition to most general-purpose microprocessors. However, the difficulty of generating code for such units (a process known as simdization) remains a hindrance to their wider acceptance.
The alignment constraint of SIMD memory units is a hardware feature that can significantly impact the effectiveness of simdization. For example, the memory operations in an AltiVec unit can only access 16 contiguous bytes of memory starting at a 16-byte aligned address. To satisfy alignment constraints imposed by the hardware, the software or compiler must insert data reorganization code that explicitly realigns data during the simdization process. Additional alignment handling overhead may be incurred if the alignments of some memory accesses in the code are known only at runtime (referred to as runtime alignment). Embodiments of the present invention generate more efficient simdized code in the presence of runtime alignment.
To demonstrate the scenarios that can benefit from the present invention, consider the loop examples in FIGS. 1A and 1B (the illustrative code examples are in the C programming language). For the loop in FIG. 1A, the bases of arrays a, b, and c are aligned at 16-byte memory boundaries, and n is a runtime value. Because the value of n is unknown at compile time, accesses a[i+2+n], b[i+1+n], and c[i+3+n] (when i=0) have runtime alignments of (4n+8) mod 16, (4n+4) mod 16, and (4n+12) mod 16, respectively, the factor of 4 reflecting 4-byte array elements. Such loops are common in the internal representation of a compiler after loop normalization if the original loop has a lower bound of n. The modulus operator is denoted in some figures by the percent sign (%).
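The alignment arithmetic above can be checked with a minimal C sketch. It assumes 4-byte array elements (implied by the factor of 4 in the expressions) and a 16-byte aligned array base. The names runtime_alignment and VECTOR_BYTES are hypothetical and do not come from the figures.

```c
#include <assert.h>

#define VECTOR_BYTES 16u  /* vector length of the AltiVec example in the text */

/* Byte offset, within a 16-byte line, of element (offset + n) of an array
 * whose base is 16-byte aligned. elem_size is the element size in bytes
 * (assumed to be 4 for the accesses discussed in the text). */
static unsigned runtime_alignment(long offset, long n, unsigned elem_size)
{
    assert(elem_size > 0);
    /* Unsigned arithmetic keeps the result in 0..15 even for negative n,
     * because 16 divides the unsigned wrap-around modulus. */
    unsigned long byte_offset = (unsigned long)(offset + n) * elem_size;
    return (unsigned)(byte_offset % VECTOR_BYTES);
}
```

For example, runtime_alignment(2, n, 4) evaluates (4n+8) mod 16 for the access a[i+2+n] at i=0. The result depends only on n mod 4, which is why a compiler cannot resolve it when n is a runtime value.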
Similarly, for the loop in FIG. 1B, because a, b, and c are pointers passed into a function, their alignments may not be known at compile time.
One approach to speeding up simdized loops with runtime alignment is loop versioning. Code versioning is a well-known technique that creates multiple specializations of a loop, each of which is guarded by a different runtime condition. These guard conditions decide, at runtime, which version of the loop is to be executed.
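Loop versioning can be sketched in C as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the loop body (an element-wise add of float arrays) is chosen for the example, and scalar code that handles four floats per step stands in for the vector instructions a simdizing compiler would actually emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* General version: makes no alignment assumption. */
static void add_general(float *a, const float *b, const float *c, size_t len)
{
    for (size_t i = 0; i < len; i++)
        a[i] = b[i] + c[i];
}

/* Specialized version: may assume all three pointers are 16-byte aligned.
 * Scalar stand-in for a simdized loop, four floats per step. */
static void add_aligned(float *a, const float *b, const float *c, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    for (; i < len; i++)  /* leftover elements */
        a[i] = b[i] + c[i];
}

/* Versioned loop: the guard is evaluated once, outside the loop.
 * Returns 1 if the specialized version ran, 0 otherwise. */
static int add_versioned(float *a, const float *b, const float *c, size_t len)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(c)) {
        add_aligned(a, b, c, len);
        return 1;
    }
    add_general(a, b, c, len);
    return 0;
}
```

Because the guard is tested once before the loop rather than on every iteration, its cost is amortized over the whole loop.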
The most common loop versioning technique for runtime alignment is to create, as one version, a specialization of the loop for the case in which all runtime accesses are aligned (A. Bik, M. Girkar, P. M. Grey, and X. Tian. Automatic Intra-Register Vectorization for the Intel Architecture. International Journal of Parallel Programming, (2):65-98, April 2002). Since this technique creates a version under the condition that all runtime alignments are zero, it is referred to herein as “versioning for absolute alignment-zero”. Note, however, that in the above example, because of the relative difference between n, n+1, and n+2, the runtime condition (n mod 4)==0 && ((n+1) mod 4)==0 && ((n+2) mod 4)==0 can never be satisfied, no matter what the value of n is.
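That the guard condition quoted above can never hold can be confirmed by checking each residue of n modulo 4, since the condition depends on nothing else. The following C sketch does so; the function names are hypothetical.

```c
#include <assert.h>

/* Guard for "versioning for absolute alignment-zero", written exactly as
 * the condition appears in the text. */
static int all_aligned_guard(unsigned long n)
{
    return (n % 4 == 0) && ((n + 1) % 4 == 0) && ((n + 2) % 4 == 0);
}

/* Counts the residues of n modulo 4 for which the guard holds. Only the
 * residue matters, so four cases cover every possible n. */
static int count_satisfying_residues(void)
{
    int count = 0;
    for (unsigned long n = 0; n < 4; n++)
        count += all_aligned_guard(n);
    return count;
}

/* Self-check: no residue satisfies the guard. */
static void check_guard_unsatisfiable(void)
{
    assert(count_satisfying_residues() == 0);
}
```

The three terms require n, n+1, and n+2 to be multiples of 4 simultaneously, which is impossible for consecutive integers, so the specialized version guarded by this condition would never execute.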
Another technique is to construct a pre-loop that peels, at runtime, iterations of the original loop until all accesses in the loop reach an aligned boundary (S. Larsen, E. Witchel, and S. Amarasinghe. Increasing and Detecting Memory Address Congruence. In Proceedings of 11th International Conference on Parallel Architectures and Compilation Techniques, September 2002). At that point, the pre-loop exits and execution enters a version of the loop in which all accesses are aligned. In essence, this technique creates two versions of the loop: the pre-loop, which runs in sequential mode, and the simdized loop, in which all accesses are aligned. The versioning condition is determined by the pre-loop exit condition.
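The pre-loop technique can be sketched in C as follows. This is an illustration under assumptions, not the cited authors' implementation: the function names and the element-wise add body are hypothetical, floats are assumed to be 4 bytes, and scalar code handling four floats per step stands in for the simdized loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Runs a sequential pre-loop until every access is 16-byte aligned, then
 * an aligned loop. Returns the number of iterations the pre-loop ran. */
static size_t add_peeled(float *a, const float *b, const float *c, size_t len)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;

    /* Pre-loop: the exit condition is re-evaluated on every iteration. */
    while (i < len &&
           !(is_aligned16(&a[i]) && is_aligned16(&b[i]) && is_aligned16(&c[i]))) {
        a[i] = b[i] + c[i];
        i++;
    }
    size_t peeled = i;

    /* Aligned version: scalar stand-in for the simdized loop. */
    for (; i + 4 <= len; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    for (; i < len; i++)  /* leftover elements */
        a[i] = b[i] + c[i];

    return peeled;
}
```

When the three pointers share the same misalignment, the pre-loop exits after at most three iterations. When their misalignments differ, the exit condition never becomes true and every iteration runs in the pre-loop, each paying for the alignment checks.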
This approach has two major drawbacks. First, the pre-loop contains runtime checks of the guarding condition inside the loop body and is therefore very expensive. When the exit condition is not satisfied at runtime, the sequential version will be much slower than the original sequential loop. Second, the exit condition requires all accesses with runtime alignments to reach a 16-byte aligned boundary at the same time, which cannot happen when the relative alignments of the accesses differ. Likewise, the previous approach (versioning for absolute alignment-zero) still versions the loop even when its guard condition can never be satisfied.