1. Technical Field
The present invention relates in general to a system and method for vectorizing loop code for execution on Single Instruction Multiple Datapath (SIMD) architectures that impose strict alignment constraints on the data.
2. Description of the Related Art
Multimedia extensions (MMEs) have become one of the most popular additions to general-purpose microprocessors. Existing multimedia extensions can be characterized as Single Instruction Multiple Datapath (SIMD) units that support packed fixed-length vectors. The traditional programming model for multimedia extensions has been explicit vector programming using either (in-line) assembly or intrinsic functions embedded in a high-level programming language. Explicit vector programming is time-consuming and error-prone. A promising alternative is to exploit vectorization technology to automatically generate SIMD codes from programs written in standard high-level languages.
Although vectorization was studied extensively for traditional vector processors decades ago, vectorization for SIMD architectures has raised new issues due to several fundamental differences between the two architectures. See, e.g., REN, Gang, et al. A Preliminary Study on the Vectorization of Multimedia Applications. In 16th International Workshop of Languages and Compilers for Parallel Computing. October 2003. To distinguish between the two types of vectorization, we refer to the latter as simdization. One such fundamental difference comes from the memory unit. The memory unit of a typical SIMD processor bears more resemblance to that of a wide scalar processor than to that of a traditional vector processor. In the ALTIVEC instruction set found on certain POWERPC microprocessors (produced by International Business Machines Corporation and Motorola, Inc.), for example, a load instruction loads 16 contiguous bytes from a 16-byte-aligned address, ignoring the low-order 4 bits of the memory address given in the instruction. The same applies to store instructions. In this paper, architectures with alignment constraints refer to machines that support only loads and stores of register-length aligned memory.
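The address truncation described above can be modeled in scalar C. The following is only an illustrative sketch, not actual ALTIVEC code: the function name model_aligned_load and the constant VEC_BYTES are hypothetical, and the 16-byte register length is taken from the example in the text.

```c
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16  /* assumed vector register length in bytes */

/* Hypothetical scalar model of an alignment-constrained vector load:
 * the low-order 4 bits of the address are dropped, so the 16 bytes
 * returned always start at the enclosing 16-byte boundary. */
void model_aligned_load(const void *addr, uint8_t out[VEC_BYTES])
{
    uintptr_t truncated = (uintptr_t)addr & ~(uintptr_t)(VEC_BYTES - 1);
    memcpy(out, (const void *)truncated, VEC_BYTES);
}
```

Under this model, a load issued for any address inside a 16-byte block returns the same 16 bytes as a load issued for the start of that block, which is why a misaligned reference cannot simply be loaded directly.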
There has been a recent spike of interest in compiler techniques to automatically extract SIMD parallelism from programs. See, e.g., LARSEN, Samuel, et al. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of SIGPLAN Conference on Programming Language Design and Implementation. June 2000, pp. 145-156; BIK, Aart, et al. Automatic Intra-Register Vectorization for the Intel Architecture. Int. J. of Parallel Programming. April 2002, vol. 30, no. 2, pp. 65-98; KRALL, Andreas, et al. Compilation Techniques for Multimedia Processors. Int. J. of Parallel Programming. August 2000, vol. 28, no. 4, pp. 347-361; SRERAMAN, N., et al. A Vectorizing Compiler for Multimedia Extensions. Int. J. of Parallel Programming, August 2000, vol. 28, no. 4, pp. 363-400; LEE, Corinna G., et al. Simple Vector Microprocessors for Multimedia Applications. In Proceedings of International Symposium on Microarchitecture. 1998, pp. 25-36; and NAISHLOS, Dorit, et al. Vectorizing for a SIMD DSP Architecture. In Proceedings of International Conference on Compilers, Architectures, and Synthesis for Embedded Systems. October 2003, pp. 2-11. This upsurge was driven by the increasing prevalence of SIMD architectures in multimedia processors. Two principal techniques have been used: the traditional loop-based vectorization pioneered for vector supercomputers (e.g., ALLEN, John Randal, et al. Automatic Translation of Fortran Programs to Vector Form. ACM Transactions on Programming Languages and Systems. October 1987, vol. 4, pp. 491-542; and ZIMA, Hans, et al. Supercompilers for Parallel and Vector Computers. Reading, Mass.: Addison-Wesley/ACM Press, 1990. ISBN 0201175606.) and the unroll-and-pack approach first proposed by Larsen and Amarasinghe in LARSEN, Samuel, et al. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of SIGPLAN Conference on Programming Language Design and Implementation. June 2000, pp. 145-156.
The alignment constraints of SIMD memory units present a great challenge to automatic simdization. Consider the code fragment in FIG. 1, where integer arrays a, b, and c are aligned. (An aligned reference is one whose data reside at an address that is a multiple of the vector register size.) Although this loop is easily vectorizable for traditional vector processors, simdizing it for SIMD architectures with alignment constraints is non-trivial. Hence, the most commonly used policy today is to simdize a loop only if all memory references in the loop are aligned.
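To make the difficulty concrete, the sketch below shows a loop over aligned arrays whose individual references are nonetheless misaligned. It is a hypothetical illustration and not a reproduction of FIG. 1: the index offsets, the trip count N, and the helper ref_offset are assumptions, as is the 16-byte register length.

```c
#include <stdint.h>

#define VEC_BYTES 16   /* assumed vector register length in bytes */
#define N 100          /* assumed trip count */

/* The base arrays themselves are 16-byte aligned (C11 _Alignas). */
_Alignas(VEC_BYTES) int32_t a[N + 3], b[N + 3], c[N + 3];

/* Byte offset of an address within its 16-byte block; 0 means aligned. */
unsigned ref_offset(const void *p)
{
    return (unsigned)((uintptr_t)p & (VEC_BYTES - 1));
}

/* Hypothetical loop: with 4-byte elements, the first iteration touches
 * a[3], b[1], and c[2], which sit 12, 4, and 8 bytes past a 16-byte
 * boundary. Each reference is misaligned by a different amount. */
void misaligned_loop(void)
{
    for (int i = 0; i < N; i++)
        a[i + 3] = b[i + 1] + c[i + 2];
}
```

Because the three data streams start at different offsets within their 16-byte blocks, elements that must be combined in one iteration do not occupy the same slots of their respective vector registers after an aligned load. A simdizer therefore has to realign the data, not merely replace scalar operations with vector ones.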
A very extensive discussion of alignment considerations is provided by LARSEN, Samuel, et al. Increasing and Detecting Memory Address Congruence. In Proceedings of 11th International Conference on Parallel Architectures and Compilation Techniques. September 2002. However, that work is concerned with the detection of memory alignments and with techniques to increase the number of aligned references in a loop, whereas our work focuses on generating optimized SIMD codes in the presence of misaligned references. The two approaches are complementary. The use of loop peeling to align accesses was discussed in that work as well as in the aforementioned BIK reference. The loop peeling scheme is equivalent to the eager-shift policy with the restriction that all memory references in the loop must have the same misalignment. Even under this condition, our scheme has the advantage of generating a simdized prologue and epilogue, which are by-products of peeling from the simdized loop.
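Conventional loop peeling can be sketched as follows. This is an illustrative sketch of the general technique rather than code from any of the cited works: the function name add_one_peeled and the loop body are hypothetical, and the inner fixed-length loop merely stands in for a single SIMD operation on one aligned block. Note that the prologue and epilogue here are scalar, in contrast to the simdized prologue and epilogue mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16                              /* assumed register length */
#define VEC_ELEMS (VEC_BYTES / sizeof(int32_t))   /* 4 elements per vector */

/* Computes dst[i] = src[i] + 1 and returns the number of peeled
 * iterations. Assumes dst and src share the same misalignment,
 * which is the restriction stated in the text. */
size_t add_one_peeled(int32_t *dst, const int32_t *src, size_t n)
{
    size_t i = 0;

    /* Scalar prologue: peel iterations until dst[i] is 16-byte aligned. */
    while (i < n && ((uintptr_t)&dst[i] & (VEC_BYTES - 1)) != 0) {
        dst[i] = src[i] + 1;
        i++;
    }
    size_t peeled = i;

    /* Steady state: each trip covers one aligned 16-byte block. */
    for (; i + VEC_ELEMS <= n; i += VEC_ELEMS)
        for (size_t k = 0; k < VEC_ELEMS; k++)
            dst[i + k] = src[i + k] + 1;

    /* Scalar epilogue: leftover iterations that do not fill a block. */
    for (; i < n; i++)
        dst[i] = src[i] + 1;

    return peeled;
}
```

If the references had different misalignments, no single peel count could align all of them at once, which is why peeling alone covers only the restricted case.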
Direct code generation for misaligned references has been discussed in several prior works. The vectorization of misaligned loads and stores using the VIS instruction set is described in CHEONG, Gerald, et al. An Optimizer for Multimedia Instruction Sets. In Second SUIF Compiler Workshop. August 1997. The aforementioned BIK, et al. reference described a specific code sequence of aligned loads and shuffles to load memory references that cross cache line boundaries, which is implemented in Intel's compiler for SSE2. However, their method is not discussed in the context of general misalignment handling.
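The general idea of synthesizing a misaligned load from aligned loads plus a shuffle can be modeled in scalar C as below. This is a sketch of the concept only, not the specific sequence from the BIK or CHEONG references: the function name and the 16-byte register length are assumptions, and memcpy stands in for the hardware load and shuffle instructions.

```c
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16  /* assumed vector register length in bytes */

/* Hypothetical model: fetch the two aligned 16-byte blocks that
 * straddle addr, then select the 16 bytes that begin at addr.
 * Both blocks must be readable, including when addr is aligned. */
void load_unaligned_via_aligned(const uint8_t *addr, uint8_t out[VEC_BYTES])
{
    uintptr_t base  = (uintptr_t)addr & ~(uintptr_t)(VEC_BYTES - 1);
    unsigned  shift = (unsigned)((uintptr_t)addr - base);
    uint8_t   pair[2 * VEC_BYTES];

    memcpy(pair, (const void *)base, VEC_BYTES);               /* 1st aligned load */
    memcpy(pair + VEC_BYTES, (const void *)(base + VEC_BYTES),
           VEC_BYTES);                                         /* 2nd aligned load */
    memcpy(out, pair + shift, VEC_BYTES);                      /* shuffle step */
}
```

When consecutive vectors of one contiguous stream are loaded this way, the second aligned block of one vector is the first aligned block of the next, so a code generator can reuse it instead of loading it twice.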
The VAST compiler, a commercial product by Crescent Bay Software, has some limited ability to simdize loops with multiple misaligned references, unknown loop bounds, and runtime alignments, and to exploit reuse when aligning a stream of contiguous memory. The VAST compiler, however, produces less-than-optimal simdized code, as its highly generalized scheme for handling misalignment can produce additional compilation overhead.
An interesting simdization scheme using indirect register accesses is discussed in the aforementioned NAISHLOS, et al. reference. However, their method is specific to the eLite processor, which supports more advanced vector operations (such as gather and scatter operations) than are available on typical MME processors. In SHIN, Jaewook, et al. Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures. In Proceedings of International Conference on Parallel Architectures and Compilation Techniques. September 2002, register packing and shifting instructions were used to exploit temporal and spatial reuse in vector registers. However, their work does not address alignment handling.
Another work that is of note, but which is in the area of compiling for distributed memory systems as opposed to SIMD architectures, is CHATTERJEE, Siddhartha, et al. Modeling Data-Parallel Programs with the Alignment-Distribution Graph. J. of Programming Languages. 1994, vol. 2, no. 3, pp. 227-258.
Previous work in the area of vectorization and simdization of loop code, however, has failed to address the issue of what are referred to herein as “heterogeneous loops.” A heterogeneous loop contains different statements that can be efficiently executed on either a scalar processor, a SIMD or other vector processor, or both. For example, on the POWERPC 970 processor, double-precision floating point operations can only be executed as scalar operations, as the vector processing unit does not support double-precision operations. Most fixed-point operations can be executed on either the vector unit or a scalar unit, but the choice as to whether to vectorize a given loop or not is not always straightforward. For example, on the POWERPC 970, 32-bit fixed-point multiplication can be executed on the vector unit using a sequence of SIMD instructions that processes four elements at a time, while the scalar unit requires only one instruction but can process only one element at a time.
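The scalar/vector tradeoff in the multiplication example can be expressed as a deliberately simplistic instruction-count comparison. The function below is a hypothetical sketch: its name and parameters are assumptions, no actual POWERPC 970 instruction counts or latencies are implied, and a real cost model would also have to account for latency, issue width, and realignment overhead.

```c
#include <stdbool.h>

/* Hypothetical first-order model: the vector unit is preferred when the
 * SIMD instruction sequence for one vector is shorter than the scalar
 * instruction sequence needed for the same number of elements. */
bool vector_unit_wins(unsigned simd_insns_per_vector,
                      unsigned elems_per_vector,
                      unsigned scalar_insns_per_elem)
{
    return simd_insns_per_vector < elems_per_vector * scalar_insns_per_elem;
}
```

With four elements per vector and one scalar instruction per element, this model favors the vector unit only when the SIMD sequence is shorter than four instructions, which illustrates why the choice is not always straightforward.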
Existing compilers, such as the aforementioned VAST compiler, do not effectively address the scalar/vector tradeoff associated with heterogeneous loops. For example, the VAST compiler, when faced with a loop containing some operations that can be executed on a vector unit and some operations that cannot be executed on a vector unit, will simply perform no vectorization of the loop at all.
Another approach that has been proposed is to split such a loop into two loops, one with operations to be executed on the vector unit, which is subsequently simdized, and the other with operations to be executed on scalar units. There are two drawbacks to this approach, however. First, splitting the loop creates more loops with shorter loop bodies. This makes it more difficult to schedule instruction execution so as to provide for instruction-level parallelism. Second, splitting the loop results in separate loops with either all vector instructions or all scalar instructions. This means that when the vector loop is executed, the scalar units of the processor may sit idle, and vice versa.
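The loop-splitting approach described above can be illustrated with the following sketch. The loop bodies and function names are hypothetical; the example assumes a target, like the one described earlier, whose vector unit handles 32-bit fixed-point operations but not double-precision operations, and it assumes the two statements are independent so that splitting is legal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heterogeneous loop: one statement is a candidate for the
 * vector unit, the other can run only on the scalar unit. */
void fused_loop(int32_t *x, const int32_t *y,
                double *d, const double *e, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = y[i] + 1;     /* fixed-point: vector-unit candidate */
        d[i] = e[i] * 0.5;   /* double precision: scalar unit only */
    }
}

/* The same computation after splitting into two homogeneous loops. */
void split_loops(int32_t *x, const int32_t *y,
                 double *d, const double *e, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* loop 1: can be simdized */
        x[i] = y[i] + 1;
    for (size_t i = 0; i < n; i++)   /* loop 2: stays scalar */
        d[i] = e[i] * 0.5;
}
```

Both versions compute the same results, but each split loop has a shorter body containing only one kind of operation, which is the source of the two drawbacks noted above: less material for the instruction scheduler, and one class of functional unit left idle while each loop runs.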
Thus, there is a need for a compilation scheme to produce optimized code for heterogeneous loops. The present invention provides a solution to these and other problems, and offers other advantages over previous solutions.