1. Technical Field
The present invention relates in general to a system and method for vectorizing loop code for execution on Single Instruction Multiple Datapath (SIMD) architectures that impose strict alignment constraints on the data.
2. Description of the Related Art
Multimedia extensions (MMEs) have become one of the most popular additions to general-purpose microprocessors. Existing multimedia extensions can be characterized as Single Instruction Multiple Datapath (SIMD) units that support packed fixed-length vectors. The traditional programming model for multimedia extensions has been explicit vector programming using either (in-line) assembly or intrinsic functions embedded in a high-level programming language. Explicit vector programming is time-consuming and error-prone. A promising alternative is to exploit vectorization technology to automatically generate SIMD codes from programs written in standard high-level languages.
Although vectorization was studied extensively for traditional vector processors decades ago, vectorization for SIMD architectures has raised new issues due to several fundamental differences between the two architectures. See, e.g., Ren, Gang, et al. A Preliminary Study on the Vectorization of Multimedia Applications. In 16th International Workshop of Languages and Compilers for Parallel Computing. October 2003. To distinguish between the two types of vectorization, we refer to the latter as simdization. One such fundamental difference comes from the memory unit. The memory unit of a typical SIMD processor bears more resemblance to that of a wide scalar processor than to that of a traditional vector processor. In the VMX instruction set found on certain POWERPC microprocessors (produced by International Business Machines Corporation and Motorola, Inc.), for example, a load instruction loads 16 contiguous bytes from a 16-byte aligned address, ignoring the low-order 4 bits of the memory address given in the instruction. The same applies to store instructions. Herein, architectures with alignment constraints refer to machines that support only loads and stores of register-length aligned memory.
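The address truncation described above can be illustrated with a minimal C sketch. It is a simulation only and not VMX code: the names truncating_load and VLEN are hypothetical, and a byte buffer whose first byte is assumed to sit on a 16-byte boundary stands in for memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { VLEN = 16 };  /* vector register length in bytes (assumed) */

/* Simulated aligned load. Copies the 16-byte chunk that contains byte
   `offset` of `mem`, ignoring the low-order 4 bits of the offset, and
   returns the offset at which the chunk starts. `mem[0]` is assumed to
   lie on a 16-byte boundary. */
static size_t truncating_load(const unsigned char *mem, size_t size,
                              size_t offset, unsigned char out[VLEN])
{
    size_t start = offset & ~(size_t)(VLEN - 1);  /* clear the low 4 bits */
    assert(start + VLEN <= size);                 /* chunk must be in range */
    memcpy(out, mem + start, VLEN);
    return start;
}
```

In this sketch a request for byte 21 yields the chunk starting at byte 16, so the requested datum arrives at position 5 of the register rather than position 0. That displacement is the misalignment a simdizing compiler has to compensate for.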
There has been a recent spike of interest in compiler techniques to automatically extract SIMD parallelism from programs. See, e.g., LARSEN, Samuel, et al. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of SIGPLAN Conference on Programming Language Design and Implementation. June 2000, pp. 145-156; BIK, Aart, et al. Automatic Intra-Register Vectorization for the Intel Architecture. Int. J. of Parallel Programming. April 2002, vol. 30, no. 2, pp. 65-98; KRALL, Andreas, et al. Compilation Techniques for Multimedia Processors. Int. J. of Parallel Programming. August 2000, vol. 28, no. 4, pp. 347-361; SRERAMAN, N., et al. A Vectorizing Compiler for Multimedia Extensions. Int. J. of Parallel Programming, August 2000, vol. 28, no. 4, pp. 363-400; LEE, Corinna G., et al. Simple Vector Microprocessors for Multimedia Applications. In Proceedings of International Symposium on Microarchitecture. 1998, pp. 25-36; and NAISHLOS, Dorit, et al. Vectorizing for a SIMD DSP Architecture. In Proceedings of International Conference on Compilers, Architectures, and Synthesis for Embedded Systems. October 2003, pp. 2-11. This upsurge was driven by the increasing prevalence of SIMD architectures in multimedia processors. Two principal techniques have been used: the traditional loop-based vectorization pioneered for vector supercomputers (e.g., ALLEN, John Randal, et al. Automatic Translation of Fortran Programs to Vector Form. ACM Transactions on Programming Languages and Systems. October 1987, vol. 4, pp. 491-542; and ZIMA, Hans, et al. Supercompilers for Parallel and Vector Computers. Reading, MA: Addison-Wesley/ACM Press, 1990. ISBN 0201175606.) and the unroll-and-pack approach first proposed by Larsen and Amarasinghe in LARSEN, Samuel, et al. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of SIGPLAN Conference on Programming Language Design and Implementation. June 2000, pp. 145-156.
The alignment constraints of SIMD memory units present a great challenge to automatic simdization. Consider the code fragment in FIG. 1, where the bases of integer arrays a, b, and c are aligned (an aligned reference means that the desired data reside at an address that is a multiple of the vector register size). Although this loop is easily vectorizable for traditional vector processors, it is non-trivial to simdize it for SIMD architectures with alignment constraints. Hence, the most commonly used policy today is to simdize a loop only if all memory references in the loop are aligned.
A very extensive discussion of alignment considerations is provided by LARSEN, Samuel, et al. Increasing and Detecting Memory Address Congruence. In Proceedings of 11th International Conference on Parallel Architectures and Compilation Techniques. September 2002. However, that reference is concerned with the detection of memory alignments and with techniques to increase the number of aligned references in a loop, whereas our work focuses on generating optimized SIMD codes in the presence of misaligned references. The two approaches are complementary. The use of loop peeling to align accesses was discussed in that reference as well as in the aforementioned BIK reference. The loop peeling scheme is equivalent to the eager-shift policy with the restriction that all memory references in the loop must have the same misalignment. Even under this condition, our scheme has the advantage of generating a simdized prologue and epilogue, which are a by-product of peeling from the simdized loop.
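For illustration, loop peeling under the restriction stated above (all references share the same misalignment) might look like the following C sketch. The function names peel_count and peeled_add, the vector length of four integers, and the loop body are assumptions chosen for the example and are not taken from the cited references. The inner four-element loop stands in for one aligned vector load, add, and store.

```c
enum { V = 4 };  /* ints per 16-byte vector register (assumed) */

/* Number of scalar iterations to peel so that the next access lands on a
   vector boundary, given a common element offset `off` for all references. */
static int peel_count(int off)
{
    return (V - off % V) % V;
}

/* Computes a[i+off] = b[i+off] + c[i+off] for i in [0, count).
   The array bases are assumed aligned, so every reference is misaligned
   by the same `off` elements. */
static void peeled_add(int *a, const int *b, const int *c, int off, int count)
{
    int i = 0;
    int peel = peel_count(off);
    if (peel > count)
        peel = count;

    for (; i < peel; i++)               /* scalar prologue (peeled) */
        a[i + off] = b[i + off] + c[i + off];

    for (; i + V <= count; i += V)      /* steady state: aligned chunks */
        for (int j = 0; j < V; j++)     /* models one vload/vadd/vstore */
            a[i + off + j] = b[i + off + j] + c[i + off + j];

    for (; i < count; i++)              /* scalar epilogue */
        a[i + off] = b[i + off] + c[i + off];
}
```

Note that the prologue and epilogue in this sketch remain scalar, which is the cost the stream-based scheme avoids by simdizing them as well.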
Direct code generation for misaligned references has been discussed in several prior works. The vectorization of misaligned loads and stores using the VIS instruction set is described in CHEONG, Gerald, et al. An Optimizer for Multimedia Instruction Sets. In Second SUIF Compiler Workshop. August 1997. The aforementioned BIK, et al. reference described a specific code sequence of aligned loads and a shuffle to load memory references that cross cache line boundaries, which is implemented in Intel's compiler for SSE2. However, their method is not discussed in the context of general misalignment handling.
The VAST compiler, a commercial product by Crescent Bay Software, has some limited ability to simdize loops with multiple misaligned references, unknown loop bounds, and runtime alignments, and to exploit reuse when aligning a stream of contiguous memory. The VAST compiler, however, produces less than optimal simdized code, as its highly generalized scheme for handling misalignment can produce additional compilation overhead.
An interesting simdization scheme using indirect register accesses is discussed in the aforementioned NAISHLOS, et al. reference. However, their method is specific to the eLite processor, which supports more advanced vector operations (such as gather and scatter operations) than are available on typical MME processors. In SHIN, Jaewook, et al. Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures. In Proceedings of International Conference on Parallel Architectures and Compilation Techniques. September 2002, register packing and shifting instructions were used to exploit temporal and spatial reuse in vector registers. However, their work does not address alignment handling.
Another work of note, although it is in the area of compiling for distributed memory systems as opposed to SIMD architectures, is CHATTERJEE, Siddhartha, et al. Modeling Data-Parallel Programs with the Alignment-Distribution Graph. J. of Programming Languages. 1994, vol. 2, no. 3, pp. 227-258.
In the incorporated U.S. patent application Ser. No. 10/862,483 (hereinafter EICHENBERGER), a generic alignment handling framework that simdizes loops with arbitrary misalignments is disclosed. According to this framework, contiguous data accessed in a loop are viewed as streams, and aligning data to satisfy alignment constraints is modeled as shifting streams. Consider, for example, the (C-language) loop in FIG. 1, where the bases of arrays a, b, and c are aligned (an aligned reference means that the desired data reside at an address that is a multiple of the vector register size). The grey boxes in FIG. 2A highlight the three memory streams represented by references a[i+2], b[i+1], and c[i+3] over the lifetime of the loop. Focusing on the first value of each stream, i.e., the data accessed by the i=0 loop iteration, one can see from FIG. 2A that the a[2], b[1], and c[3] values are all misaligned with respect to each other. A valid simdization requires the streams involved in a computation to have matching alignments. This condition can be satisfied by realigning misaligned streams using stream shift operations. FIG. 2B shows a minimum cost simdization of the loop in FIG. 1 that involves two shifts, which respectively shift the b[i+1] and c[i+3] memory streams to the alignment of the a[i+2] memory stream. The three streams then have the same alignment, satisfying the alignment constraints of the vadd and vstore operations.
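The stream-shift idea can be sketched in plain C as a scalar simulation. This is not the code of FIG. 2B: the loop body a[i+2] = b[i+1] + c[i+3] is inferred from the three references named above, the vector length of four integers is an assumption, and the helper names vload, stream_shift, and simdized_add are hypothetical. Each call to vload models one aligned vector load, and the partial stores at both ends are modeled with a per-lane range check.

```c
#include <assert.h>

enum { V = 4 };  /* ints per 16-byte vector register (assumed) */

/* Aligned load of chunk number `chunk` from x (n elements).
   Lanes outside the array read as 0; such lanes are never stored. */
static void vload(const int *x, int n, int chunk, int out[V])
{
    for (int j = 0; j < V; j++) {
        int idx = chunk * V + j;
        out[j] = (idx >= 0 && idx < n) ? x[idx] : 0;
    }
}

/* Stream shift: lane j of `out` receives x[k*V + j + delta], -V < delta < V,
   built from two adjacent aligned loads and a lane select. */
static void stream_shift(const int *x, int n, int k, int delta, int out[V])
{
    int lo[V], hi[V];
    int first = (delta < 0) ? k - 1 : k;        /* chunk feeding lane 0 */
    int off = (delta < 0) ? delta + V : delta;  /* start lane in that chunk */
    vload(x, n, first, lo);
    vload(x, n, first + 1, hi);
    for (int j = 0; j < V; j++)
        out[j] = (off + j < V) ? lo[off + j] : hi[off + j - V];
}

/* Simdized form of: for (i = 0; i < count; i++) a[i+2] = b[i+1] + c[i+3]; */
static void simdized_add(int *a, const int *b, const int *c, int n, int count)
{
    assert(n % V == 0 && count + 3 <= n);
    for (int k = 0; k * V < count + 2; k++) {
        int vb[V], vc[V];
        stream_shift(b, n, k, -1, vb);  /* b stream: offset 1 -> offset 2 */
        stream_shift(c, n, k, +1, vc);  /* c stream: offset 3 -> offset 2 */
        for (int j = 0; j < V; j++) {
            int idx = k * V + j;        /* element of a held in lane j */
            if (idx >= 2 && idx < count + 2)  /* partial store at the ends */
                a[idx] = vb[j] + vc[j];
        }
    }
}
```

Only the b and c streams are shifted in this sketch. The a stream keeps its own offset, so the store needs no shift, which corresponds to the two-shift arrangement described for FIG. 2B.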
Although runtime alignment is handled in the framework of EICHENBERGER, it is not handled as efficiently as compile-time alignment. Due to code generation issues, stream shifts must be implemented as either stream shift left or stream shift right. In the presence of runtime alignment, the relative alignment of two streams is unknown at compile time, so the direction of the shift between them cannot be chosen. In such cases, the approach taken in EICHENBERGER is to shift left each input memory stream to the leftmost position (register offset 0), perform the computation, and shift right the result to the store memory alignment. For example, in the loop in FIG. 1, this runtime shift policy is equivalent to the simdization shown in FIG. 2C, where 3 shifts are required instead of 2, increasing the alignment overhead by 50%.
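The shift counts quoted above can be checked with a small C sketch. The function names are hypothetical, and the compile-time count assumes the simple policy of shifting every load stream directly to the store stream's offset, which yields the two shifts described for this loop but is not claimed to be minimal for every loop.

```c
#include <stddef.h>

/* Shifts needed when all offsets are known at compile time and each load
   stream is shifted directly to the store stream's offset. A stream that
   already matches the store offset needs no shift. */
static int shifts_compile_time(const int *load_off, size_t nloads,
                               int store_off)
{
    int shifts = 0;
    for (size_t i = 0; i < nloads; i++)
        if (load_off[i] != store_off)
            shifts++;
    return shifts;
}

/* Shifts emitted when offsets are known only at run time: every load
   stream is shifted left to offset 0 and the result is shifted right to
   the store offset. All of them are emitted unconditionally because the
   compiler cannot tell which ones would be no-ops. */
static int shifts_runtime_zero(size_t nloads)
{
    return (int)nloads + 1;
}
```

With load offsets 1 and 3 (for b[i+1] and c[i+3]) and a store offset of 2 (for a[i+2]), the first function gives 2 and the second gives 3, which is the 50% increase mentioned above.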
In addition, existing technologies fail to adequately address the issue of data-length conversion in the generation of vectorized code for SIMD processors, where the source or destination data streams are misaligned with respect to each other. For example, one may write a loop that adds a vector of 16-bit “short” integers to a vector of 32-bit “long” integers to obtain a result that is a vector of 32-bit integer values (e.g., the case where b is an array of short integers and a and c are arrays of long integers in the loop of FIG. 1).
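A scalar C sketch of such a mixed-length loop is given here for illustration. The loop body is again inferred from the references a[i+2], b[i+1], and c[i+3] named earlier, the fixed-width types int16_t and int32_t stand in for the "short" and "long" integers, and the helper names and the 16-byte register length are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum { VLEN = 16 };  /* vector register length in bytes (assumed) */

/* Scalar reference loop: b holds 16-bit values, a and c hold 32-bit values.
   Each b element is widened to 32 bits before the addition. */
static void mixed_add(int32_t *a, const int16_t *b, const int32_t *c,
                      size_t count)
{
    for (size_t i = 0; i < count; i++)
        a[i + 2] = (int32_t)b[i + 1] + c[i + 3];
}

/* Number of elements of a given size that fit in one vector register. */
static size_t elems_per_register(size_t elem_size)
{
    return VLEN / elem_size;
}

/* Byte offset within a register of element `index` of an aligned array. */
static size_t byte_offset(size_t index, size_t elem_size)
{
    return (index * elem_size) % VLEN;
}
```

One register holds eight 16-bit elements but only four 32-bit elements, so one vector of b supplies two vectors' worth of results. The first elements of the three streams also sit at different byte offsets (2 for b[1], 12 for c[3], and 8 for a[2]), so both the element length and the alignment have to be reconciled.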
Thus, what is needed is a method for automatically simdizing sequential program code into parallelized SIMD code in the presence of vector misalignments that are undefined at compile-time and where a conversion between datatypes of different lengths is needed. The present invention provides a solution to these and other problems, and offers other advantages over previous solutions.