1. Technical Field
The present invention relates in general to a system and method for vectorizing loop code for execution on Single Instruction Multiple Datapath (SIMD) architectures that impose strict alignment constraints on the data.
2. Description of the Related Art
Multimedia extensions (MMEs) have become one of the most popular additions to general-purpose microprocessors. Existing multimedia extensions can be characterized as Single Instruction Multiple Datapath (SIMD) units that support packed fixed-length vectors. The traditional programming model for multimedia extensions has been explicit vector programming using either (in-line) assembly or intrinsic functions embedded in a high-level programming language. Explicit vector programming is time-consuming and error-prone. A promising alternative is to exploit vectorization technology to automatically generate SIMD codes from programs written in standard high-level languages.
Although vectorization was studied extensively for traditional vector processors decades ago, vectorization for SIMD architectures has raised new issues due to several fundamental differences between the two architectures. See, e.g., REN, Gang, et al. A Preliminary Study on the Vectorization of Multimedia Applications. In 16th International Workshop of Languages and Compilers for Parallel Computing. October 2003. To distinguish between the two types of vectorization, we refer to the latter as simdization. One such fundamental difference comes from the memory unit. The memory unit of a typical SIMD processor bears more resemblance to that of a wide scalar processor than to that of a traditional vector processor. In the ALTIVEC instruction set found on certain POWERPC microprocessors (produced by International Business Machines Corporation and Motorola, Inc.), for example, a load instruction loads 16 contiguous bytes from 16-byte aligned memory, ignoring the low-order 4 bits of the memory address in the instruction. The same applies to store instructions. As used herein, architectures with alignment constraints refer to machines that support only loads and stores of register-length aligned memory.
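The address truncation described above can be modeled in plain C. The following is a minimal sketch only: the helper names effective_load_address and misalignment and the constant VECTOR_BYTES are hypothetical, introduced for illustration, and the code models the address arithmetic rather than any actual ALTIVEC instruction.

```c
#include <stdint.h>

/* Assumed vector register length in bytes (16 in the example from the text). */
#define VECTOR_BYTES 16u

/* Hypothetical helper: the address an aligned 16-byte load would actually
   use, obtained by clearing the low-order 4 bits of the requested address. */
uintptr_t effective_load_address(uintptr_t addr)
{
    return addr & ~(uintptr_t)(VECTOR_BYTES - 1u);
}

/* Hypothetical helper: byte offset of addr within its aligned 16-byte block. */
unsigned misalignment(uintptr_t addr)
{
    return (unsigned)(addr & (VECTOR_BYTES - 1u));
}
```

Under this model, a request for address 0x1004 would be served from the block starting at 0x1000, and the desired data would begin 4 bytes into the loaded register.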
There has been a recent spike of interest in compiler techniques to automatically extract SIMD parallelism from programs. See, e.g., LARSEN, Samuel, et al. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of SIGPLAN Conference on Programming Language Design and Implementation. June 2000, pp. 145-156; BIK, Aart, et al. Automatic Intra-Register Vectorization for the Intel Architecture. Int. J. of Parallel Programming. April 2002, vol. 30, no. 2, pp. 65-98; KRALL, Andreas, et al. Compilation Techniques for Multimedia Processors. Int. J. of Parallel Programming. August 2000, vol. 28, no. 4, pp. 347-361; SRERAMAN, N., et al. A Vectorizing Compiler for Multimedia Extensions. Int. J. of Parallel Programming, August 2000, vol. 28, no. 4, pp. 363-400; LEE, Corinna G., et al. Simple Vector Microprocessors for Multimedia Applications. In Proceedings of International Symposium on Microarchitecture. 1998, pp. 25-36; and NAISHLOS, Dorit, et al. Vectorizing for a SIMD DSP Architecture. In Proceedings of International Conference on Compilers, Architectures, and Synthesis for Embedded Systems. October 2003, pp. 2-11. This upsurge was driven by the increasing prevalence of SIMD architectures in multimedia processors. Two principal techniques have been used, the traditional loop-based vectorization pioneered for vector supercomputers (e.g., ALLEN, John Randal, et al. Automatic Translation of Fortran Programs to Vector Form. ACM Transactions on Programming Languages and Systems. October 1987, vol. 4, pp. 491-542; and ZIMA, Hans, et al. Supercompilers for Parallel and Vector Computers. Reading, Mass.: Addison-Wesley/ACM Press, 1990. ISBN 0201175606.) and the unroll-and-pack approach first proposed by Larsen and Amarasinghe in LARSEN, Samuel, et al. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of SIGPLAN Conference on Programming Language Design and Implementation. June 2000, pp. 145-156.
The simplest form of contiguous memory stream can be identified as a single stride-one memory access, for instance, a[i], where i is the counter of the surrounding loop and increments by 1. However, there are cases in which a contiguous memory stream is formed by a combination of non-stride-one memory accesses. FIGS. 1A-1D provide four examples of such scenarios written in a C-like pseudocode.
FIG. 1A depicts a loop that iterates over an array of structured data types or “structures” (e.g., “structs” in C or “records” in Pascal), where each of the structures has two fields. Although language syntax requires that a separate assignment statement be written for each field (thus creating two non-stride-one accesses in lines 100 and 102), the actual object code generated by the program fragment shown in FIG. 1A will result in assignments to a contiguous memory stream made up of elements of type “int,” since each structure is simply a pair of adjacently-stored integers.
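A loop of the kind described for FIG. 1A might look like the following sketch. The figure itself is not reproduced here, so the structure name pair, its field names, and the stored values are assumptions chosen for illustration. The second function writes the same values as a single stride-one stream of ints, which is valid only under the stated assumption that the structure contains no padding.

```c
#include <stddef.h>

/* Hypothetical two-field structure; names and values are assumptions. */
struct pair { int x; int y; };

/* Field-by-field assignment: two non-stride-one accesses per iteration. */
void fill_by_field(struct pair *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        p[i].x = 1;
        p[i].y = 2;
    }
}

/* The same sequence of stores expressed as one stride-one stream of ints,
   assuming sizeof(struct pair) == 2 * sizeof(int) (no padding). */
void fill_as_stream(int *s, size_t n)
{
    for (size_t k = 0; k < 2 * n; k++)
        s[k] = (int)(k % 2) + 1;
}
```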
In FIG. 1B, a loop having mixed-stride memory accesses is depicted. Lines 104 and 106 access memory array “a” with two stride-two accesses, while line 108 accesses memory array “b” with a single stride-one access. Both the pair of lines 104 and 106 and line 108, however, operate over contiguous streams of memory over the iteration of the loop.
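A mixed-stride loop of the kind described for FIG. 1B could be sketched as follows. The function name mixed_stride and the stored constants are hypothetical; only the access pattern (two adjacent stride-two accesses to “a” and one stride-one access to “b”) is taken from the description.

```c
#include <stddef.h>

/* Sketch of a mixed-stride loop. a must hold 2*n ints; b must hold n ints. */
void mixed_stride(int *a, int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[2 * i]     = 1;  /* stride-two access */
        a[2 * i + 1] = 2;  /* stride-two access, adjacent to the one above */
        b[i]         = 3;  /* stride-one access */
    }
}
```

Over the whole loop, the two stores to “a” together cover a[0] through a[2n-1] with no gaps, while the store to “b” covers b[0] through b[n-1].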
FIG. 1C depicts what may be described as a manually-unrolled loop. Lines 110, 112, and 114 are each stride-three memory accesses. However, over the iteration of the loop, lines 110, 112, and 114 operate over a contiguous stream of memory. The loop depicted in FIG. 1C may be characterized as being “manually-unrolled,” since lines 110, 112, and 114 together represent three iterations of a semantically-equivalent loop. For example, the loop in FIG. 1C could have been written as
for(int i=0; i<N; i++) a[i]=(i%3)+1;
where “%” represents the “modulo” or “remainder” operator, as in the C programming language. Loop unrolling is commonly used to speed up program code written for pipelined processors, where branch instructions may incur substantial performance penalties.
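The equivalence between the manually-unrolled and rolled forms can be sketched as follows. The unrolled body is an assumption about what FIG. 1C contains, inferred from the rolled form quoted in the text; the function names are hypothetical, and the unrolled version assumes the element count is a multiple of 3.

```c
#include <stddef.h>

/* Manually-unrolled form (assumed shape of FIG. 1C): three stride-three
   stores per iteration. n is assumed to be a multiple of 3. */
void fill_unrolled(int *a, size_t n)
{
    for (size_t i = 0; i < n; i += 3) {
        a[i]     = 1;
        a[i + 1] = 2;
        a[i + 2] = 3;
    }
}

/* Rolled form quoted in the text: a single stride-one store. */
void fill_rolled(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (int)(i % 3) + 1;
}
```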
FIG. 1D depicts a nested pair of loops in which the inner loop has a short, known trip count. Although line 118, as written, relies on two index variables, “i” and “j,” because the trip count of the inner loop is known, line 118 could be unrolled into a semantically equivalent loop of the form shown in FIG. 1C.
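The relationship between the nested form and its unrolled equivalent might be sketched as below. The trip count INNER, the function names, and the stored values are assumptions for illustration, since FIG. 1D is not reproduced here.

```c
#include <stddef.h>

#define INNER 3  /* assumed short, compile-time-known inner trip count */

/* Nested form: the store is indexed by both i and j.
   a must hold INNER * n ints. */
void fill_nested(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < INNER; j++)
            a[INNER * i + j] = (int)j + 1;
}

/* Inner loop fully unrolled: three adjacent stride-three stores,
   matching the manually-unrolled shape discussed for FIG. 1C. */
void fill_inner_unrolled(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[INNER * i]     = 1;
        a[INNER * i + 1] = 2;
        a[INNER * i + 2] = 3;
    }
}
```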
A number of techniques have been proposed to allow multiple non-stride-one accesses over contiguous memory streams to be aggregated into stride-one memory accesses. Among these are “loop rerolling,” “loop collapsing,” and “loop unroll-and-pack.”
“Loop rerolling,” utilized in the VAST compiler, rerolls isomorphic statements with adjacent memory accesses into a loop where all accesses in the statements are stride-one. A major drawback of this approach is that loop rerolling introduces a new innermost loop to the enclosing loop nest, so that the original innermost loop is no longer innermost. Since simdization usually happens at the innermost loop only, rerolling reduces the scope of simdization. Further, loop rerolling requires the presence of a loop counter for the innermost loop, so it cannot reroll loops such as in FIG. 1A, where the individual memory accesses are addressed by fields in a structure, rather than by an array index.
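The effect of rerolling on loop structure can be seen in the following sketch, which is illustrative only and is not taken from any particular compiler. The function names and the doubling operation are hypothetical.

```c
#include <stddef.h>

/* Before rerolling: two isomorphic statements on adjacent elements.
   a and b must each hold 2*n ints. */
void scale_unrolled(int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[2 * i]     = b[2 * i] * 2;
        a[2 * i + 1] = b[2 * i + 1] * 2;
    }
}

/* After rerolling: the accesses are stride-one in j, but the i loop
   is no longer the innermost loop of the nest. */
void scale_rerolled(int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < 2; j++)
            a[2 * i + j] = b[2 * i + j] * 2;
}
```

In the rerolled form the new innermost loop has a trip count of only 2, which illustrates why restricting simdization to the innermost loop narrows its scope.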
“Loop collapsing” is described in SMITH, Kevin et al. Support for the Intel Pentium 4 Processor with Hyper-Threading Technology in Intel 8.0 Compilers. In Intel Technology Journal, Feb. 18, 2004. In “loop collapsing,” isomorphic statements are first virtually rerolled into a loop. Then, the rerolled loop is collapsed into the innermost surrounding loop. A drawback of this technique is that loop collapsing must virtually reroll all statements in the original innermost loop. Thus, loop collapsing cannot collapse loops such as in FIG. 1B, where there are mixed-stride accesses.
“Loop unroll-and-pack” is described in LARSEN, Samuel, et al. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of SIGPLAN Conference on Programming Language Design and Implementation. June 2000, pp. 145-156. This technique finds SIMD parallelism within a basic block (and is thus limited to a single iteration of a loop) by packing isomorphic operations on adjacent memory accesses into SIMD instructions. A drawback of this approach is that it does not explore the contiguous memory stream across iterations. Thus, it can be very inefficient in handling misaligned memory accesses. Another drawback of this approach is that it may require additional loop unrolling to be effective.
In published patent application US 20040006667 (BIK et al.) 2004-1-8, a technique is disclosed whereby non-unit-stride memory accesses are packed into vectors using “gather” and “scatter” instructions. Gather and scatter instructions are commonly found on traditional vector and SIMD architectures and are used to pack and unpack values to and from a vector representation in the processor. The inclusion of additional gather, scatter, and shuffle instructions into the loop body, as in the BIK application, limits the performance attainable by this method. Moreover, the BIK application's technique requires that the vectors in question be of a size that is a multiple of the physical vector size, which may force a compiler to perform an unnecessary degree of loop unrolling. Finally, the BIK reference fails to recognize that certain groups of adjacent non-unit-stride memory accesses may in fact be equivalent to an iterated unit-stride memory access; this causes the BIK simdization method to introduce unnecessary gather, scatter, and shuffle instructions in certain instances.
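The observation that adjacent non-unit-stride accesses can amount to a unit-stride access can be illustrated with a scalar model. The sketch below is not the method of the BIK application; the functions gather and interleave and the constant VLEN are hypothetical stand-ins for hardware gather and shuffle operations, written as ordinary loops.

```c
#include <stddef.h>

#define VLEN 4  /* assumed number of ints per vector register */

/* Scalar model of a strided gather: collects VLEN elements spaced
   stride elements apart, starting at src. */
void gather(const int *src, size_t stride, int out[VLEN])
{
    for (size_t k = 0; k < VLEN; k++)
        out[k] = src[k * stride];
}

/* Scalar model of a shuffle that interleaves two gathered vectors. */
void interleave(const int even[VLEN], const int odd[VLEN], int out[2 * VLEN])
{
    for (size_t k = 0; k < VLEN; k++) {
        out[2 * k]     = even[k];
        out[2 * k + 1] = odd[k];
    }
}
```

Gathering the even-indexed and odd-indexed elements of a contiguous run of 2 * VLEN ints and then interleaving the results reproduces the original run, so in such a case the gather and shuffle steps add work without changing the data that ends up in the registers.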
Thus, what is needed is an efficient, general-purpose scheme for aggregating multiple non-stride-one memory references into simdized code. The present invention provides a solution to these and other problems, and offers other advantages over previous solutions.