1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system and method for domain stretching for an advanced dual representation polyhedral loop transformation framework.
2. Description of Related Art
Generating computer code that is efficiently processed (i.e., “optimized”) is one of the most important goals in software design and execution. Computer code which performs the desired function accurately and reliably but too slowly (i.e., code which is not optimized) is often discarded or unused by computer users.
As those of ordinary skill in the art are aware, most source code (i.e., code which is in a human readable form) is typically converted into object code, and thereafter into an executable application, by use of a compiler and a linker. The executable application is in a form and language that is machine readable (i.e., capable of being interpreted and executed by a computer). Other languages, such as Java available from Sun Microsystems, Inc. of California, USA, may be in source code form that is, on execution, transformed into a form understood by a computer system which then executes the transformed instructions. In any case, the source code, when transformed into a form capable of being understood and executed by a computer system, is frequently optimized. That is, a transformation is performed such that the instructions are performed more efficiently (i.e., optimized) and, ideally, without any undue delay.
One common structure found in source code that is optimized during the compilation process, in which source code is transformed into executable code, is the loop. Loops are used to repeat one or more operations or instructions. Loops may be provided as single, non-nested loops, or as nested loops, i.e. loops within loops. For example, an array may be used to store the purchase price of individual articles (e.g., where the ith element in the array A is denoted, in Fortran, as A(i); similar notations are used in other languages). To total the prices, a computer programmer could write a single instruction to add each of the purchase prices together (e.g., sum=A(1)+A(2)+ . . . +A(n)). This, however, would take the programmer some time to code and is not easily adapted to the situation where the computer programmer does not know, at development time, the number of articles in the array, that is, when the number of elements in the array can only be determined at run time (i.e., during execution). Accordingly, the loop was developed to repeat an operation (e.g., sum=sum+A(i)) where the induction variable, i, is changed for each iteration. Other forms of loops are known and are equally applicable.
However, when the instructions of a loop are transformed into machine readable code (e.g., executable code), the executed instructions may not be processed efficiently. For the example above, some computer systems may require that the processor fetch from memory, rather than from a register or cache memory, the various elements of the array “A”. Fetching data from memory requires the processor to wait while the data is retrieved thereby increasing the latency of the program execution. Also, while loops may be an efficient way to write certain repetitive source code operations, a loop does insert additional operations that would not be present if the repetitive operations were replicated. These additional operations (e.g., branching operations) are considered to be the loop “overhead”.
To address some of the inefficiencies in processing loops, various optimization techniques have been created and applied. Examples of these various optimization techniques include loop inversion, loop skewing, loop tiling, unrolling and jamming, and the like. For example, with unrolling and jamming (hereinafter “unrolling”) a portion of the loop is replicated, or “unrolled,” and the replicated portions are inserted, or “jammed,” into the code. Typically, when the unroll and jam loop transformation technique is applied to the outer loop of a nested loop pair, the outer loop's induction variable (e.g., “i”) is advanced only a few times during the unrolling portion of this optimization technique, rather than completely; the number of times is governed by a parameter referred to as the unroll factor (UF). During the jamming portion of this technique, the inner loop body is replicated “UF” times. Persons of ordinary skill in the art will appreciate that the replicated loop bodies are not identical but only similar. In the replicated loop bodies, portions of the loop bodies which use the induction variable of the outer loop will be advanced as required (e.g., if the loop body included reference to array element A(i), where “i” is the outer loop induction variable, a replicated loop body would include reference to the next required array element, A(i+1)). The unroll and jam technique effectively reorders the calculations being performed in the nested loop.
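By way of illustration only, the unroll and jam technique described above may be sketched as follows for an unroll factor of 2. The function names `original` and `unroll_and_jam`, the arrays `a` and `b`, and the particular loop body are hypothetical and are chosen only to make the replicated body and the advanced induction variable visible; an actual compiler performs this transformation on its intermediate representation rather than on source text.

```python
UF = 2  # unroll factor assumed for this sketch


def original(a, b):
    """Reference nested loop: out[i] accumulates a[i] * b[j] over all j."""
    n, m = len(a), len(b)
    out = [0] * n
    for i in range(n):
        for j in range(m):
            out[i] += a[i] * b[j]
    return out


def unroll_and_jam(a, b):
    """Same computation with the outer loop unrolled by UF and the copies jammed."""
    n, m = len(a), len(b)
    out = [0] * n
    main_end = n - n % UF  # largest multiple of UF not exceeding n
    for i in range(0, main_end, UF):
        for j in range(m):
            out[i] += a[i] * b[j]          # original loop body
            out[i + 1] += a[i + 1] * b[j]  # replicated body, i advanced by 1
    # Remainder loop for the iterations left over when n is not a multiple of UF.
    for i in range(main_end, n):
        for j in range(m):
            out[i] += a[i] * b[j]
    return out
```

In this sketch the outer loop branch is taken roughly half as often, and each value b[j] is used by two statements in the same inner iteration, which is the reordering of calculations referred to above.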
Typically, such optimizations are performed with regard to a compiler's intermediate representation of the source code, e.g., an abstract syntax tree. The abstract syntax tree is a finite, labeled, directed tree, where the internal nodes are labeled by operators, and the leaf nodes represent the operands of the operators. The abstract syntax tree (AST) is used in a parser as an intermediate between a parse tree and a data structure, the latter of which is often used as a compiler or interpreter's internal representation of a computer program while it is being optimized and from which code generation is performed. ASTs are usually not appropriate for complex program restructuring since, while simple optimizations such as constant folding or scalar replacement may be achieved without hard modifications of the data structures, more complex transformations such as loop inversion, skewing, tiling, etc., modify the execution order, which is far away from the syntax. See Cedric Bastoul, “Code Generation in the Polyhedral Model is Easier Than You Think,” PACT '13 IEEE International Conference on Parallel Architecture and Compilation Techniques, pages 7-18, Juan-les-Pins, September 2004, which is hereby incorporated by reference.
The polyhedral model, which is based on a linear algebraic representation of programs and transformations, was developed to address this issue. See Bastoul et al. “Putting Polyhedral Loop Transformations to Work,” LCPC'16 International Workshop on Languages and Compilers for Parallel Computers, LNCS 2958, pages 209-225, College Station, October 2003, which is hereby incorporated by reference. The polyhedral model is basically a plugin to the conventional compilation process. It starts from the AST by translating the program parts that fit the model into a linear-algebraic representation. A new execution order is then selected by using a reordering function, e.g., using a schedule, placement or chunking function. Then, in a code generation step, an AST or new source code is returned that implements the execution order implied by the reordering function.
As an example of the polyhedral transformation consider the syntactic form of a polynomial multiplication kernel as represented in FIG. 1A. See Vasilache et al., “Polyhedral Code Generation in the Real World,” INRIA, 2006, available at the INRIA website. This example is concerned only with the control aspects of the program source code with the two computational statements (array assignments) being referred to herein by their names S1 and S2. The polyhedral transformation model considers statement instances. For each statement, the iteration domain where every statement instance belongs is considered. The iteration domains are described using affine constraints that can be extracted from the program control. For example, the iteration domain of statement S1, referred to as DS1, is the set of values (i) such that 2≦i≦n. As shown in FIG. 1B, a matrix representation is used to represent such constraints: A*x+Ap*p≧0, where A is the iteration matrix, x is the iteration vector (composed of the loop counters), Ap is the parameter matrix and p is the parameter vector (composed of the unknown constants and the scalar 1). Thus, in the example of FIGS. 1A and 1B, DS1 is characterized by:
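\[
\begin{bmatrix} 1 \\ -1 \end{bmatrix} \cdot (i) + \begin{bmatrix} 0 & -2 \\ 1 & 0 \end{bmatrix} \cdot \begin{pmatrix} n \\ 1 \end{pmatrix} \geq 0.
\]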
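As a minimal sketch, the constraint form A*x+Ap*p≧0 may be evaluated row by row to decide whether a given iteration belongs to a domain. The matrices below encode the two constraints of DS1 stated above (i−2≧0 and n−i≧0); the names `in_domain`, `A_S1`, `AP_S1` and `ds1_points` are hypothetical and are used only for this illustration.

```python
def in_domain(A, Ap, x, p):
    """Return True if every row of A*x + Ap*p is non-negative."""
    for row_a, row_p in zip(A, Ap):
        value = sum(c * v for c, v in zip(row_a, x))
        value += sum(c * v for c, v in zip(row_p, p))
        if value < 0:
            return False
    return True


# Iteration matrix and parameter matrix for DS1, with p = (n, 1).
A_S1 = [[1], [-1]]
AP_S1 = [[0, -2], [1, 0]]


def ds1_points(n):
    """Enumerate DS1 by testing a candidate range slightly wider than the domain."""
    return [i for i in range(-1, n + 3) if in_domain(A_S1, AP_S1, [i], [n, 1])]
```

For example, `ds1_points(5)` evaluates to `[2, 3, 4, 5]`, which is the set of values i such that 2≦i≦5.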
In this framework, a transformation is a set of affine scheduling functions written θ(x)=T*x+Tp*p. Each statement has its own scheduling function which maps each runtime statement instance to a logical execution time. In the polynomial multiplication example of FIGS. 1A and 1B, an optimizer may notice a locality problem and discover a good data reuse potential over array z, then suggest θS1(i)=(i) and
\(\theta_{S2}\begin{pmatrix} i \\ j \end{pmatrix} = (i + j + 1)\) to achieve better locality. See Bastoul et al., “Improving Data Locality by Chunking,” CC '12 Intl. Conf. on Compiler Construction, LNCS 2622, pages 320-335, Warsaw, April 2003, which is hereby incorporated by reference, for a method to compute such functions. The intuition behind such a transformation is to execute consecutively the instances of S2 having the same i+j value (thus accessing the same array element of z) and to ensure that the initialization of each element is executed by S1 just before the first instance of S2 referring to this element. A transformation is applied in the polyhedral model by using the transformation formula shown in FIG. 1C, where t is the time-vector, i.e. the vector of the scheduling dimensions. The resulting polyhedra, for the example, are shown in FIG. 1D with the additional dimension t.
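The effect of the two scheduling functions may be sketched by mapping each statement instance to its logical execution time and sorting on that time. The sketch assumes, for illustration only, that S2 iterates over 1≦i≦n and 1≦j≦n and accesses element i+j of z; the domain of S2 is not stated in the text above, and the function names are hypothetical.

```python
def theta_s1(i):
    """Schedule of S1: instance (i) runs at logical time (i)."""
    return (i,)


def theta_s2(i, j):
    """Schedule of S2: instance (i, j) runs at logical time (i + j + 1)."""
    return (i + j + 1,)


def scheduled_instances(n):
    """Return (time, statement, iteration vector) triples in execution order."""
    # DS1 is 2 <= i <= n, as given in the text.
    instances = [(theta_s1(i), "S1", (i,)) for i in range(2, n + 1)]
    # Assumed domain of S2: 1 <= i, j <= n (not stated in the text).
    instances += [
        (theta_s2(i, j), "S2", (i, j))
        for i in range(1, n + 1)
        for j in range(1, n + 1)
    ]
    # Sorting on the time tuple yields the order implied by the schedule.
    return sorted(instances)
```

Under these assumptions, the S2 instances that share a value of i+j become adjacent in the sorted order, and the S1 instance for element k, scheduled at time k, precedes the S2 instances for that element, which are scheduled at time k+1.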
Once the transformation has been applied in the polyhedral model, one needs to generate the target code. A syntax tree construction scheme, which may consist of a recursive application of domain projections and separations, such as described in Bastoul “Code Generation in the Polyhedral Model is Easier Than You Think” and Quillere et al., “Generation of Efficient Nested Loops from Polyhedra,” International Journal of Parallel Programming, 28(5):469-496, October 2000, is applied to the transformation. The final code is deduced from the set of constraints describing the polyhedra attached to each node in the AST.
In the above example, the first step is a projection onto the first dimension t, followed by a separation into disjoint polyhedra as shown on the top of FIG. 2A. This builds the first loop level of the target code (the loops with iterator t shown in FIG. 2B). The same process is applied onto the first two dimensions (on the bottom of FIG. 2A) to build the second loop level, and so on. The final code is shown in FIG. 2B. Note that the separation step for two polyhedra needs three operations: DS1−DS2, DS2−DS1, and DS2∩DS1. Thus, for n statements, the worst case complexity is 3^n.
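The three operations of the separation step may be sketched on explicit sets of integer points. This is a simplification made only for illustration: code generators of the kind cited above operate on constraint systems rather than enumerated points. The sketch again assumes that S2 iterates over 1≦i≦n and 1≦j≦n, and the function names are hypothetical.

```python
def separate(d1, d2):
    """Split two point sets into the three disjoint pieces: d1-d2, d2-d1, d1&d2."""
    return d1 - d2, d2 - d1, d1 & d2


def projected_times(n):
    """Project both scheduled domains onto the first dimension t."""
    t_s1 = {i for i in range(2, n + 1)}            # theta_S1(i) = i
    t_s2 = {i + j + 1                              # theta_S2(i, j) = i + j + 1
            for i in range(1, n + 1)
            for j in range(1, n + 1)}
    return t_s1, t_s2
```

For n=4, `separate(*projected_times(4))` yields the values of t at which only S1 executes ({2}), at which only S2 executes ({5, 6, 7, 8, 9}), and at which both execute ({3, 4}). Each piece corresponds to a separate stretch of the outermost loop under the stated assumptions.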
The polyhedral loop transformation-based approach to compiler optimization addresses several weaknesses of the traditional loop-based approaches to source code optimization. The polyhedral loop transformation approach addresses non-perfectly nested loops, has a precise instance-wise representation of data dependencies, and naturally supports compositions of complex transformations. As a result, it can detect more parallelism and exploit more data locality for more complex loop nests than the traditional loop-based approaches.
However, while the polyhedral loop transformation-based approach provides improved optimization of source code during the compilation process, it is not more widely used because of two main drawbacks. First, the code that is generated from the polyhedral representation is not always optimal with regard to some optimization criteria. This means that code that has excellent properties, such as data-parallelism (meaning that the work within a given loop or set of loops is data parallel and thus can be computed in parallel by possibly multiple threads on possibly multiple processors) and data locality (meaning that the work generated by a given loop or set of loops often reuses the same set of data or a set of data that is collocated in memory), may be slowed down because of sub-par scalar performance (meaning that the generated code has high overhead due to unnecessary checks, branches, loop bound computations, and/or any other overheads) and/or unnecessary code bloat, i.e. an increase in the size of the code due to compiler optimizations being run on the source code. Second, transformations applied to a statement by current polyhedral loop transformation approaches necessarily touch all instances of a given statement. This means that, for example, it is hard to express parallelism for a statement that is partially parallel, i.e. a statement that is parallel in all but a few boundary instances. Similarly, for data locality enhancement, requiring that tiling be performed on all instances of a statement, including the rarely executed boundary conditions, results in unnecessary code bloat as well as increased loop overhead. Tiling is a loop optimization that aims at increasing the data locality of a computation by cutting a large set of computations, e.g. a 2 dimensional computation iterating over 0-1023 times 0-1023, into smaller sets of computations on smaller tiles, e.g. 0-63×0-63, where once the first tile is completed, one may then iterate over the second tile, e.g. 0-63×64-127, with this operation repeating with subsequent tiles until all of the original computation is completed.
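The tiled traversal described above may be sketched as a generator that visits a square two-dimensional iteration space one tile at a time. The name `tiled_order` and its parameters are hypothetical; the `min` bounds handle the case, not discussed above, where the tile size does not divide the extent evenly.

```python
def tiled_order(n, tile):
    """Yield (i, j) over an n x n space, one tile x tile block at a time."""
    for ti in range(0, n, tile):            # tile origin in the first dimension
        for tj in range(0, n, tile):        # tile origin in the second dimension
            for i in range(ti, min(ti + tile, n)):
                for j in range(tj, min(tj + tile, n)):
                    yield (i, j)
```

With n=1024 and tile=64, the first 4096 points produced cover 0-63×0-63 and the next 4096 cover 0-63×64-127, matching the order given above. Every point of the original space is still visited exactly once; only the order of the visits changes.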