Dependency analysis is a set of legality rules, determined by a compiler, that constrain the order of memory references. For example, memory reference B is considered to be dependent on memory reference A if B follows A in a serial execution of a program and both A and B reference the same memory location. Memory reference dependencies constrain memory references to occur in the order required by the semantics of the programming language. They include an output dependence, which constrains the order in which two assigns occur; an anti-dependence, which constrains a use to precede an assign; a flow dependence, which constrains an assign to precede a use; and an input dependence, which constrains the order in which two uses occur. Other dependencies include a control dependence, which constrains an operation to follow a test that determines whether the flow of control will make it possible for the operation to be executed, and an operation dependence, which constrains an operation to follow its inputs.
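As an illustration only, the four memory reference dependencies can be sketched as a small classification routine. The function name classify_dependence and its boolean parameters are hypothetical and are not part of any compiler described here; the sketch assumes both references touch the same memory location and that the first reference precedes the second in serial execution.

```python
def classify_dependence(first_is_assign: bool, second_is_assign: bool) -> str:
    """Name the dependence between two references to the same location.

    Assumption: the first reference precedes the second in serial execution.
    """
    if first_is_assign and second_is_assign:
        return "output"  # assign followed by assign
    if first_is_assign:
        return "flow"    # assign followed by use
    if second_is_assign:
        return "anti"    # use followed by assign
    return "input"       # use followed by use


if __name__ == "__main__":
    for pair in [(True, True), (True, False), (False, True), (False, False)]:
        print(pair, classify_dependence(*pair))
```

An input dependence does not constrain correctness in the way the other three do, but it is classified here because the text lists it alongside them.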
FIG. 1 illustrates the general structure of a typical compilation environment 10 wherein there is source file 11 that comprises a program written by some user in some high level language. File 11 is processed by compiler 12 into object file 13, which typically consists of a sequence of machine instructions that are the result of translating the high level source statements in source file 11. Object file 13 is then processed by linker program 14, which combines object file 13 with other object files 15, which resulted from other source files (not shown), to produce executable program 16. Executable program 16 is then eligible for direct execution on computer 17. Thus, the program reads some input 18 and performs some processing, and generates some output 19. The dependency analysis and loop optimizations are typically implemented as part of the compiler shown in FIG. 1.
FIG. 2 depicts a view of the internal structure of an optimizing version of compiler 12 of FIG. 1. This type of compiler not only translates source file 11 into object file 13, but also attempts to improve the run-time performance of the created object file. The compiler begins with source file 11. The source file is read in and checked for syntax errors or semantic errors by front end 21 of the compiler. Assuming that there are no errors, the compilation proceeds with front end 21 generating intermediate representation 22. Optimizer 23 attempts to improve the structure of intermediate representation 22, and thereby increase run-time performance, by performing transformations that would allow the code to be executed faster. The final step involves generating object file 13, which is typically done at the very back end of the compiler by object file generator 24.
FIG. 3 depicts the internal structure of optimizer 23 that is shown in FIG. 2. The optimizer begins with the unoptimized low level intermediate representation 31 for each procedure being compiled and generates the optimized intermediate representation 35 for each procedure. The first phase is analysis 32 of the intermediate representation to determine which optimizations can be performed. This includes recognizing the loop structure in the procedure that is being compiled, and performing dependency analysis. The second phase is to perform various optimizations on the code, including loop optimizations, and to update the distance vectors where possible. When it is no longer possible to perform further optimization, whether because a transformation would be illegal, because the distance vectors have become unmanageable, or because optimization has been completed, the optimizer performs post-optimization phases 34 such as instruction scheduling and register allocation. The result is optimized intermediate representation 35, which is then processed by object file generator 24 into compiled object code 13.
Each node in the internal representation of a program represents potentially many different run-time references. The family of dependencies from node NA to node NB is the set of dependencies between potential run-time references to node NA and run-time references to node NB. Arcs that the compiler draws between nodes are representations of such families. For example, for the loop shown in CODE BLOCK 1, the compiler draws and labels an arc from x(i+1)= to x(i) to represent the family of dependencies shown in CODE BLOCK 2. ##EQU1##
A memory reference occurs within some number of surrounding loops. An iteration vector identifies the iterations of those loops on which execution of a particular node causes a specific memory reference. If there are n surrounding loops for the node, the iteration vector for a reference is an n-tuple. Each component of the n-tuple is the iteration number for the corresponding surrounding loop, wherein the outermost loop corresponds to the first component, and so on. For example, CODE BLOCK 3 depicts a program fragment with a 3-deep nested loop, specifically Do i, Do j, Do k, wherein each loop steps by one and the upper bound limit for each loop is n. The loop body consists of a single reference to a three dimensional array element for an array called x. The first dimension subscript is i+1, the second is j+2, and the third is k+3. The iteration vector for the assignment x(3,7,4)= is (1,4,0) for a 0-origin iteration vector. For a 1-origin iteration vector, the vector would be (2,5,1). Using the 0-origin convention accommodates all programming languages; for FORTRAN, which uses 1-origin, the vector would be skewed accordingly. ##EQU2##
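The arithmetic in the x(3,7,4) example can be checked with a minimal sketch. The helper name iteration_vector and its parameters are hypothetical; the sketch assumes that every loop starts at 1, steps by one, and that each subscript is a loop index plus a constant, as in the fragment described above.

```python
def iteration_vector(target, offsets, lower=1, origin=0):
    """Solve subscript = index + offset for each loop, then rebase.

    target  : subscripts of the referenced element, outermost loop first
    offsets : constant added to each loop index in the subscript
    lower   : lower bound of every loop (assumed to be 1)
    origin  : 0 for a 0-origin vector, 1 for a 1-origin vector
    """
    return tuple(t - off - lower + origin for t, off in zip(target, offsets))


if __name__ == "__main__":
    # x(i+1, j+2, k+3) touching element x(3,7,4)
    print(iteration_vector((3, 7, 4), (1, 2, 3), origin=0))  # (1, 4, 0)
    print(iteration_vector((3, 7, 4), (1, 2, 3), origin=1))  # (2, 5, 1)
```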
A distance vector is the difference between two iteration vectors. For example, CODE BLOCK 4 depicts a program fragment with a 2-deep nested loop, specifically Do i, Do j, wherein each loop steps by one and the upper bound limit for each loop is n. The loop body consists of a single assignment to a two dimensional array y, wherein y(i+3, j-1)=y(i,j). The iteration vectors for the assignment y(4,4)= and the use y(4,4) are (0,4) and (3,3), respectively. The distance vector is (3,3) minus (0,4), which is (3,-1). As a sign convention, the iteration vector for the destination is first in the subtraction. The sign convention is not important so long as it is consistently applied throughout the analysis. Note that distance vectors cannot always be computed by simply subtracting subscripts; this typically fails for loops with a non-unit stride.
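The subtraction for the y(4,4) example can be written out as a short sketch. The function name distance_vector is hypothetical; it follows the sign convention stated above, with the destination iteration vector first in the subtraction.

```python
def distance_vector(source_iter, dest_iter):
    """Destination iteration vector minus source iteration vector."""
    return tuple(d - s for s, d in zip(source_iter, dest_iter))


if __name__ == "__main__":
    assign_iter = (0, 4)  # y(i+3, j-1)= touches y(4,4) at i=1, j=5 (0-origin)
    use_iter = (3, 3)     # y(i, j) touches y(4,4) at i=4, j=4 (0-origin)
    print(distance_vector(assign_iter, use_iter))  # (3, -1)
```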
A direction vector is the sign of the distance vector; in other words, a direction vector is obtained when each component of the distance vector is replaced with +1, 0, or -1. For example, the distance vector (3,-1) has a corresponding direction vector of (+1,-1). Direction vectors are usually written symbolically, rather than numerically, with "<" for +1, "=" for 0, and ">" for -1. Thus, the distance vector (3,-1) is represented by (<,>). The "<" symbol indicates from a lesser to a greater iteration, while the ">" symbol indicates from a greater to a lesser iteration.
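The mapping from distance components to direction symbols can be sketched as follows. The function name direction_vector is hypothetical; the symbol assignment is the one given in the text.

```python
def direction_vector(distance):
    """Replace each distance component by the symbol for its sign."""
    symbols = {1: "<", 0: "=", -1: ">"}
    return tuple(symbols[(d > 0) - (d < 0)] for d in distance)


if __name__ == "__main__":
    print(direction_vector((3, -1)))  # ('<', '>')
    print(direction_vector((0, 2)))   # ('=', '<')
```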
Dependence families are often uniform, meaning that all dependencies in the family have the same distance and direction vectors. Referring to the loop of CODE BLOCK 4, the dependence family for y(i+3, j-1)= to y(i,j) is uniform. Therefore, (3,-1) can be considered the family distance vector, and (<,>) the family direction vector.
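Uniformity can be checked by brute force for a small loop bound. The sketch below enumerates every run-time reference of a loop of the form described for CODE BLOCK 4 and collects the distinct distance vectors; the function name family_distances and the choice of n=6 are assumptions made for illustration, and the loops are assumed to run from 1 to n with unit stride.

```python
def family_distances(n):
    """Collect distance vectors from y(i+3, j-1)= to y(i, j) for 1..n loops."""
    # Record the 0-origin iteration vector at which each element is assigned.
    writes = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            writes[(i + 3, j - 1)] = (i - 1, j - 1)

    # For each use of an assigned element, subtract the assigning iteration.
    distances = set()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            w = writes.get((i, j))
            if w is not None:
                distances.add((i - 1 - w[0], j - 1 - w[1]))
    return distances


if __name__ == "__main__":
    print(family_distances(6))  # {(3, -1)}
```

A single element in the resulting set indicates that every dependence in the family shares one distance vector, which is what the text means by a uniform family.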
The compilers of super computers use optimization techniques that transform the program code input by the user to better utilize the architecture of the machines. For example, for a parallel processing machine, instead of performing an operation sequentially, such that the machine waits for step two to complete before proceeding to step three, the compiler will rewrite the program to allow steps two and three to be processed simultaneously. The compiler will perform a dependency analysis to determine whether this type of transformation is legal. If the transformation is legal, the user will get a significant increase in speed, such as five times, plus a guarantee from the dependency analysis that the results are correct. If the transformation is illegal, the compiler will leave the code as is and not transform it.
One particular area that the compiler can optimize is iterative constructs, or loops. For example, a section of a program may be written to perform a matrix multiply, in which the different rows and columns of the matrices are multiplied together. The compiler may determine that, by using various transformations, that section of code can be rewritten to run faster and still achieve the correct results.
Before performing the transformation, the compiler will perform a dependency analysis for every loop nest in the program, and for every reference pair in each nest, to determine the iteration space and the memory constraints of the references during their lifetimes in the surrounding loops. That information is typically recorded in compilers as distance (or dependence) vectors and direction vectors. The length of the distance vector is the number of common loops that the references span, and each element in the vector provides a memory dependence distance. The direction vector can be used if the distance is not immediately determinable from the iteration vectors of the two references.
One loop transformation that is commonly performed by the compiler uses distance vectors and is referred to as loop strip-mining or loop blocking. This transformation reduces the overhead of a loop by running the loop in sections, which allows more reuse of cache memory when processing the loop. Typically, machines have a cache memory system, for example an L2 cache, that is separate from their registers and is located very close to the processor, such that its latencies are lower than those of the more remote RAM memory. However, reloading the cache incurs a penalty in time and performance. Because this penalty is expensive, it is advantageous to decrease the number of cache loads.
Strip-mining decreases the number of cache loads by transforming the loop so as to fit within the cache. The loop is transformed by adding an additional loop, such that the inner loop would access each cache line, and the outer loop would access the entire cache. This type of transformation is commonly used in super computers, from vector processing through parallel processing, to minimize cache memory latencies. The transformation introduces an additional loop that was not present in the original program code, and changes the subscripts for array elements in the transformed loop to be functions of both the original loop and the new loop.
An example of strip-mining or blocking is shown in CODE BLOCKS 5 and 6, which respectively depict a one-loop nest and the two-loop nest that it becomes after blocking or strip-mining. The compiler would calculate the dependencies once for the entire program, for every loop nest. This is a very expensive process that involves what is practically a pair-wise algorithm, meaning that for X references there would be approximately X^2 possibilities. ##EQU3##
CODE BLOCK 5 depicts a simple loop, Do i=1 to 10, in which there is an assignment a(i+3)= and a use of a(i). The compiler may determine that it would be more efficient to run this loop in sections of 4, based upon characteristics of the computer system, particularly the size of the cache as compared with the size of the loop. Thus, an outer loop is added, and the terms of the inner loop are modified as shown in CODE BLOCK 6, wherein the original code is transformed into Do j=1 to 10 by 4 and Do i=j to min(j+4-1, 10), with the same assignment a(i+3)= and use of a(i) in the loop body. Thus, j will have the values 1, 5, and 9. The values of i will change with j, such that i will have the values 1-4 when j=1, the values 5-8 when j=5, and the values 9-10 when j=9. Thus, the outer loop steps by 4 rather than by one, and the inner loop runs each strip for at most 4 iterations, which is the strip length, or for the remainder if the loop iteration count is not evenly divisible by the strip size. The outer loop is referred to as the section loop and the inner loop is referred to as the element loop.
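The section and element loops described for this example can be simulated directly. The sketch below is illustrative only; the function name strip_mine is hypothetical, and it simply records which values of i each value of j produces for bounds 1 to 10 and a strip length of 4.

```python
def strip_mine(lower, upper, strip):
    """Return {section value j: [element values i]} for the blocked loop."""
    sections = {}
    for j in range(lower, upper + 1, strip):    # section loop: Do j = lower to upper by strip
        last = min(j + strip - 1, upper)        # element loop upper bound
        sections[j] = list(range(j, last + 1))  # element loop: Do i = j to last
    return sections


if __name__ == "__main__":
    for j, elements in strip_mine(1, 10, 4).items():
        print(j, elements)
    # 1 [1, 2, 3, 4]
    # 5 [5, 6, 7, 8]
    # 9 [9, 10]
```

Concatenating the element lists in section order reproduces the original iteration sequence 1 through 10, which is why strip-mining by itself does not reorder any memory reference.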
In prior art compilers, after completing this transformation the compiler cannot do much more with this loop, and consequently cannot optimize any loop after it has been blocked. This is because the subscripts have become obscure and complicated: what used to be simple subscripts are now complicated by the additional loop. Moreover, the additional loop did not exist in the original code, meaning that the original distance vector would have to be re-sized or re-calculated. This is further complicated in that, for dependence analysis to be run, every loop has to be normalized, meaning that every loop has to be unity based, where the induction variable begins at 1 and steps by 1. ##EQU4##
For example, the loop nest shown in CODE BLOCK 6 is normalized as shown in CODE BLOCKS 7 and 8. The outer loop is normalized first, as shown in CODE BLOCK 7, and then the inner loop is normalized as shown in CODE BLOCK 8. The loop nest shown in CODE BLOCK 8 would be the input to the dependency analysis of the prior art compiler. The subscript expressions are coupled subscripts, meaning that every subscript is now a function of both the inner and the outer loop induction variables. This turns what was a simple subtraction to determine the distance vector into a complex calculation that requires solving for more than one unknown. Also, what had been a constant iteration space, or dependence distance, namely the difference between a(i+3) and a(i), or 3, is now variable and depends upon the values of the induction variables i and j.
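The loss of a constant distance can be seen by enumerating the blocked, normalized loop by brute force. Because CODE BLOCKS 7 and 8 are not reproduced here, the normalized form used below is an assumption: the section loop runs from 1 to 3, the element loop runs from 1 to at most 4, and the original index is recovered as i = 4*(section - 1) + element. The function names are hypothetical.

```python
def normalized_iterations(upper=10, strip=4):
    """Map each original index i to an assumed unity-based (section, element) pair."""
    mapping = {}
    n_sections = -(-upper // strip)  # ceiling division
    for section in range(1, n_sections + 1):
        base = strip * (section - 1)
        for element in range(1, min(strip, upper - base) + 1):
            mapping[base + element] = (section, element)  # i = strip*(section-1) + element
    return mapping


def blocked_distances(upper=10, strip=4, offset=3):
    """Distance vectors from a(i+offset)= to a(i) in the two-loop iteration space."""
    iters = normalized_iterations(upper, strip)
    distances = set()
    for i in range(1, upper + 1):
        if i + offset in iters:  # a(i+offset) assigned at i, used at i+offset
            (sw, ew), (sr, er) = iters[i], iters[i + offset]
            distances.add((sr - sw, er - ew))
    return distances


if __name__ == "__main__":
    print(sorted(blocked_distances()))  # [(0, 3), (1, -1)]
```

Under this assumed normalization, the single distance of 3 in the original loop becomes two different distance vectors, depending on whether the assignment and the use fall in the same section or in adjacent sections.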
A major problem in the prior art is that compilers will perform blocking or strip-mining as the last loop transformation. This is because it is difficult to test the legality of any optimization after this point, since loop bounds would then contain min and mod functions. Also, subscripts that were functions of a single loop induction variable are now functions of multiple loop induction variables. A prior art compiler, even if it could compute the dependencies after this transformation, would arrive at an overly conservative result due to the many new variables in the equations, and would incur a large expense for re-computing them.
Moreover, if the compiler were to re-calculate the dependencies after every transformation, it is more than likely that information would be lost, because with each subsequent transformation the loop bounds and subscripts become more complicated as a result of the semantics of that transformation. Prior art compilers just compute the dependencies once up front, and then perform a static ordering of the optimization transformations. The compilers halt optimizations when they encounter a single subscript that is a function of more than one loop induction variable. At this point, the compiler would assign registers and emit machine code. The distance and direction vectors, if they were to be calculated from the coupled-subscript expressions in CODE BLOCK 8, would appear as follows: ##EQU5##
Note that prior art compilers do not compute the results shown in CODE BLOCK 9 because of the overhead costs of normalization and the difficulty in considering coupled-subscript expressions. CODE BLOCK 9 is provided for comparison with the later described invention.
Therefore, there is a need in the art for a technique that would allow a compiler to update the distance vectors, permit non-static ordering of the optimizations, avoid incurring the overhead of normalizing and re-analyzing the dependencies, avoid having coupled subscript expressions, and allow the interchanging of loops, while maintaining simplicity and exactness throughout the compilation.