The present invention pertains to a method for optimizing source code by a compiler so that the resulting compiled code can be run much faster. More particularly, the present invention relates to a method for unrolling two-deep loops with convex bounds and imperfectly nested code, and for unrolling arbitrarily deep nests with constant bounds and imperfectly nested code.
Computers are being used today to perform a wide variety of tasks. Many different areas of business, industry, government, education, entertainment, and most recently, the home, are tapping into the enormous and rapidly growing list of applications developed for today""s increasingly powerful computer devices. Computers have also become a key technology for communicating ideas, data, and trends between and among business professionals. These devices have become so useful and ubiquitous, it would be hard to imagine today""s society functioning without them.
Computers operate by executing programs, or a series of instructions stored in its memory. These programs, and their series of instructions, are collectively referred to as software. Software is key to utility of computers. Software is what makes the computer devices function and perform useful tasks. Efficient software makes for effective machines, whereas inefficient software makes for difficult to use, less effective, slower machines. Thus, the utility of the computer device often hinges upon the quality of the software written for the device.
Initially, software is crafted by professionals referred to as programmers or software engineers. As programs have become larger and more complex, the task of writing software has become correspondingly more difficult. As a result, programmers typically code in xe2x80x9chigh level languagesxe2x80x9d to improve productivity. The use of high level language makes the task of writing extremely long and complex programs more manageable. The completed program, however, must eventually be translated into machine executable language in order to run on a computer. The process of translating the program written in high level language into a program written in machine language is referred to as compiling. The actual translation is performed by a software program referred to as a compiler. Programmers rely upon compilers to translate their programs which are written in high level language (e.g., Basic, Fortran, C++, Pascal, etc.) into a program comprised of machine executable code, known as xe2x80x9cmachine language.xe2x80x9d
Basically, the compiler operates on the program written in high level language. The high level language program is referred to as source code. The compiler translates the source code into machine executable code. Ultimately, it is the machine executable code which will run on the computer. The speed and reliability of the executable code depends upon the sophistication of the compiler, i.e. how it xe2x80x9coptimizesxe2x80x9d. If the compiler is unsophisticated, the size of the executable code will be larger than necessary. Worse, execution speed and reliability, may also be affected. Hence, it is critical to the speed and efficiency of the program that the compiler thoroughly optimize the executable code during the translation process.
There are several different methods and techniques used to optimize source code for scientific applications, loop interchange, cache tiling or blocking, etc. The present invention pertains to an improved variation for performing an optimization technique commonly referred to as xe2x80x9cunroll and jam.xe2x80x9d With the prior art technique, perfectly nested loop nests with constant loop bounds in source code can be compiled more efficiently. A xe2x80x9cloopxe2x80x9d refers to a sequence of steps which is repeated over and over again until a condition is satisfied. Often, source code contains nested loops whereby an inner loop is executed as part of the sequence of steps within an outer loop. Nested loops, under certain circumstances, may be xe2x80x9cunrolled and jammedxe2x80x9d such that it produces an equivalent result, yet the total number of load or store operations is reduced. Load and store operations are costly because it takes a relatively long period of time to load and store data into the appropriate registers. Consequently, reducing the number of load and store operations increases the speed at which a compiled code can run.
An example is now offered to demonstrate the prior art unroll and jam operation. The following source code shows a two-deep perfectly nested loop:
Each time through the inner xe2x80x9cDOxe2x80x9d loop, the same a(i) variable is used, but a different b(j) must be loaded. Hence, a separate load operation must be performed each time the statement xe2x80x9ca(i)=a(i)+b(j)xe2x80x9d is executed. The outer loop can be unrolled as follows:
By unrolling the xe2x80x9cixe2x80x9d loop once, the same result is achieved. The difference is that the xe2x80x9cixe2x80x9d loop is now stepping by two. Note that the unrolling process described above does not result in any changes to the statements. All statements are executed in the same order as before: (1,1), (1,2), . . . (1,N), (2,1), (2,2), . . . (2,N). The purpose for unrolling the source code is to prepare it for the jamming process. It should be noted that stepping by two poses a couple of problems. First, it must be ensured that the unrolled code does not overshoot and perform an extra step. Hence, one is subtracted from the upper bounds (N1xe2x88x92i). Second, there must be a mechanism for handling the case if N1 is odd. This is accomplished by implementing a wind-down loop. The wind-down loop executes the last step in case N1 is odd.
After unrolling, the source code is optimized by applying a xe2x80x9cjammingxe2x80x9d procedure. Jamming as applied to the above example produces the following code:
By jamming the two xe2x80x9cjxe2x80x9d loops into a single loop, the computation has been changed. Instead of performing (1,1), (1,2), . . . , (1,N), (2,1), (2,2), . . . (2,N); the new process runs (1,1), (2,1), (1,2), (2,2), (3,1), (3,2), etc. Now, b(1) which was used by both (1,1) and (2,1), no longer has to wait for the N iterations of xe2x80x9cjxe2x80x9d to happen. It can be loaded for (1,1) and used without having to be reloaded for (2,1). This is due to the fact that (1,1) and (2,1) occur one right after the other. Hence, jamming effectively reduces the total number of loads. In effect, the jamming procedure combines the unrolled loops. As a result, the unrolled and jammed version runs much faster than the original source code.
However, prior art unroll and jam procedures are limited to code having both constant loop bounds and whose loops also happen to be perfectly nested as well. Both of these are very strong restrictions. First non-constant loop boundaries (e.g. in the above example, if the loop bounds for the j loop had depended on i) are frequently found in source code. Second, it is very common for these to be statements nested in the outer loop, but not loops further in.
Thus, there is a need in the prior art for a compiler that can unroll and jam the more general case of source code having variable loop boundaries and non-perfectly nested statements. The present invention provides an elegant solution to these problems in important cases by implementing an xe2x80x9couter loop unrollingxe2x80x9d process. With the present invention, source code which normally cannot be unrolled and jammed, can now be optimized by an outer loop unrolling process to reduce the total number of load and store operations. Specifically, in the case of doubly nested loops (i.e., 2D loop nests), both restrictions are relaxed. In other words, the code may be imperfectly nested with loop bounds of any linear convex configuration, and be outer unrolled. Since two-deep nests are by far the more common case and since it is extremely common that bounds are not constant but are linear convex, the present invention offers a very important extension to prior art optimization techniques. Furthermore, this present invention can be applied to loop nests of any depth with constant bounds, and, unlike with the prior art, the loop nest need not be perfectly nested. Since loop nests are typically not perfectly nested, this also is a very significant extension to the prior art.
The present invention pertains to a compiler for compiling source code whereby the compiled source code is optimized by performing unroll and jam on selected loop nests. The present invention allows any arbitrarily deep loop nests with constant bounds to be properly unrolled, even those loop nests having imperfectly nested code. This is accomplished by transforming the code into multiple loops. In the transformed code, the imperfect code is isolated in one of the loops while the other loops are placed in a condition such that a conventional unroll and jam operation may be applied. For three-deep or greater loop nests, the process is repeated recursively from the outer-most loop.
The present invention also allows outer loop unrolling for two-deep loop nests not only with imperfectly nested code, but also with convex bounds. This is accomplished by identifying strips of code which do not contain imperfectly nested code. An unroll and jam operation is executed for the identified strips. Code falling outside of the identified strips as well as wind-down code, are executed according to their original untransformed order.