This invention relates generally to improving the performance of high-level programs on practical computing systems by enhancing data reuse through program transformations. More particularly, the present invention relates to program transformations that are derived automatically in a compiler by consideration of the flow of data blocks through the memory hierarchy of the computer system.
Increases in the speed of microprocessors have not been matched by commensurate increases in the speed of main memory (DRAM). Therefore, many applications are now memory-bound in the sense that the CPU is stalled a significant proportion of the time, waiting for memory to provide it with data. As a result, these applications cannot take advantage of fast processors. One solution is to use fast memory (SRAM), but that is prohibitively expensive. The preferred solution is to have a memory hierarchy in which a small amount of fast SRAM (the cache) is coupled with a large amount of slower DRAM (the memory). At any point in time, the cache contains a copy of some portion of main memory. When the processor needs to access an address, the cache is first searched to see if the contents of that address are already available there. If so, the access is said to be a hit, and the required data is acquired from the (fast) cache; otherwise, the access is said to be a miss, and the required data is accessed from the (slow) memory and a copy of this data is left in the cache for future reference. Since the size of the cache is small compared to the size of the memory, some data (called the victim data) must be displaced from the cache to make room for the new data.
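The hit/miss/victim mechanism described above can be sketched with a minimal cache model. The direct-mapped organization, line size, line count, and address trace below are illustrative assumptions for exposition only; they are not part of the invention.

```python
class DirectMappedCache:
    """A minimal direct-mapped cache model illustrating hits, misses,
    and victim displacement."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # which memory line each cache line holds
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_size   # memory line containing the address
        index = line % self.num_lines      # cache line it maps to
        if self.tags[index] == line:
            self.hits += 1                 # data found in the (fast) cache
        else:
            self.misses += 1               # fetched from the (slow) memory;
            self.tags[index] = line        # the previous occupant is the victim

cache = DirectMappedCache(num_lines=4, line_size=8)
for addr in [0, 1, 2, 100, 0]:
    cache.access(addr)
# addresses 1 and 2 hit (same line as 0); address 100 displaces line 0,
# so the final access to 0 misses again
```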
For a program to perform well on a machine with a memory hierarchy, most of its data accesses must result in hits. This is the case if the program exhibits data reuse (that is, it accesses the same data several times), and if these accesses are sufficiently close together in time that all the accesses other than the first one result in hits. Many important codes exhibit data reuse. For example, given two n-by-n matrices, matrix multiplication performs n-cubed operations on n-squared data; so each data element is accessed n times. Other important codes like Cholesky and LU (Lower Upper) factorization (both well known in the art) also entail substantial data reuse. Given an algorithm with data reuse, it is important to code it in such a way that the program statements that touch or access a given location are clustered in time (otherwise, the data may be displaced from cache between successive accesses to make room for other data that may be accessed). Unfortunately, coding such algorithms manually is tedious and error-prone, and makes code development and maintenance difficult. For example, a straightforward coding of Cholesky factorization requires about ten lines of code; coding it carefully to cluster accesses to the same address requires about a thousand lines of code. This difficulty increases as the memory hierarchy becomes deeper. Moreover, the program is made less abstract and less portable because of the introduction of machine dependencies.
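The reuse claim above can be made concrete: in n-by-n matrix multiplication, on the order of n-cubed operations touch only n-squared data, so each element is read n times. The access counters in the sketch below are an illustrative instrument added for demonstration; they are not part of the algorithm.

```python
def matmul_with_counts(A, B, n):
    """Naive n-by-n matrix multiplication, counting how often each
    element of B is read to exhibit the n-fold data reuse."""
    C = [[0] * n for _ in range(n)]
    touches_B = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                touches_B[k][j] += 1   # B[k][j] is reused once per value of i
    return C, touches_B

n = 4
A = [[1] * n for _ in range(n)]
B = [[1] * n for _ in range(n)]
C, touches_B = matmul_with_counts(A, B, n)
# every element of B is read exactly n times
```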
One approach to addressing this problem is to use a compiler to transform easily-written high-level codes into low-level codes that run efficiently on a machine with a memory hierarchy. A variety of program transformations have been developed by the compiler community. These transformations are called "control-centric" transformations because they seek to modify the control flow of the program to improve data locality. The scope of these transformations is limited to "perfectly nested loops" (loops in which all assignment statements are contained in the innermost loop in the loop nest). The most important transformation is called tiling; in some compilers, tiling is preceded by "linear loop transformations" which include permutation, skewing and reversal of loops. These techniques are well known in the art.
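Tiling, the control-centric transformation named above, can be sketched on the perfectly nested matrix-multiplication loop. The tile size T below is an illustrative assumption; in practice a compiler would derive it from the cache capacity.

```python
def matmul_tiled(A, B, n, T):
    """Matrix multiplication with its i, j, k loops tiled by T, so that
    each T-by-T working set is reused while it remains in cache."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, T):              # visit tiles of the i dimension
        for jj in range(0, n, T):          # ... tiles of j
            for kk in range(0, n, T):      # ... tiles of k
                # inner loops stay within one tile, clustering accesses
                # to the same data close together in time
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n, T = 6, 2
A = [[i + j for j in range(n)] for i in range(n)]
B = [[(i + 1) * (j + 1) for j in range(n)] for i in range(n)]
C = matmul_tiled(A, B, n, T)
```

Tiling reorders the iterations but performs the same multiply-add instances, so the result is identical to the untiled loop nest.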
A significant limitation of this technology is that the scope of the transformations is limited to perfectly nested loops. Although matrix multiplication is a perfectly nested loop, most applications, such as Cholesky and LU factorization, do not consist of perfectly nested loops. In principle, some of these applications can be transformed into perfectly nested loops which can then be tiled; however, the performance of the resulting code is somewhat unpredictable and is often quite poor.
Many known techniques used in conjunction with the general problems are well disclosed in the book authored by Michael Wolfe, HIGH PERFORMANCE COMPILERS FOR PARALLEL COMPUTING, published by Addison-Wesley in 1995. This book is incorporated herein by reference.
It is therefore an object of this invention to provide a system and process for enhancing data reuse by focusing on the data flow rather than on the control flow.
It is a further object of this invention to provide a data-centric multilevel blocking technique for compilers and database programs.
It is still a further object of this invention to provide means for automatically handling transformations of imperfectly nested loops in an efficient manner.
The objects set forth above as well as further and other objects and advantages of the present invention are achieved by the embodiments and techniques of the invention described below.
The present invention provides a new approach to the problem of transforming high-level programs into programs that run well on a machine with a memory hierarchy. Rather than focus on transforming the control flow of the program, the compiler determines a desirable data flow through the memory hierarchy, and transforms the program to accomplish this data flow. The key mechanism in this "data-centric" approach to locality enhancement is the "data shackle." The inventive approach is to identify all the program statements that touch the data currently resident in a cache memory and to perform all of their operations on that data while it is resident, thereby providing the advantages of enhanced data reuse and speed.
A data shackle is a program mechanism defined by a specification of (i) the order in which the elements of an array must be brought into the cache, and (ii) the computations that must be performed when a given element of the array becomes available from the cache.
The order in which data elements must be brought into the cache may be specified by dividing the array into "blocks" (using cutting planes parallel to the array axes), and determining an order in which these blocks are to be brought into the cache. The computations that should be performed when a particular block of the array is brought into the cache are specified by selecting a reference to the array from each program statement; when a data block is brought into the cache, all instances of the program statement for which the selected reference accesses data in the current block are executed. In this fashion the data reuse in the cache is maximized.
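The execution rule above can be sketched for matrix multiplication shackled on the output array C: C is cut into blocks by planes parallel to its axes, the blocks are visited in a chosen order, and for each block every instance of the statement whose selected reference (here, C[i][j]) falls in the current block is executed. The function name and block size are illustrative assumptions; an actual compiler would generate loop bounds that enumerate only the matching instances rather than test each one as this sketch does.

```python
def shackled_matmul(A, B, n, bsize):
    """Matrix multiplication driven by a data shackle on C: for each
    block of C, execute all statement instances (i, j, k) whose selected
    reference C[i][j] lies in the block currently 'in cache'."""
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, bsize):          # visit blocks of C in a
        for bj in range(0, n, bsize):      # row-major block order
            for i in range(n):
                for j in range(n):
                    for k in range(n):
                        # execute the instance only if its selected
                        # reference falls in the current block
                        if bi <= i < bi + bsize and bj <= j < bj + bsize:
                            C[i][j] += A[i][k] * B[k][j]
    return C

n, bsize = 4, 2
A = [[i - j for j in range(n)] for i in range(n)]
B = [[i + j for j in range(n)] for i in range(n)]
C = shackled_matmul(A, B, n, bsize)
```

Because each statement instance matches exactly one block of C, every instance is executed exactly once and the result equals that of the untransformed loop nest.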
Data shackles can be combined together to produce composite shackles. These shackles can operate on the same data structure, to enhance data reuse for other references or to exploit reuse at deeper levels of the memory hierarchy, or on different data structures, to improve data reuse for other arrays. The latter is required for codes like matrix multiplication that work with several arrays at a time. Once the compiler has decided on the data shackles it will use, it can generate code that conforms to these specifications.
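Composing shackles on the same array for a deeper memory hierarchy can be sketched as nested blocking of C: an outer shackle moves large blocks sized for a second-level cache, and an inner shackle sub-blocks them for the first-level cache. The two block sizes are illustrative assumptions.

```python
def composed_shackles_matmul(A, B, n, L2, L1):
    """Matrix multiplication under two composed shackles on C:
    L2-sized outer blocks (second-level cache) subdivided into
    L1-sized inner blocks (first-level cache)."""
    C = [[0] * n for _ in range(n)]
    for oi in range(0, n, L2):                        # outer shackle on C
        for oj in range(0, n, L2):
            for ii in range(oi, min(oi + L2, n), L1): # inner shackle within
                for jj in range(oj, min(oj + L2, n), L1):
                    for i in range(ii, min(ii + L1, n)):
                        for j in range(jj, min(jj + L1, n)):
                            for k in range(n):
                                C[i][j] += A[i][k] * B[k][j]
    return C

n, L2, L1 = 5, 4, 2
A = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity
B = [[i * n + j for j in range(n)] for i in range(n)]
C = composed_shackles_matmul(A, B, n, L2, L1)
# with A the identity matrix, C equals B
```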
The inventive data-centric approach to generating code that exploits data reuse has many advantages over existing control-centric approaches. Current control-centric technology works only for the subset of programs in which reuse is present in perfectly nested loops. The data-centric approach, in contrast, works for all programs, since it is based on orchestrating data movement directly rather than indirectly as a side effect of control-flow transformations.
For a better understanding of the present invention, together with other and further objects thereof, reference is made to the accompanying drawings and detailed description and its scope will be pointed out in the appended claims.