1. Technical Field
The present invention relates to an improved data processing system. In particular, the present invention relates to loop optimization transformations. Still more particularly, the present invention relates to a generic language interface that allows programmers to apply loop optimization transformations to loops in data processing system programs.
2. Description of Related Art
In conventional computing systems, processors execute program instructions by first loading the instructions from memory, which may either be a cache memory or main memory. Main memory is a storage device used by the computer system to hold currently executing program instructions and working data. An example of main memory is random access memory (RAM).
Cache memory is a fast memory that holds recently accessed data and is designed to speed up subsequent accesses to the same data. When data is read from or written to the main memory, the cache memory saves a copy along with the associated main memory address. The cache memory also monitors the addresses of subsequent reads to see whether the requested data is already stored in the cache. If it is, a cache hit occurs and the data is returned immediately. Otherwise, a cache miss occurs and the data is fetched from main memory and saved in the cache.
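The hit and miss behavior described above can be illustrated with a minimal sketch in C. The sketch assumes a direct-mapped cache holding one word per line and models reads only; the names Cache, CacheLine, cache_read, and NUM_LINES are hypothetical and chosen purely for illustration, and the sketch is not intended to represent any particular hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 8u   /* hypothetical cache capacity, in one-word lines */

typedef struct {
    bool     valid;    /* true once the line holds a copy of main memory data */
    uint32_t address;  /* main memory address the copy was taken from */
    int      data;     /* the cached copy itself */
} CacheLine;

typedef struct {
    CacheLine lines[NUM_LINES];
    unsigned  hits;    /* number of reads served from the cache */
    unsigned  misses;  /* number of reads that went to main memory */
} Cache;

/* Read one word through the cache. main_memory is indexed by word address.
 * Assumption: direct mapping, so each address can live in exactly one line. */
int cache_read(Cache *c, const int *main_memory, uint32_t address)
{
    CacheLine *line = &c->lines[address % NUM_LINES];

    if (line->valid && line->address == address) {
        c->hits++;                      /* cache hit: return the saved copy */
        return line->data;
    }

    c->misses++;                        /* cache miss: fetch from main memory */
    line->valid   = true;
    line->address = address;            /* save the copy with its address */
    line->data    = main_memory[address];
    return line->data;
}
```

In this sketch, a Cache initialized to all zeros starts empty, so the first read of any address is a miss and an immediately repeated read of the same address is a hit. Two addresses that map to the same line evict each other.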
Since the cache memory is built from faster memory chips than the main memory, a cache hit generally takes less time to complete than a main memory access. Multiple levels of cache memory may therefore be implemented in a computer system, with each level trading capacity against access time. For example, a level one cache is smaller in size and located closer to the processor, which provides faster access time. A level two cache, on the other hand, is larger in size and provides slower access time than the level one cache.
While the level one cache may be located in close proximity to the processor, the level two cache may be located further away from the processor. If an attempt to access data from the level one cache fails, the processor often steps up to the level two cache or higher to access the same data. Thus, a system may have several levels of cache memory that catch lower-level cache misses before an access to main memory is attempted.
Cache memory relies on two properties of program data accesses: temporal locality and spatial locality. Temporal locality concerns when data is accessed: if data is accessed once, the same data is likely to be accessed again soon. Spatial locality concerns where data is located in memory: if a memory location is accessed, nearby memory locations are likely to be accessed as well.
To exploit spatial locality, cache memory often operates on several words at a time; such a group of words is known as a cache line or cache block. Main memory, in turn, reads and writes in units of one or more cache lines or cache blocks. Previously, attempts have been made to reduce the cache miss rate in computer systems. These attempts include utilizing larger block sizes, larger cache sizes, and pre-fetching instructions. However, these attempts require associated hardware changes.
In recent years, other attempts have been made using software optimizations, in which program instructions are reordered to reduce the number of cache misses. These software optimization transformations may be performed by an optimizing compiler. Examples of software optimization techniques include merging arrays, loop interchange, and blocking. Merging arrays improves spatial locality by using a single array of compound elements, rather than two arrays of single elements. This technique reduces potential conflicts between data elements in the cache memory when elements of the two arrays are accessed together. Loop interchange changes the nesting of loops so that data is accessed in the order in which it is stored in memory, which improves spatial locality. Blocking, or “tiling”, improves temporal locality by accessing cache-contained “tiles” of data repeatedly, rather than iterating over a whole column or row of data.
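The three techniques named above can be sketched in C as follows. The sketch assumes row-major storage, as C uses for two-dimensional arrays, and small hypothetical sizes N and TILE, where TILE divides N. All function and type names are illustrative only and are written by hand here; an optimizing compiler may derive equivalent code automatically.

```c
#include <stddef.h>

enum { N = 32, TILE = 8 };  /* hypothetical sizes; TILE must divide N */

/* Merging arrays: one array of compound elements replaces separate
 * key[] and val[] arrays, so both fields of an element share a cache line. */
typedef struct { int key; int val; } Record;

long dot_merged(const Record *r)          /* r points to N records */
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += (long)r[i].key * r[i].val;
    return sum;
}

/* Before loop interchange: the inner loop walks down a column, so
 * consecutive accesses are N elements apart in memory. */
long sum_column_order(int a[N][N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

/* After loop interchange: the inner loop walks along a row, so
 * consecutive accesses touch adjacent memory locations. */
long sum_row_order(int a[N][N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Blocking (tiling): c += a * b, processed one TILE x TILE tile of b at a
 * time. The same tile of b is reused for every row i before moving on.
 * The caller is assumed to have zeroed c beforehand. */
void matmul_tiled(int a[N][N], int b[N][N], long c[N][N])
{
    for (size_t kk = 0; kk < N; kk += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = 0; i < N; i++)
                for (size_t k = kk; k < kk + TILE; k++)
                    for (size_t j = jj; j < jj + TILE; j++)
                        c[i][j] += (long)a[i][k] * b[k][j];
}
```

Both summation functions return the same value for the same input, and the tiled multiplication accumulates the same products as an untiled triple loop. Only the order of memory accesses differs, which is what affects the cache miss rate.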
Currently, in order to optimize a program by performing loop transformations, programmers have to modify existing program instructions to insert their own performance tuning code. Programmers may also rely on the compiler to heuristically apply the performance tuning transformations at compile time. However, programmers cannot interact with the compiler directly to tune their programs using complex loop transformations without first modifying existing program instructions, or can do so only in a limited way. This situation makes it difficult for programmers to control the compiler optimization process in order to apply complex loop transformations.
Therefore, it would be advantageous to have a method and apparatus that allows programmers to gain control of the compiler optimization process in order to apply complex loop transformations. Also, it would be advantageous to have a method and apparatus that allows programmers to direct the compiler to perform loop transformations without modifying existing program instructions. Furthermore, it would be advantageous to allow other compilers to apply the loop transformations or ignore them completely without changing the semantics of the existing program.