The present invention relates to optimizing compilers. More specifically, the present invention relates to compilers and tools for producing optimized code which makes effective use of the cache memory system provided in the computer on which the code is executing.
A compiler is one tool that is used to convert computer programs written in high level programming languages into the machine code which is executed by the CPU(s) in a computer system. Depending upon how the compiler performs this conversion, the resulting program can execute at different speeds on the computer and/or can require more or less system memory and more or less storage space.
Much work has been done in the past to create compilers which do more than just create a direct translation from source code to machine code. Such compilers are typically referred to as optimizing compilers and they operate to analyze the source code provided to them and to then select and implement appropriate strategies and/or machine code structures that will execute more efficiently on the target computer system than a mere direct translation would.
While optimizing compilers can employ many techniques, such as loop transformation and/or data remapping, to produce efficient machine code, advances in computer hardware have introduced new challenges to compiler designers. Specifically, the clock speed of CPU devices has increased in recent years, while system memory speeds have lagged behind. Unmanaged, this speed discrepancy, which is typically referred to as memory latency, causes the CPU to wait idly while data is read from or written to system memory.
To address memory latency, caches may be employed. Caches are banks of memory which are small relative to system memory and which can be accessed faster than system memory, but which may be more expensive than system memory and/or which are optimally located within the system architecture for faster access. The intention is that data will be read into the cache before it is required by the CPU, thus hiding memory latency from the CPU. If a data element required by the CPU is available in the cache, a cache “hit” is said to have occurred, while if the required data element is not available in the cache, a cache “miss” is said to have occurred and the CPU must wait while the required data element is retrieved from system memory. Most CPUs now include some amount of cache memory on their chip dies, but the amount of available die area limits the size of the cache that can be placed on-chip. Additional cache memory can be provided in processor assemblies and/or at the system memory subsystems.
Caches are typically arranged in a hierarchy denoted by Levels, with the Level closest to the CPU (usually on the CPU die) being referred to as Level 1 (L1). L1 cache is at the top of the cache hierarchy and Level 2 cache is the next lower level of cache, etc. In the IBM Power4™ system architecture, for example, Level 1 (L1), Level 2 (L2) and Level 3 (L3) caches are provided between the CPU(s) and the system memory.
To manage the memory latency issue, computer designers are employing caching hardware and cache management techniques. The hardware cache management mechanisms provided in the IBM Power4™ system architecture include hardware pre-fetch support for the caches. This pre-fetch hardware can recognize up to eight streams of data accessed by an application and will pre-fetch data for those streams to the L1 cache so that the CPU does not have to wait for data required by these streams to be recalled from main memory. A stream, in this sense, is a sequence of stride one memory accesses, that is, accesses to adjacent, or closely located, locations in memory. An example of a stream would be sequential reads from system memory of the sequential elements in a one dimensional array.
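The notion of a stream can be illustrated with a minimal sketch. The function names `array_addresses` and `is_stride_one_stream` are hypothetical and introduced only for illustration, and the test applied (each access lands exactly one element after the previous one) is a simplifying assumption; the actual detection rules used by the pre-fetch hardware are not described here.

```python
def array_addresses(base, element_size, count, step=1):
    """Byte addresses touched when reading every `step`-th element of an array.

    All parameters are illustrative: `base` is the array's starting address,
    `element_size` the size of one element in bytes.
    """
    return [base + i * step * element_size for i in range(count)]


def is_stride_one_stream(addresses, element_size):
    """Simplified model: a stream is a run of accesses to adjacent elements."""
    if len(addresses) < 2:
        return False
    return all(b - a == element_size for a, b in zip(addresses, addresses[1:]))


# Sequential reads of a one dimensional array of 8-byte elements form a stream.
sequential = array_addresses(0x1000, 8, 16)
# Reading every fourth element is not a stride one access pattern.
strided = array_addresses(0x1000, 8, 16, step=4)

print(is_stride_one_stream(sequential, 8))  # True
print(is_stride_one_stream(strided, 8))     # False
```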
Caches have a variety of limitations or conditions that must be considered to utilize them effectively. Data is moved in or out of caches in aligned chunks called cache lines and caches are arranged into a number of cache lines of fixed size. In the above-mentioned IBM Power4™ system, the L1 cache is 32 kB in total size and the cache is arranged in 256 cache lines of 128 bytes each. Where possible, data elements which are accessed temporally or spatially ‘close’ to one another should be located within a single cache line.
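The cache line arithmetic can be checked with a short sketch. The sizes are the L1 figures quoted above; the helper names `line_base` and `same_line` are hypothetical and serve only to show what "aligned" means for a 128 byte line.

```python
LINE_SIZE = 128          # bytes per cache line (L1 figure quoted above)
CACHE_SIZE = 32 * 1024   # 32 kB total L1 size (figure quoted above)

NUM_LINES = CACHE_SIZE // LINE_SIZE  # 32768 / 128 = 256 lines


def line_base(address):
    """Address of the first byte of the aligned line that holds `address`."""
    return address - (address % LINE_SIZE)


def same_line(a, b):
    """True when two byte addresses fall within the same aligned cache line."""
    return line_base(a) == line_base(b)


print(NUM_LINES)            # 256
print(line_base(300))       # 256
print(same_line(256, 383))  # True: both lie in the line starting at 256
print(same_line(383, 384))  # False: 384 starts the next line
```

Two data elements that straddle a line boundary, such as the bytes at addresses 383 and 384 here, require two cache lines even though they are adjacent in memory.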
Another limitation of cache memories is the possibility for cache conflicts. Caches employ a mapping technique to place a data element stored at a location in a potentially very large system memory into a location in the much smaller cache. A conflict occurs when the mapping technique results in two required data elements being mapped to the same location within the cache. It is possible that multiple data elements in system memory that are required to be cached will be mapped to the same location in the cache. In such a case, a first element cached will be overwritten in the cache by any subsequent element to be cached at that same location and the attempted access to that first element, now overwritten, will result in a cache miss.
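A cache conflict of this kind can be reproduced with a minimal model. The sketch assumes a direct-mapped cache of 256 lines of 128 bytes and a simple modulo mapping; both the mapping function and the names `cache_slot` and `count_misses` are assumptions made for illustration and do not describe the mapping used by any particular hardware.

```python
LINE_SIZE = 128   # bytes per cache line (assumed)
NUM_LINES = 256   # number of lines in the cache (assumed)


def cache_slot(address):
    """Map a byte address to a cache location using an assumed modulo mapping."""
    return (address // LINE_SIZE) % NUM_LINES


def count_misses(addresses):
    """Replay accesses against a direct-mapped cache and count the misses."""
    slots = {}  # cache location -> memory line number currently held there
    misses = 0
    for address in addresses:
        line = address // LINE_SIZE
        slot = line % NUM_LINES
        if slots.get(slot) != line:
            misses += 1
            slots[slot] = line  # the previous occupant is overwritten
    return misses


a = 0
b = LINE_SIZE * NUM_LINES  # 32768 bytes away: maps to the same location as a

print(cache_slot(a), cache_slot(b))   # 0 0
print(count_misses([a, b, a, b]))     # 4: each access evicts the other element
print(count_misses([a, 128, a, 128])) # 2: no conflict, only the first touches miss
```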
In an attempt to reduce the frequency with which such cache conflicts occur, many caches employ set associativity, which essentially provides sets of locations to which a system memory location can be mapped. The above-mentioned L1 cache in the Power4™ system employs two-way set associativity. The probability of a conflict occurring can thus be halved, as the mapping function can place a required data element at a given location in either of the two sets of cache lines, avoiding a conflict with a required data element already mapped to that location in the other of the two sets of cache lines. However, such conflicts may still occur and can be problematic if the size and/or arrangement of the elements in an array or other data structure is some multiple of a component of the mapping function, such that multiple elements of the array will be mapped to the same location in the cache.
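The effect of two-way set associativity, and its limit, can be sketched as follows. The model assumes 256 lines of 128 bytes organized so that each mapped location can hold two lines, a modulo mapping, and least-recently-used replacement; the mapping, the replacement policy and the name `count_misses_two_way` are all assumptions made for illustration rather than a description of the Power4™ hardware.

```python
from collections import OrderedDict

LINE_SIZE = 128                # bytes per cache line (assumed)
NUM_LINES = 256                # total lines in the cache (assumed)
WAYS = 2                       # two candidate lines per mapped location
NUM_SETS = NUM_LINES // WAYS   # 128 mapped locations


def count_misses_two_way(addresses):
    """Replay accesses against a two-way cache with assumed LRU replacement."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for address in addresses:
        line = address // LINE_SIZE
        ways = sets[line % NUM_SETS]
        if line in ways:
            ways.move_to_end(line)         # mark as most recently used
        else:
            misses += 1
            if len(ways) == WAYS:
                ways.popitem(last=False)   # evict the least recently used line
            ways[line] = True
    return misses


stride = LINE_SIZE * NUM_SETS  # 16384 bytes: addresses this far apart collide
a, b, c = 0, stride, 2 * stride

print(count_misses_two_way([a, b, a, b]))        # 2: both elements stay cached
print(count_misses_two_way([a, b, c, a, b, c]))  # 6: three elements still conflict
```

Under these assumptions, two conflicting elements coexist, but an array whose elements are spaced at a multiple of 16384 bytes still maps three or more elements to the same location and misses on every access.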
One optimization strategy used in compilers to improve cache utilization is data remapping, which is the re-organization and re-arrangement of how data is stored in the system memory. For example, a compiler can arrange the data storage of a two dimensional array of data in system memory so that elements adjacent in array rows are adjacent in the system memory if the array is accessed in row order by the application (typically referred to as row major access). Alternatively, the compiler can arrange the data storage of the two dimensional array of data in system memory so that elements adjacent in array columns are adjacent in the system memory if the array is accessed in column order (typically referred to as column major access).
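The two storage arrangements differ only in how an element's position in memory is computed from its row and column. The following sketch shows the two offset calculations for an array of `n_rows` by `n_cols` elements; the function names are hypothetical and the offsets are expressed in elements rather than bytes.

```python
def row_major_offset(row, col, n_rows, n_cols):
    """Element offset when elements adjacent in a row are adjacent in memory."""
    return row * n_cols + col


def column_major_offset(row, col, n_rows, n_cols):
    """Element offset when elements adjacent in a column are adjacent in memory."""
    return col * n_rows + row


n_rows, n_cols = 3, 4

# Moving one step along a row (column index 2 -> 3):
print(row_major_offset(1, 3, n_rows, n_cols)
      - row_major_offset(1, 2, n_rows, n_cols))      # 1
print(column_major_offset(1, 3, n_rows, n_cols)
      - column_major_offset(1, 2, n_rows, n_cols))   # 3

# Moving one step down a column (row index 1 -> 2):
print(row_major_offset(2, 2, n_rows, n_cols)
      - row_major_offset(1, 2, n_rows, n_cols))      # 4
print(column_major_offset(2, 2, n_rows, n_cols)
      - column_major_offset(1, 2, n_rows, n_cols))   # 1
```

An access order that matches the storage arrangement advances through memory one element at a time, whereas a mismatched order jumps by a full row or column length on every access.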