1. Field of the Invention
The present invention generally relates to optimization of computer programs and more particularly, the present invention relates to ordering portions of a computer program and data used therein for improved execution order.
2. Background Description
A typical state of the art computer, whether a large mainframe computer or a small personal computer (PC), includes a hierarchical memory system. The memory system may include non-volatile storage, main memory and cache memory. Non-volatile storage is typically disk storage or a hard disk. Main memory is typically relatively low cost, dense dynamic random access memory (DRAM). Cache memory is usually faster, more expensive, less dense static random access memory (SRAM). Typically, cache memory is a small portion of the entire memory system, e.g., 1-10%.
Memory paging systems are designed to exploit spatial and temporal locality. Temporal locality refers to the tendency of programs to execute instructions repeatedly; thus the performance of fetching instructions from main memory can be improved by saving recently executed instructions in a small high-speed cache. Instructions are said to exhibit good spatial locality in a program if execution of an instruction tends to be followed quickly by execution of instructions packaged nearby. A program with poor spatial locality will cause unneeded instructions to be fetched into the cache, preventing cache operation at its full potential. For these hierarchical memory systems, volatile memory may be thought of as a medium-speed cache for low-speed persistent memory, such as a disk. Recently used pages are kept in memory to take advantage of temporal locality. Again, good spatial locality is required to avoid bringing unneeded instructions and data into memory. Poor spatial locality thus reduces the efficiency of memory paging.
Further, processor performance is increasing much more rapidly than the performance of the attached memory subsystems. So, it is increasingly difficult to feed data and instructions to processors rapidly enough to keep the processors utilized to their maximum capacity. As a result, a great deal of ingenuity has been expended on hardware solutions to improve access time and memory reference throughput, including caches, prefetch buffers, branch prediction hardware, memory module interleaving, very wide buses, and so forth. Also, software may be optimized to take the best possible advantage of these hardware advances.
Unfortunately, "naive" code generation often results in programs that have poorer spatial locality than is achievable. It is typical, for example, to generate code that branches around infrequently executed error paths. This results in poor utilization of the instruction cache, since some of the error path code will usually be fetched into the cache along with the branch that bypasses it. It is also typical for computational procedures to be packaged without consideration for locality, so that although procedure A frequently calls procedure B, A and B are located in different memory pages. Accordingly, it is becoming more common to use profiling information (profiling) to analyze program behavior during execution.
Optimization procedures have been developed to optimize program code by selecting the segments that are most likely to be used. Those selected code segments are typically stored in cache memory. Also, a program that itself fits in cache may use data sets that are too large to fit completely into cache memory. Program execution in each of these examples can be improved by more efficient segment caching.
With the introduction of instruction caches, which have been designed to exploit temporal and spatial locality, profiling focus was shifted to reordering code at a finer granularity. Most successful approaches to improving instruction cache performance have used profile data to predict branch outcomes. In contrast to most of the foregoing work on virtual memory performance, these techniques were implemented within the framework of optimizing compilers. Profiling gathers data about the frequencies with which different execution paths in a program are traversed. These profile data can then be fed back into the compiler to guide optimization of the code. One of the proven uses of profile data is in determining the order in which instructions should be packaged. By discovering the "hot traces" through a procedure, the optimizer can pack the instructions in those traces consecutively into cache lines, resulting in greater cache utilization and fewer cache misses. Similarly, profile data can help determine which procedures call other procedures most frequently, permitting the called procedures to be reordered in memory to reduce page faults. Thus, profile information has been used to reduce conflict misses in set-associative caches, for example. Also a reduction in conflict misses has been achieved using only static analysis of the program. Further, basic blocks have been traced to reduce the number of unexecuted instructions brought into the instruction cache (cache pollution), and to order basic blocks. It also is known that infrequently executed traces can be separated entirely from the main procedure body for additional efficiency. Other methods of reordering instructions are based on the presence of loops, branch points, and join points in the control flow graph, as well as based directly on the control dependence graph.
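The packing of hot traces described above can be illustrated with a minimal sketch. It assumes that profile data are available as a mapping from (source block, destination block) pairs to traversal counts, and it uses a simple greedy rule (always follow the most frequent successor not yet placed). The function name `greedy_layout`, the block labels and the counts are hypothetical, and the greedy rule is only one of many possible ordering heuristics, not a method taken from any of the cited work.

```python
def greedy_layout(entry, edge_counts):
    """Order blocks by repeatedly following the most frequent unplaced successor."""
    # Group outgoing edges by source block as (count, destination) pairs.
    successors = {}
    for (src, dst), count in edge_counts.items():
        successors.setdefault(src, []).append((count, dst))

    all_blocks = {block for edge in edge_counts for block in edge} | {entry}
    layout, placed = [], set()
    pending = [entry]  # starting points for traces still to be laid out

    while pending:
        block = pending.pop()
        while block is not None and block not in placed:
            layout.append(block)
            placed.add(block)
            candidates = sorted(
                (c for c in successors.get(block, []) if c[1] not in placed),
                reverse=True,
            )
            # Defer the colder successors so they land after the hot trace;
            # reversing makes the hottest deferred block the next one popped.
            pending.extend(dst for _, dst in reversed(candidates[1:]))
            block = candidates[0][1] if candidates else None

    # Blocks never reached from the entry are appended at the end.
    layout.extend(sorted(all_blocks - placed))
    return layout


# Hypothetical profile: the error path ERR is taken once in 100 executions.
edges = {("A", "B"): 99, ("A", "ERR"): 1, ("B", "C"): 99, ("ERR", "C"): 1}
print(greedy_layout("A", edges))
```

In this example the frequently executed blocks A, B and C are placed consecutively, and the rarely executed error path is moved after them instead of sitting between A and B.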
Thus, it is a goal of a good cache management procedure to take advantage of locality properties. Ideally, blocks of instructions or data that are expected to be used together are stored together within the cache memory. Program slowdowns occur when code currently being executed by the computer needs code that is outside of the cache or, even worse, outside of main memory, stored in non-volatile storage, e.g. on disk. This can happen, for example, with a branch or a call, or when a calculation being done on data in a database that may be partially stored in cache requires data from that database that is not stored in the cache. Each branch to code out of the cache and, even more so, to code out of the main memory slows execution.
Accordingly, optimizing compilers have been developed which convert source code into object code arranged in what is hoped to be an optimum instruction order, such that execution is maintained within code that is in cache at any particular point in time. These optimizing compilers typically attempt to group code into manageable groups that are compartmentalized or contained within a reasonably sized segment such that the segment may be maintained in cache memory while it is being executed.
However, the cache optimization program cannot improve the code itself, i.e., if blocks of instructions are not organized such that related blocks are in close execution proximity to each other, the cache optimization program cannot guess which blocks are more likely to be executed and what is the optimum execution order. R. R. Heisch in "FDPR for AIX Executables," AIXpert, No. 4 (August 1994), pp. 16-20 and "Trace-Directed Program Restructuring for AIX Executables," IBM Journal of Research and Development 38, No. 5, 595-603 (September 1994) teaches that instruction cache performance can be maximized by considering it as a whole-program optimization. Heisch's methods differ from previous approaches by operating as a post-processor on executable program objects and by allowing basic blocks to migrate without being constrained by procedure boundaries. The reordered code was appended to the original executable program objects, resulting in reported growth in executable file size of between 5 and 41 percent. This growth had negligible impact on performance. I. Nahshon and D. Bernstein, "FDPR-A Post-Pass Object Code Optimization Tool," Proceedings of the Poster Session of CC '96, International Conference on Compiler Construction, Sweden (April 1996), pp. 97-104 produced an improved algorithm that required less code growth. A FDPR (feedback-directed program restructuring) tool by IBM Corporation embodies the teachings of Heisch and Nahshon et al. W. J. Schmidt, R. R. Roediger, C. S. Mestad, B. Mendelson, I. Shavit-Lottem and V. Bortnikov-Sitnitsky, "Profile-directed restructuring of operating system code", IBM Systems Journal, Vol. 37, No. 2, teach a profiling system for restructuring the code, based on observed frequencies of calls to basic blocks and the usage of particular branches within the code.
Such systems run on typical applications and collect execution statistics, counting the number of times each basic block (B) is executed and, optionally, the number of times each sibling block (B1, B2, . . . ) is expected to execute after B, where B1, B2 . . . are all the possible successors of B according to the code. Blocks that are executed substantially more often than others are considered "hot" and their positions within the code are then revised with respect to the profile information, so that the restructured code is expected to run more efficiently on typical applications. Similarly, data sets can be restructured with respect to affinities that are induced by execution patterns, so that blocks of data that tend to be accessed closely together are placed in proximity to each other.
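The statistics gathering described above can be sketched as follows. The sketch assumes that a profiling run yields an execution trace as a list of basic block labels. The function names, the block labels and the rule that a block is hot when its count exceeds `factor` times the mean count are illustrative assumptions, since the cited systems do not prescribe a particular threshold here.

```python
from collections import Counter


def profile_trace(trace):
    """Count block executions and block-to-successor transitions in a trace."""
    block_counts = Counter(trace)
    # Each adjacent pair (B, Bi) records that Bi executed immediately after B.
    edge_counts = Counter(zip(trace, trace[1:]))
    return block_counts, edge_counts


def hot_blocks(block_counts, factor=2.0):
    """Return blocks executed more than `factor` times the mean count.

    The threshold rule is an assumption made for this illustration.
    """
    if not block_counts:
        return set()
    mean = sum(block_counts.values()) / len(block_counts)
    return {block for block, n in block_counts.items() if n > factor * mean}


# Hypothetical trace: a loop over B1 and B2 dominates the run.
trace = ["B0"] + ["B1", "B2"] * 50 + ["B3", "B4"]
blocks, edges = profile_trace(trace)
print(sorted(hot_blocks(blocks)))
print(edges[("B1", "B2")], edges[("B2", "B1")])
```

The successor counts in `edges` correspond to the optional per-sibling statistics, and could be used to decide which successor of a hot block should be placed immediately after it.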
Unfortunately, a typical application may run in several different modes and, therefore, whether a block is "hot" may depend on the mode. For example, an application may have two modes of execution, an equation solving mode and a simulation mode. Naturally, blocks that are related to equation solving are hot when the system is solving an equation, while those related to simulation are hot while the system is in its simulation mode. However, deriving global block statistics may result in deeming that none of the blocks are hot since, on the average, none is used significantly more often than any other. Thus, even though programmers may have designed the source code in a way that promotes locality of these two modes, significant portions of the code very likely are placed in a structure that does not take advantage of this modality and the instruction cache.
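The masking effect of global statistics can be seen in a small numeric sketch. The block labels (E1, E2 for equation solving, S1, S2 for simulation), the trace and the hotness rule (count greater than 1.5 times the mean) are all hypothetical values chosen for illustration.

```python
from collections import Counter

ALL_BLOCKS = ["E1", "E2", "S1", "S2"]


def counts_over(window):
    """Count executions in a window, keeping zero entries for unused blocks."""
    counts = Counter({block: 0 for block in ALL_BLOCKS})
    counts.update(window)
    return counts


def hot(counts, factor=1.5):
    """Blocks whose count exceeds `factor` times the mean (assumed rule)."""
    mean = sum(counts.values()) / len(counts)
    return {block for block, n in counts.items() if n > factor * mean}


solve_phase = ["E1", "E2"] * 100      # equation solving mode
simulate_phase = ["S1", "S2"] * 100   # simulation mode
trace = solve_phase + simulate_phase

print(sorted(hot(counts_over(trace))))           # whole run
print(sorted(hot(counts_over(solve_phase))))     # equation solving only
print(sorted(hot(counts_over(simulate_phase))))  # simulation only
```

Over the whole run every block has the same count, so none stands out, whereas each mode taken separately has a clear pair of hot blocks.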
Thus, there is a need for a method of organizing programs for optimal block placement under all operating conditions.
The present invention is a program product and method of compiling a computer program to optimize its performance. First, after initialization, a profiling run is done on computer code which may include program code blocks and data in a database. Execution of each computer program step is monitored and each occurrence of each individual code unit is logged, e.g. each instruction block or block of data. Frequently occurring code units are identified periodically as hot blocks. An initial snapshot of hot blocks is logged, e.g., when identified hot blocks exceed an initial block number. Profiling continues until the profiling run is complete, updating identified hot blocks and logging a new current snapshot whenever a current set of identified hot blocks contains a selected percentage of different hot blocks. Snapshots are selected as representative of different program modes. The program is optimized according to program modes.
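The snapshot logging summarized above may be sketched as follows, under several assumptions that are not taken from the specification: the trace is examined in fixed periods of `period` steps, a block is hot in a period when it accounts for at least `hot_share` of that period's executions, and the "selected percentage" is measured as the fraction of currently hot blocks that are absent from the most recent snapshot. The function and parameter names are hypothetical.

```python
from collections import Counter


def collect_snapshots(trace, period=100, hot_share=0.2,
                      initial_count=1, change_fraction=0.5):
    """Log a snapshot of hot blocks whenever the hot set changes enough."""
    snapshots = []
    for start in range(0, len(trace), period):
        window = trace[start:start + period]
        counts = Counter(window)
        # Assumed hotness rule: at least `hot_share` of this period's executions.
        hot = frozenset(b for b, n in counts.items()
                        if n >= hot_share * len(window))
        if not hot:
            continue
        if not snapshots:
            # Initial snapshot: hot blocks exceed an initial block number.
            if len(hot) > initial_count:
                snapshots.append(hot)
            continue
        # New snapshot: enough of the current hot blocks differ from the last one.
        changed = hot - snapshots[-1]
        if len(changed) / len(hot) >= change_fraction:
            snapshots.append(hot)
    return snapshots


# Hypothetical run: an equation solving mode followed by a simulation mode.
trace = ["E1", "E2"] * 150 + ["S1", "S2"] * 150
for index, snapshot in enumerate(collect_snapshots(trace)):
    print(index, sorted(snapshot))
```

For this trace the sketch logs two snapshots, one per mode, which could then serve as candidates for the representative snapshots according to which the program is optimized.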