1. Field of the Invention
The present invention relates to restructuring the order of a program running on a computer system to improve performance of the program. More specifically, the invention is a system that globally reorders instructions in a computer program, in order to optimize its performance, while maintaining its functionality, debuggability and structure.
2. Description of Related Art
Current high-performance computer memory architectures are optimized for programs which exhibit high spatial and/or temporal locality for both instructions and data. Memory hierarchies have evolved in an attempt to minimize cost and maximize performance by exploiting this "locality of reference" program characteristic. Basically, "locality" refers to accessing locations in memory (including cache memory) which are close to one another. It is most efficient to fill, or "pack," instructions and data (information) as closely together as possible in memory. Otherwise, the program will spend a great deal of time fetching the needed data and/or instructions from widely scattered memory locations.
The improved performance offered by cache memory is due primarily to the program characteristic of "locality of reference". This implies that a program usually references data or instruction memory in confined, physically close groups. A program which exhibits good "locality of reference" for either data or instruction memory references will usually realize improved performance when cache memory is present.
Cache memory is usually one of direct mapped, n-way set associative, or fully associative. Due to the expense of implementing a fully associative cache, cache memory is typically implemented as either direct mapped or n-way set associative. FIG. 8a illustrates a 2-way set associative cache. A 2-way set associative cache has two sets of memory for each cache address. As shown in FIG. 8b, two or more real addresses can share the same cache address, which reduces the probability of a cache miss and thus improves performance (as compared to a direct mapped cache). A cache miss occurs when a memory reference does not find its data in the cache, and it usually results in a cycle time penalty of 10x or more.
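For illustration only, the sharing of one cache address by several real addresses can be sketched as follows. The line size, the number of cache addresses, the modulo mapping, and the name `cache_address` are assumptions chosen for the example and are not taken from FIG. 8a or FIG. 8b.

```python
# Hypothetical cache geometry (assumed values, not from the figures).
LINE_SIZE = 32    # bytes per cache line
NUM_SETS = 256    # number of cache addresses per set of memory


def cache_address(real_addr):
    """Map a real address to a cache address using a simple modulo scheme."""
    line_number = real_addr // LINE_SIZE
    return line_number % NUM_SETS


# Two real addresses separated by exactly LINE_SIZE * NUM_SETS bytes
# map to the same cache address; a neighboring line does not.
a = 0x4000
b = a + LINE_SIZE * NUM_SETS
c = a + LINE_SIZE

print(cache_address(a) == cache_address(b))  # True
print(cache_address(a) == cache_address(c))  # False
```

Under this assumed mapping, a 2-way set associative cache can hold both `a` and `b` at once, whereas a direct mapped cache would have to evict one to load the other.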
However, a cache performance problem arises when a CPU must repeatedly fetch instructions that are separated by the approximate size of the cache set (i.e., instructions that do not exhibit good locality of reference). For example, FIG. 9a illustrates a tight program loop among three physically separate basic blocks. For this example, the instructions are assumed to be separated by the size of the cache set. Under these conditions, real memory addresses A1, A2, and A3 all map to the same cache address (n). Since the cache is 2-way set associative, there is only room for two instructions at cache address n and, therefore, this code sequence will suffer extreme performance degradation due to constant cache "conflict" misses on at least one instruction fetch for each basic block throughout the execution of the loop.
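The conflict-miss behavior of such a loop can be approximated with a small simulation. This is a minimal sketch under stated assumptions: the cache geometry, the least-recently-used (LRU) replacement policy, the starting address, and the name `simulate` are all hypothetical, since the text above does not specify them.

```python
from collections import OrderedDict

# Hypothetical cache geometry (assumed values).
LINE_SIZE = 32    # bytes per cache line
NUM_SETS = 256    # cache addresses per set of memory
WAYS = 2          # 2-way set associative, as in the example above


def simulate(addresses, iterations):
    """Return (hits, misses) for a loop fetching `addresses` in order.

    Replacement is LRU, which is an assumption of this sketch.
    """
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    hits = misses = 0
    for _ in range(iterations):
        for addr in addresses:
            line = addr // LINE_SIZE
            index = line % NUM_SETS   # cache address
            tag = line // NUM_SETS    # identifies the real address
            ways = sets[index]
            if tag in ways:
                hits += 1
                ways.move_to_end(tag)         # mark as most recently used
            else:
                misses += 1
                if len(ways) >= WAYS:
                    ways.popitem(last=False)  # evict least recently used
                ways[tag] = True
    return hits, misses


SET_SIZE = LINE_SIZE * NUM_SETS
# Three blocks separated by the set size: all share one cache address.
scattered = [0x10000 + i * SET_SIZE for i in range(3)]
# The same three blocks placed in adjacent cache lines.
packed = [0x10000 + i * LINE_SIZE for i in range(3)]

print(simulate(scattered, 100))  # (0, 300)
print(simulate(packed, 100))     # (297, 3)
```

With the assumed LRU policy, every fetch in the scattered layout misses, while the packed layout misses only on the first pass through the loop. Other replacement policies may miss less often, but at least one of the three blocks must still be evicted on each iteration.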
Further, cache memory is usually allocated by the cache line, which is typically much larger than a single instruction. Each reference to sparse or non-local instructions results in the allocation of a full cache line. The additional instructions brought into the cache, but not used, degrade cache utilization efficiency, increase cache misses, and reduce overall performance.
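As a rough illustration of cache utilization efficiency, the fraction of allocated cache-line bytes that hold executed instructions can be estimated as below. The 4-byte instruction size, the 32-byte line size, the assumption that instructions are aligned and never straddle a line, and the name `line_utilization` are all hypothetical.

```python
def line_utilization(instr_addrs, instr_size=4, line_size=32):
    """Fraction of allocated cache-line bytes occupied by executed instructions.

    Assumes aligned, fixed-size instructions that never straddle a line.
    """
    unique = set(instr_addrs)
    lines = {addr // line_size for addr in unique}
    return len(unique) * instr_size / (len(lines) * line_size)


# Eight executed instructions, each landing in a different cache line.
sparse = [i * 32 for i in range(8)]
# The same eight instructions packed into a single cache line.
dense = [i * 4 for i in range(8)]

print(line_utilization(sparse))  # 0.125
print(line_utilization(dense))   # 1.0
```

In the sparse case, eight cache lines are allocated to hold eight instructions, so seven eighths of the fetched bytes are never executed.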
To reduce the chance of cache misses, it would be desirable to group instructions which are executed together in code loops as close together as possible.
Further, design assumptions are typically made regarding other program characteristics (such as branching behavior) which result in processor designs optimized for those assumed characteristics (such as branch prediction).
As long as these program assumptions hold, processor performance is maximized. However, when a program deviates from these assumed characteristics, the processor architecture is inefficiently utilized, which usually leads to reduced performance or excessive use of real memory.
While hardware design tradeoffs are made on the basis of software-related assumptions, compilers attempt to generate "optimum" code targeted for a specific hardware architecture (including the memory hierarchy) on the basis of similar program assumptions. However, compiler optimizations are usually limited to a purely static analysis of a program which includes speculation as to how a program will probably execute on a given hardware platform. Additionally, since many programs result from binding together multiple, separately compiled (or assembled) object modules, the compiler does not usually have a "global view" of the final organization of the executable image (program code) and therefore cannot perform a truly global optimization.
It can be seen that a need exists for a system that will allow the global reordering of instructions in a program while maintaining its structure and debuggability.