1. Field of the Invention
The present invention relates in general to the field of software code development, and more particularly to application performance optimization through code outlining.
2. Description of the Related Art
Computer software applications continue to grow in size and complexity, generating large code bases in the process. Development of these applications has traditionally followed an iterative process of coding, compiling, debugging and optimization to improve code operation and/or performance. As the size of the code base increases, efficient memory hierarchy usage can be a factor in application performance.
In most computers, whenever a memory location is referenced by a program, the information located in the referenced location, along with information from nearby memory locations, is brought into a cache. While a hardware cache checks every requested address to determine whether the data is present, it is impractical to check whether every instruction in a program is present in the cache, as several overhead instructions may be used for every useful instruction executed. Program code is often divided into blocks which are loaded into the cache as a unit, and as long as execution proceeds within a block, no cache checks are needed. However, if the flow of control leaves a block, it may be necessary to check and determine whether the next block has already been loaded into the cache.
References to data currently in a cache-line can be one or two orders of magnitude faster than references to main memory. During the execution of a program, memory references exhibit spatial locality when nearby memory locations within the same cache-line are referenced, and temporal locality when the same memory location is reused before its cache-line is evicted. The locality of a program can be improved by changing the order of computation (referred to as iteration reordering), or the assignment of data to memory locations (referred to as data reordering), so that references to the same or nearby locations occur relatively close in time during the execution of the program. In general, known compiler optimization techniques can improve instruction cache efficiency by iteratively reordering functions and basic blocks to improve both temporal and spatial locality. Such improvements are typically achieved by placing infrequently executed basic blocks (i.e., cold blocks) away from the main function body of frequently executed blocks (i.e., hot blocks), in an optimization technique referred to as “code outlining.”
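By way of illustration only, the effect of moving cold blocks away from hot blocks can be modeled with a short sketch. The sketch below is not taken from any particular compiler; the 64-byte cache-line size, the block names, and the block sizes are all assumptions chosen for the example. It counts how many cache-lines the hot blocks occupy before and after the cold blocks are moved behind them.

```python
LINE_SIZE = 64  # assumed instruction cache-line size in bytes (illustrative)


def hot_lines(layout):
    """Return the set of cache-line indices touched by hot blocks.

    `layout` is a list of (name, size_in_bytes, is_hot) tuples placed
    back to back starting at address 0.
    """
    lines = set()
    addr = 0
    for _name, size, is_hot in layout:
        if is_hot:
            first = addr // LINE_SIZE
            last = (addr + size - 1) // LINE_SIZE
            lines.update(range(first, last + 1))
        addr += size
    return lines


def outline(layout):
    """Place all hot blocks first, then all cold blocks, keeping relative order."""
    hot = [block for block in layout if block[2]]
    cold = [block for block in layout if not block[2]]
    return hot + cold


# Hypothetical function body: hot blocks interleaved with cold error paths.
blocks = [
    ("entry", 32, True),
    ("error_a", 96, False),
    ("loop", 48, True),
    ("error_b", 80, False),
    ("exit", 16, True),
]

lines_before = len(hot_lines(blocks))
lines_after = len(hot_lines(outline(blocks)))
print(lines_before, lines_after)
```

In this model the hot path spans fewer cache-lines once the cold blocks are outlined. The model ignores the branch instructions that an actual compiler would have to retarget after moving the blocks.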
Control is typically transferred between hot and cold blocks via a control transfer instruction (CTI). Reduced instruction set computer (RISC) architectures typically use 32-bit fixed length instruction formats to improve the speed of instruction fetch and decoding. However, this fixed length limits the distance between a CTI and its target, thereby limiting the maximum distance between the hot and cold blocks and, in turn, the potential performance benefit of basic block outlining. Without knowledge and control over the final code layout, a compiler generally uses a “trampoline” to redirect execution flow to outlined cold blocks. A trampoline is a relatively small piece of code, typically created at run time, that enables branches to occur outside of the confines of more or less contiguous binary program code.
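The distance limit can be made concrete with a small arithmetic sketch. The sketch assumes that a CTI encodes its target as a signed displacement counted in 4-byte instruction words relative to the address of the CTI itself. The 19-bit field width used in the example is a hypothetical value, not a property of any specific architecture described here, and the function names are likewise illustrative.

```python
INSTR_BYTES = 4  # fixed 32-bit instruction length


def branch_reach(disp_bits):
    """Return (lowest, highest) byte offsets encodable in a signed
    word displacement field that is `disp_bits` wide."""
    lowest = -(1 << (disp_bits - 1)) * INSTR_BYTES
    highest = ((1 << (disp_bits - 1)) - 1) * INSTR_BYTES
    return lowest, highest


def needs_trampoline(cti_addr, target_addr, disp_bits):
    """True when the target lies outside the CTI's direct reach, so that
    an intermediate jump (a trampoline) would be required."""
    lowest, highest = branch_reach(disp_bits)
    return not (lowest <= target_addr - cti_addr <= highest)


# Hypothetical 19-bit displacement: roughly +/- 1 MB of reach.
print(branch_reach(19))
print(needs_trampoline(0x1000, 0x2000, 19))             # nearby cold block
print(needs_trampoline(0x1000, 0x1000 + 0x200000, 19))  # cold block 2 MB away
```

Under these assumptions, a cold block placed 2 MB from the branching hot block cannot be reached directly, which is the situation in which a compiler falls back on a trampoline.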
For example, FIG. 1, labeled prior art, shows a code outlining optimization implementation using trampolines. The code outlining optimization implementation includes computer system 100 having main memory 112, which contains operating system 114, which allows implementation of compiler 118 and linker 122. Compiler 118 converts source code 116 into object code 120, which is linked by linker 122 into executable code 124. Source code 116 may include any computer program written in a high-level programming language. Executable code 124 includes executable instructions for a specific virtual machine or specific processor architecture. Compiler 118 interacts with code outliner 126 to place infrequently executed basic blocks (i.e., cold blocks) away from the main function body of frequently executed blocks (i.e., hot blocks). Code outliner 126 includes a trampoline insertion module 128 to insert trampolines to enable branches to occur outside of the confines of more or less contiguous binary object code 120.
In some cases, these branches can occur when seldom executed basic blocks (cold blocks) are located beyond a predetermined branch distance limit, such as in RISC architectures. The introduction of trampoline code, placed within a CTI's target distance limit, can solve the distance limitation problem, but can incur a performance penalty in the process. Furthermore, the introduction of additional instructions such as trampolines can increase the number of instructions executed and add control redirection, both of which can impact instruction cache efficiency. What is needed is a way to achieve the benefits of code outlining without the performance penalties incurred through the use of trampolines.