Post-link code optimizers generally perform global analysis on the entire executable code, including statically-linked library code. Since the executable code will not be re-compiled or re-linked, the post-link optimizer need not preserve compiler and linker conventions. It can thus perform aggressive optimizations across compilation units, in ways that are not available to optimizing compilers. Additionally, a post-link optimizer does not require the source code to enable its optimizations, allowing optimization of legacy code and libraries where no source code is available.
At the same time, post-link optimizers must deal with difficulties that are not encountered in compile-time optimization. Optimizing compilers operate on input in the form of high-level language description, typically source code, whose semantics are clearly defined. By contrast, static post-link optimization tools receive as their input low-level executable instructions (object code). The post-link optimizer must first attempt to disassemble the object code in order to identify the data and instructions contained in the code. Even after disassembly, the semantics of executable code given to a post-link optimizer may be only partially known, for the following reasons:
Code instructions and data within an executable object are intermingled, making it impossible in some cases to distinguish between the two. Typically, there are code areas that cannot be classified unequivocally by the optimizer as either data or code instructions. In the context of the present patent application, these areas are referred to as “non-classified code areas.”
Even for fully-classified code areas that are known to contain code instructions, the semantics of the program cannot always be determined. Machine instructions operate on registers, which may contain either data information or memory locations. Therefore, the optimizer may be uncertain as to whether a given instruction performs a pure data computation, or whether it calculates an address to be used later in the program to retrieve/store data or to perform a control transfer within the program code.
Data elements and their types, such as arrays, structures or scalars, can be identified only in a high-level language, such as C, C++, Pascal, etc. In post-link code, these elements appear as arbitrary data areas, and references to them are treated as arbitrary addresses. Therefore, at the post-link level, references to data elements cannot be fully resolved.
Because of these factors, the code semantics of the program may never be fully extracted from post-link code, and some of the dependencies between the data and code elements used by the program may remain uncertain.
Haber et al. describe an approach for dealing with these difficulties in an article entitled, “Reliable Post-Link Optimizations Based on Partial Information,” in Proceedings of Feedback Directed and Dynamic Optimizations Workshop 3 (Monterey, Calif., December, 2000), pages 91–100, which is incorporated herein by reference. First, the program to be optimized is disassembled into basic blocks, by incrementally following all control flow paths that can be resolved in the program. The basic blocks are marked as either code, data or unclassified (not fully analyzed). Code blocks are further flagged according to their control flow properties. Partially analyzed areas of the program are delimited so as to contain the unclassified blocks, while relieving the rest of the program of the limitations that these blocks impose on optimization. The partially analyzed areas are chosen so that even when they cannot be internally optimized, they can still be repositioned safely en bloc to allow reordering and optimization of the code as a whole.
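By way of illustration only, the incremental classification described above can be sketched in Python over a toy instruction set. The sketch is not the implementation of Haber et al.; the opcode names ("op", "jmp", "br", "ijmp", "ret"), the one-instruction-per-address image, and the known_data parameter are all hypothetical simplifications. It works at instruction granularity and does not group instructions into basic blocks.

```python
def classify_addresses(image, entry_points, known_data=frozenset()):
    """Label each address as code, data, or unclassified.

    image maps an address to a hypothetical decoded pair (opcode, target).
    Only control flow that can be resolved statically is followed.
    """
    labels = {addr: ("data" if addr in known_data else "unclassified")
              for addr in image}
    worklist = list(entry_points)
    while worklist:
        addr = worklist.pop()
        if addr not in image or labels[addr] != "unclassified":
            continue                      # outside the image, data, or already visited
        labels[addr] = "code"
        op, target = image[addr]
        if op == "jmp":                   # direct jump: one resolved successor
            worklist.append(target)
        elif op == "br":                  # conditional branch: target and fall-through
            worklist.extend([target, addr + 1])
        elif op == "op":                  # ordinary instruction: fall-through only
            worklist.append(addr + 1)
        # "ret" and "ijmp" (indirect jump) have no statically resolvable successor
    return labels


# Toy image: addresses 5 and 6 are reachable only through the indirect jump at 4.
IMAGE = {
    0: ("op", None),
    1: ("br", 4),
    2: ("op", None),
    3: ("ret", None),
    4: ("ijmp", None),
    5: ("op", None),
    6: ("ret", None),
    7: ("op", None),   # assumed to be known data, e.g. a constant table
}

labels = classify_addresses(IMAGE, entry_points=[0], known_data={7})
```

In this toy run, addresses 0 through 4 are reached by resolved control flow, while addresses 5 and 6 stay unclassified because the only way to reach them is the unresolved indirect jump.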
Use of post-link runtime profiling as a tool for optimization and restructuring is described by Henis et al., in “Feedback Based Post-Link Optimization for Large Subsystems,” Second Workshop on Feedback Directed Optimization (Haifa, Israel, November, 1999), pages 13–20; and by Schmidt et al., in “Profile-Directed Restructuring of Operating System Code,” IBM Systems Journal 37:2 (1998), pages 270–297. These publications are incorporated herein by reference.
Runtime profiling of the program proceeds in two stages and creates a log that records usage statistics for each code block. First, in an instrumentation stage, each basic block is modified with either a new header or footer, wherein the added code increments a counter every time that basic block is run. In the second stage (the execution stage), the modified program is executed. At the end of the execution, the counters are written into a log file. Statistical analysis of the frequency of use of each basic block provides a way to rank the code blocks by importance. Code blocks that are frequently executed are called “hot” blocks, as opposed to rarely executed “cold” blocks.
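The two profiling stages may be sketched as follows, assuming for illustration that each basic block is modeled as a Python function that returns the name of the next block to run. The function names, the toy program, and the 5% hot threshold are hypothetical and are not taken from the cited publications; writing the counters to a log file is omitted.

```python
from collections import Counter


def instrument(blocks):
    """Instrumentation stage: wrap every block with counter-incrementing code."""
    counters = Counter({name: 0 for name in blocks})

    def wrap(name, body):
        def instrumented(state):
            counters[name] += 1      # added header: count this execution
            return body(state)       # original block body, unchanged
        return instrumented

    return {name: wrap(name, body) for name, body in blocks.items()}, counters


def execute(blocks, entry, state):
    """Execution stage: run blocks until one returns None (program exit)."""
    name = entry
    while name is not None:
        name = blocks[name](state)


def classify(counters, hot_fraction=0.05):
    """Label a block hot if it accounts for at least hot_fraction of all executions."""
    total = sum(counters.values())
    return {name: ("hot" if total and count / total >= hot_fraction else "cold")
            for name, count in counters.items()}


# Toy program: a loop of 100 iterations with an error path taken once.
def entry(state):
    state["i"] = 0
    return "loop"


def loop(state):
    state["i"] += 1
    if state["i"] == 50:
        return "error"
    return "loop" if state["i"] < 100 else "exit"


def error(state):
    state["errors"] = state.get("errors", 0) + 1
    return "loop"


def exit_block(state):
    return None


PROGRAM = {"entry": entry, "loop": loop, "error": error, "exit": exit_block}

instrumented, counters = instrument(PROGRAM)
execute(instrumented, "entry", {})
block_labels = classify(counters)
```

Here the loop block is counted 100 times and every other block once, so only the loop block is labeled hot under the assumed threshold.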
When a function using certain registers is called during execution of a program, it is generally necessary to store (save to memory) the contents of these registers before the function starts to run, and then to restore the register contents when the function returns. For this purpose, compilers typically add appropriate store instructions to a prolog of the function in the compiled code, with corresponding restore instructions in an epilog. Because memory access has become a bottleneck for modern high-speed processors, eliminating superfluous store and restore operations can reduce program execution time substantially.
Martin et al. describe a method of compiler optimization based on eliminating storing and restoring the contents of dead registers in “Exploiting Dead Value Information,” published in Proceedings of Micro-30 (Research Triangle Park, N.C., 1997), which is incorporated herein by reference. Dead value information, providing assertions as to future use of registers, is calculated at compile time. The authors suggest that processor instruction set architectures be extended to enable this information to be communicated to the processor. In the absence of this hardware specialization, standard RISC call conventions may still allow a subset of the dead value information to be inferred and used by the processor in eliminating some of the store and restore operations at procedure calls and returns.
Cohn and Lowney describe a method of post-link optimization based on identifying frequently executed (hot) and infrequently executed (cold) blocks of code in functions in “Hot Cold Optimizations of Large Windows/NT Applications,” published in Proceedings of Micro 29 (Research Triangle Park, N.C., 1996), which is incorporated herein by reference. The object code is disassembled into component code blocks, and the control flow graph (CFG) of the flow of control through the program is constructed. Code blocks are classified into code (instructions) and data. The code sections are further classified into functions. Using profile information, the functions are analyzed to find code blocks that are rarely executed. By experimentation, the authors chose to optimize functions containing blocks with less than 1% probability of execution. The code blocks in such functions that are on the primary path of execution are labeled “hot,” and the rarely executed code blocks are labeled “cold.” All hot blocks of code in the hot function are copied to a new location. All calls to the function are redirected to the new location. Flow paths in the hot routine that target cold code blocks are redirected to the appropriate location in the original function. Once the control path returns to the original function, it does not pass back to the copied function.
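The copying and redirection steps can be illustrated with a short Python sketch. It is not the Spike implementation: the CFG representation (a mapping from block name to successor names), the "_hot" naming of copied blocks, and the interpretation of the 1% figure as a block count relative to the function entry count are assumptions made for the example.

```python
def split_hot_cold(cfg, counts, entry, threshold=0.01):
    """Copy the hot blocks of one function and redirect cold targets.

    cfg maps each block name to the list of its successor block names.
    counts maps each block name to its profiled execution count.
    Returns the set of hot blocks and the CFG of the copied hot function.
    """
    entry_count = counts[entry]
    hot = {b for b in cfg
           if entry_count and counts[b] / entry_count >= threshold}
    hot_copy = {}
    for block in hot:
        hot_copy[block + "_hot"] = [
            # Hot successors stay inside the copy; cold successors are
            # redirected to the corresponding block of the original function.
            succ + "_hot" if succ in hot else succ
            for succ in cfg[block]
        ]
    return hot, hot_copy


# Toy function: block C is an error path taken on 0.5% of the calls.
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
COUNTS = {"A": 1000, "B": 995, "C": 5, "D": 1000}

hot, hot_copy = split_hot_cold(CFG, COUNTS, entry="A")
```

Because the original function is left untouched, its block C still continues to the original block D, so control that enters the original function does not pass back to the copy.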
The new function is then optimized at the expense of paths of execution that pass through the cold path. The optimization comprises identifying unneeded code in the new hot function, and moving it to a stub that is called when the cold portion of the function is invoked, before actually returning to the original function. Cohn and Lowney describe five different types of optimization of the hot code:
Partial dead code elimination: the removal of dead code from the hot function. Once the cold code is removed from the hot function, some of the remaining instructions may be superfluous. An example of such an instruction is an add instruction that writes to a register that is only referenced within the cold code but is positioned within the hot block. The dead code is moved to the stub.
Non-volatile register elimination: the removal of the save and restore of non-volatile registers in the hot procedure. Non-volatile registers must be stored (restored) in the function prolog (epilog). Once dead code is removed from the hot function, the use of the non-volatile registers in the hot function is analyzed. If the registers are only referenced in the cold code, the store (restore) instructions are removed from the prolog (epilog) of the hot function, and the store instructions are moved to the stub. Since the cold code is followed by the original function epilog, the original restore instructions will restore the registers.
Stack pointer adjust elimination: the removal of the stack adjusts in the hot function. If all the non-volatile store instructions can be removed from the function prolog, the stack pointer adjustment (on computer architectures that require stack adjusts) can also be moved to the stub.
Peephole optimization: the removal of self-assignments and conditional branches with an always-false condition. Once the dead code is removed and excess non-volatile registers are freed, an additional pass through the code can identify instructions that are now irrelevant. An example of such an instruction is a restore instruction of a removed register that was turned into a self-assignment by copy propagation.
Inlining the hot function: the removal of control transfer to the hot function. Code straightening can be applied to the optimized code to inline the hot function.
Cohn and Lowney have implemented their methods of optimization in a tool named “Spike,” which is used to optimize executables for the Windows NT™ operating system running on Alpha™ processors. Their method of classifying blocks as hot or cold requires a complete understanding of the CFG. It cannot be used if unclassified blocks appear in the control flow of the hot function. The method of eliminating non-volatile registers also requires that there be no references to the non-volatile register left in the function after removal of dead code. Additionally, the method of elimination of non-volatile registers requires duplication of the hot code to a new location.
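The decision underlying non-volatile register elimination and stack pointer adjust elimination, as summarized above, can be sketched in Python. The sketch is a simplified illustration rather than the Spike algorithm: the function name, the register names, and the per-block register sets are hypothetical, and the register sets are assumed to exclude the prolog and epilog save/restore instructions themselves.

```python
def plan_register_elimination(saved_regs, block_regs, hot_blocks):
    """Decide which prolog saves can move from the hot function to the stub.

    saved_regs: non-volatile registers saved in the function prolog.
    block_regs: maps each block name to the registers referenced in it.
    hot_blocks: names of the blocks that were copied into the hot function.
    """
    used_in_hot = set()
    for block in hot_blocks:
        used_in_hot |= block_regs[block]
    # A save can move to the stub only if no reference to the register
    # remains anywhere in the hot function.
    to_stub = {reg for reg in saved_regs if reg not in used_in_hot}
    return {
        "saves_moved_to_stub": to_stub,
        "saves_kept_in_prolog": set(saved_regs) - to_stub,
        # The stack adjust can move only if every save was removed.
        "stack_adjust_moved_to_stub": to_stub == set(saved_regs),
    }


BLOCK_REGS = {"A": {"t0", "s0"}, "B": {"t1"}, "C": {"s1", "t0"}}

# s1 is referenced only in the cold block C, so only its save can move.
partial = plan_register_elimination({"s0", "s1"}, BLOCK_REGS, hot_blocks={"A", "B"})

# If B is the only hot block, every save and the stack adjust can move.
full = plan_register_elimination({"s0", "s1"}, BLOCK_REGS, hot_blocks={"B"})
```

The first call shows the limitation noted above: a single remaining reference to s0 in the hot function keeps both its save and the stack adjust in the prolog.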
Muth et al. describe the link-time optimizer tool “alto” in “alto: A Link-Time Optimizer for the Compaq Alpha,” published in Software Practice and Experience 31 (January 2001), pages 67–101, which is incorporated herein by reference. Alto exploits the information available at link time, such as content of library functions, addresses of library variables, and overall code layout, to optimize the executable code after compilation. Alto can identify control paths where stores (restores) of non-volatile registers in function prologs (epilogs) are unnecessary, either because the registers are not touched along all execution paths through a function, or because the code that used those registers became unreachable. Code can become unreachable due to other optimizations carried out by alto, for instance because the outcome of a conditional branch could be predicted as a result of interprocedural constant propagation. The number of such stores (restores) can be reduced by moving them away from execution paths that do not need them.
Alto is similar to Spike in that its optimizations require a complete understanding of the control flow within the function. The store (restore) replacements are only carried out after other optimization techniques have created dead code within the function.