1. Field of the Invention
This invention relates generally to the runtime profiles of software programs executing on computers.
2. Description of the Related Art
Runtime profiling is a mechanism for understanding a program's runtime behavior. A runtime profile is a collection of information indicating the control flow path of a program, i.e. which instructions executed and where branches in the execution took place. Profile-based optimizations can then be used for instruction scheduling, loop scheduling, data preloading, function in-lining, and instruction cache performance enhancement.
The runtime profile of a program is used by optimizing compilers and dynamic translators to focus their analysis efforts on parts of the program where greater performance benefit is likely. Advanced compilers perform optimizations across block boundaries to increase instruction-level parallelism, enhance resource usage and improve cache performance. Profile data is also useful for software developers in tuning the performance of their programs.
Program profiling typically counts the occurrences of an event during a program's execution. The measured event is typically a local portion of a program, such as a routine, line of code or branch. More fine-grained profiling is also possible based upon basic blocks and control-flow edges. Profile information for a program can consist of simple execution counts or more elaborate metrics gathered from counters within the computer executing the program.
One conventional approach to profiling is to instrument the program code by adding profiling probes to the code. Profiling probes are additional instructions which are used to log the execution of a basic block of code containing the probe. Typically, the program is compiled with the profiling probes placed within each basic block of code. The instrumented code is then executed using several different suites of test inputs to obtain several sets of profile data. The program is subsequently recompiled using the resulting profile data to give profile-based compilation of the original program.
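The probe mechanism described above can be illustrated with a minimal sketch. The names `probe`, `block_counts` and `abs_sum` are hypothetical and chosen only for illustration; an actual compiler would insert equivalent counter-increment instructions into each basic block of the compiled code.

```python
from collections import Counter

# Hypothetical profile store: basic block label -> execution count.
block_counts = Counter()

def probe(block_id):
    """Profiling probe: log one execution of the basic block containing it."""
    block_counts[block_id] += 1

def abs_sum(values):
    """Toy instrumented routine with one probe placed in each basic block."""
    probe("entry")
    total = 0
    for v in values:
        probe("loop_body")
        if v < 0:
            probe("negate")
            v = -v
        total += v
    probe("exit")
    return total
```

Running `abs_sum` on a test input leaves the per-block execution counts in `block_counts`. The extra call in every block also shows why instrumented code is larger and slower than the original.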
Instrumentation-based methods for gathering profile data tend to be complex and time-consuming. Instrumentation of the code can result in a code size explosion due to the added instructions. The additional probe instructions also slow execution of the code, and a profiled, or instrumented, version of a program can run as much as thirty times slower than the original version. Execution slowdown is more than an inconvenience. Experience has shown that slowdown is a major reason for profile-based optimizations not being widely used in the user community.
Selection of representative test input suites for instrumented programs is important to the accuracy of the profile data. If the inputs are not selected carefully, the profile will not reflect actual usage. Programs that are highly data-dependent, such as a sort routine or a database application, have branches that are highly sensitive to user inputs. Validating the profile is difficult without a large scale study of user habits. In the absence of a user study, profiling is typically done using a large set of inputs which increases the time required to produce accurate profiling data.
To reduce the time required for instrumented profiling, smaller test input suites must be used to profile the program. Smaller test input suites, however, reduce the accuracy of the resultant profile data. Therefore, there is a trade-off between the accuracy of profiling and the time required to perform profiling.
There remain, however, some programs for which it is difficult or impossible to come up with representative test input data. Real time applications, such as operating system (OS) kernels and embedded systems, are excluded from the benefits of profile driven optimizations because of their execution nature. Long running applications, such as database systems, are often excluded from profiling as well.
Furthermore, analyzing and using the profile data requires additional processing steps. A program must be compiled with profiling enabled, executed using the test input suites, and then recompiled based upon the profile data. For small programs, this does not involve a large amount of overhead or difficulty. However, for large systems, such as commercial database applications, this requires significant alteration of build scripts. A large number of man-hours are invested in these scripts. In addition, large systems will require a significant amount of processing time for analysis and recompilation. As a result, software vendors are hesitant to adopt profile driven optimizations.
An alternative to instrumenting the program code is to use statistical program count (PC) sampling. The actual program runs under control of a profiler program which acts as a wrapper around the code which is being profiled. The profiler program makes an operating system call to set up a timer interrupt to be delivered to the profiler program at a predetermined frequency of X times per second. It also registers a "handler procedure" for this interrupt within the operating system. The actual program execution is then started and driven by a test input suite.
When a timer interrupt occurs, the handler procedure is invoked and the running program is suspended. At this point, the state of the machine (in other words, the program count of the process) is passed to the profiler program for recordation. The handler procedure also often records the values of many of the registers of the processor at the time of the interrupt.
The overhead of statistical PC sampling, and therefore the execution speed of the profiled program, is determined by the sampling frequency X that is selected. Overhead decreases and speed increases when the sampling frequency is decreased. However, the accuracy of the profile data is also determined by the sampling frequency and increases when the sampling frequency is increased. Therefore, there is a trade-off between overhead and accuracy when selecting the sampling frequency.
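The sampling mechanism and the frequency trade-off can be illustrated with a small deterministic simulation. This is only a sketch: instead of a real timer interrupt, the hypothetical function `sample_pcs` records the program count at every `period`-th step of a made-up execution trace, and the addresses in `trace` are invented for the example.

```python
from collections import Counter

def sample_pcs(pc_trace, period):
    """Record the program count seen at every `period`-th step of the trace."""
    return Counter(pc_trace[i] for i in range(period - 1, len(pc_trace), period))

# Hypothetical trace: a three-instruction loop executed 100 times,
# followed by a single instruction at address 0x20.
trace = [0x10, 0x14, 0x18] * 100 + [0x20]

coarse = sample_pcs(trace, 50)  # low sampling frequency: few handler invocations
fine = sample_pcs(trace, 4)     # higher sampling frequency: more samples, more overhead
```

The coarse profile is built from only 6 samples and the fine profile from 75, so the fine profile costs more handler invocations but gives a steadier estimate of where time is spent. Neither run samples the instruction at 0x20, which shows that rarely executed code can be missed entirely.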
Further, the statistical PC sampling approach described above typically results in too fine a level of granularity. It also does not track the control flow well and requires a high level of analysis before it can be used to optimize the code. In order to perform optimization of the program code, the profile information which indicates which parts of the code are hot (in other words, those parts of the code which execute frequently) needs to be mapped back to the program's control flow. This is difficult to do when the profile data consists of a set of program count values that are taken at arbitrary intervals. Also, due to the high level of analysis required, the analysis of the profile data is usually performed after the runtime of the program. This has the disadvantage that some of the dynamic addressing information may be lost, such as the runtime control flow of the program within a dynamically linked library. In addition, the requirement of post-runtime analysis prevents statistical PC sampling from being used for on-the-fly optimization.
Alternatively, static methods exist which are based upon compiler assumptions and do not involve the use of profile data obtained through instrumentation of the code or execution interrupts and which do not require the code to be recompiled. However, these static estimates are typically not as accurate as profiling. For example, when static estimates are used to predict branch behavior, the inaccuracy of the predictions is approximately twice that for predictions based upon profiled information. Furthermore, control flow within dynamically bound procedures is difficult to estimate statically.
Another approach is to use existing branch handling hardware to speed up profiling. The use of hardware to reduce overhead overcomes the need to trade off accuracy for lower profiling overhead, as is the case with statistical PC sampling. The use of hardware can also reduce the level of instrumentation required in the code, which avoids the code explosion and execution slowdown that occur in instrumented programs.
Hardware assisted methods for statistically profiling a program typically involve keeping a branch history of the behavior of that program. A branch history is obtained using a buffer which stores the history of branch behavior in a block of code by storing a one in the branch history buffer for each branch taken within the basic block and a zero for each branch that is not taken.
An example of a hardware assisted profiling technique that uses existing branch handling hardware in commercial processors is proposed in Conte, Patel and Cox, "Using Branch Handling Hardware to Support Profile-Driven Optimization", MICRO 27, November 1994. The scheme described obtains profiles having high accuracy with only a 0.4%-4.6% slowdown in execution for use in branch prediction hardware.
To predict a branch in existing branch prediction hardware, the branch instruction's address is combined with the current value of the branch history. This can be a global branch history of the last k branch outcomes or a table that has a per-branch history, i.e. the last k outcomes of the same branch. The resulting value is used to index into a predictor table in order to read off the prediction. After the branch actually executes, the outcome of the branch (0/1) is shifted into the branch history buffer. The branch history buffer may be a global buffer that records the outcome of every branch that executes, or it may be a per-branch buffer that records only the past history of the same branch. Bits are simply shifted off the end of the branch history register and no check is made to see if it is full. Only direct branches are handled by modern branch prediction hardware; indirect branches cannot be predicted.
Conte et al use the branch prediction hardware typically used in modern microprocessors for branch prediction to obtain profile information about a running program with very low overhead. Their scheme works as follows: (1) The program to be profiled is enhanced with a table of control flow graph (CFG) arcs. A CFG is illustrated in FIG. 1, where the arcs are represented as arrows between code blocks A-F. The CFG structure represents the static control flow of the program, as determined by a compiler compiling the program. (2) During runtime, the operating system kernel periodically reads the branch history information recorded in the branch prediction buffers, and uses it to increment counters associated with the CFG arcs. This process can be viewed as converting the CFG into a Weighted Control Flow Graph (WCFG), because the arcs of the CFG are distinguished (or "weighted") by the values of the counters that are associated with them. In order to keep the overhead low, the CFG arc counts can be updated in memory, and the entire WCFG written out to disk after the program completes execution.
Modern branch prediction hardware typically consists of a buffer, indexed by branch instruction addresses, that contains information about the recent history of branch behavior. There are many ways of organizing the history information. For example, each buffer entry may contain a record of the same branch's previous outcomes (a per-branch history), or each buffer entry could contain the outcomes of the sequence of branches that immediately preceded this branch the last time this branch was executed (a global branch history). In either case, this branch history information is extracted from the buffer entry, and used to predict the outcome of the current instance of a branch. FIG. 2A illustrates one way of organizing a branch target buffer which is indexed by the branch instruction address. Again, there are several ways of using the history information to obtain a prediction. For example, the history value can be combined with the branch instruction address and the resulting value used to index into a predictor table to obtain a predicted outcome for the current branch instruction. FIG. 2B illustrates an example of a history register table 22 which is indexed with the branch instruction address to obtain the branch history for indexing into a predictor table 24.
Once the branch instruction is actually executed, the branch history information maintained by the branch prediction hardware is updated to account for the actual outcome (0/1) of this branch. This is typically done by extracting the branch history from the buffer entry indexed by this branch into a shift register, shifting in the outcome of this branch at the end, and storing the new branch history value back to the buffer entry. FIG. 3 illustrates a history register table 22 with three sample buffer entries.
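The lookup and update steps described above can be sketched as follows, as one possible reading of the FIG. 2B organization with a per-branch history. The table sizes, the use of 2-bit saturating counters in the predictor table, and the names `predict` and `update` are assumptions made for illustration and are not taken from any particular processor.

```python
HISTORY_BITS = 4                      # assumed history length k
HRT_ENTRIES = 16                      # assumed size of the history register table
HISTORY_MASK = (1 << HISTORY_BITS) - 1

# History register table: one k-bit history per entry, indexed by branch address.
history_table = [0] * HRT_ENTRIES
# Predictor table: assumed 2-bit saturating counters (0..3), indexed by history.
predictor_table = [1] * (1 << HISTORY_BITS)

def predict(branch_addr):
    """Return True if the branch at branch_addr is predicted taken."""
    history = history_table[branch_addr % HRT_ENTRIES]
    return predictor_table[history] >= 2

def update(branch_addr, taken):
    """After execution, train the counter and shift the outcome into the history."""
    idx = branch_addr % HRT_ENTRIES
    history = history_table[idx]
    counter = predictor_table[history]
    predictor_table[history] = min(3, counter + 1) if taken else max(0, counter - 1)
    # Shift the new outcome in at the low end; the oldest bit falls off the high end.
    history_table[idx] = ((history << 1) | int(taken)) & HISTORY_MASK
```

Because the history table is indexed by the address modulo its size, different branches can map to the same entry and share one history.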
In Conte's scheme, when the operating system samples the information recorded in the branch prediction hardware's buffer, it estimates the number of times a particular branch executed, and then associates this count with the CFG arc that represents the branch instruction. There are two possible CFG arcs corresponding to each branch instruction, one for the taken direction (denoted by a 1 in the branch history) and the other for the not-taken direction (denoted by 0 in the branch history). Conte et al suggest several heuristics to estimate a CFG arc count from the branch history information, for example, the number of 1's in the branch history divided by the length of the history gives an estimate for the number of times the branch was taken.
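The ratio heuristic mentioned above can be sketched in a few lines. The function names, the `arc_weights` store and the block label "A" are hypothetical. Crediting each arc with the raw number of 1 or 0 bits in the sampled history is an assumption made for this sketch and is not necessarily the exact bookkeeping used by Conte et al.

```python
from collections import defaultdict

# Hypothetical WCFG store: (branch block, direction) -> accumulated weight.
arc_weights = defaultdict(int)

def taken_fraction(history, length):
    """Number of 1 bits in a `length`-bit history divided by its length."""
    mask = (1 << length) - 1
    return bin(history & mask).count("1") / length

def credit_arcs(branch, history, length):
    """Add one sampled history to the two CFG arcs leaving `branch`."""
    ones = bin(history & ((1 << length) - 1)).count("1")
    arc_weights[(branch, "taken")] += ones
    arc_weights[(branch, "not_taken")] += length - ones
```

For a four-bit history of 1101, `taken_fraction` returns 0.75, and `credit_arcs("A", 0b1101, 4)` adds 3 to the taken arc and 1 to the not-taken arc of block A.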
Overcounting of an arc's weight can occur if the branch history information is sampled more frequently than it changes. Zeroing out the branch history each time it is sampled by the operating system does not solve the problem, because "0" entries in the history also signify not-taken branches. The solution suggested by Conte et al is to use a leading "1" as a marker bit, shown in FIG. 3, to denote the boundary between invalid and valid branch histories. After the branch history is sampled by the operating system, it zeros the history and sets the least significant bit (LSB) of the branch history to 1. Thereafter, when the branch history shifting logic updates the branch history, this bit shifts to the left. Some additional logic is also required to detect when the marker bit reaches the most significant bit (MSB) position of the shift register. Once this occurs, Conte et al suggest an extra "full-bit" associated with the branch history be set to 1, indicating that the entire contents of the history are valid. FIG. 3 illustrates a buffer entry having its "full-bit" set to 1. However, the contents of the history itself (excluding the full-bit) will continue to be shifted to the left, so that the leftmost bit will get shifted off the end.
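The marker-bit scheme can be illustrated with a minimal sketch. The register width and the class name `MarkedHistory` are assumptions made for illustration. The exact timing of the full-bit is also an assumption: this sketch sets it on the shift that pushes the marker out of the MSB position, so that once it is set every bit in the register is a real branch outcome.

```python
WIDTH = 8                      # assumed history register width
MASK = (1 << WIDTH) - 1

class MarkedHistory:
    """Branch history with a leading-1 marker bit and a full-bit (sketch)."""

    def __init__(self):
        self.reset()

    def reset(self):
        """After sampling: zero the history and set the LSB as the marker."""
        self.bits = 1
        self.full = False

    def shift_in(self, taken):
        """Shift one branch outcome (truthy means taken) in at the LSB end."""
        if not self.full and self.bits >> (WIDTH - 1):
            self.full = True   # marker is in the MSB and is about to shift off
        self.bits = ((self.bits << 1) | int(bool(taken))) & MASK

    def valid_outcomes(self):
        """Return the recorded outcomes, oldest first, excluding the marker."""
        if self.full:
            n = WIDTH
        else:
            n = self.bits.bit_length() - 1   # only the bits below the marker
        return [(self.bits >> i) & 1 for i in range(n - 1, -1, -1)]
```

If the history is sampled twice with no intervening branch, the second sample finds only the marker and yields no outcomes, so nothing is counted twice. Not-taken outcomes remain distinguishable from unused positions because they sit below the marker.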
The disadvantage of Conte et al's scheme is that the branch history information maintained by the branch prediction hardware is shared by all programs running on the processor, and is not part of the state of the profiled program. Thus, not only can different branches of the same program map to the same branch history entry, but branches in different programs can also map to the same branch history entry. Therefore, a bit in a branch history may correspond to the outcome of an arbitrary branch in any of the currently executing programs. Because Conte et al are only interested in estimating arc counts, this only decreases the accuracy of the count, but does not affect the integrity of their scheme. However, this branch history information cannot be used to reconstruct the actual sequence of branch instructions executed by the program at runtime. This is only possible if the branch history is kept as part of the executing program's state, and is saved and restored by the operating system during a context switch. Furthermore, the only way to determine frequent execution paths in the program with Conte et al's technique, is to do an analysis of the WCFG to locate the arcs with the highest weights and try to string them together to form traces. The high level of analysis required to process the WCFG makes it too expensive to apply at runtime while the program is executing.
Another disadvantage of Conte et al's profiling technique is that the program has to be essentially "instrumented" by enhancing it with the CFG structure. In addition, indirect branches (i.e., branches whose targets may be different for different executions of the branch) cannot be handled, requiring the compiler to convert indirect branches into a sequence of conditional direct branches in order to profile them. Both these problems make this scheme unusable on legacy program binaries (i.e., programs that cannot be recompiled).
The simplified microprocessor architecture 100 of FIG. 4 will now be used to illustrate the workings of a conditional direct branch. A program count register 130 is loaded with a program count value by the NEXT PC logic 120. The program count value is output onto an ADDRESS BUS which accesses memory in order to obtain an instruction. The instruction corresponding to the program count value is placed on a DATA BUS for loading into instruction register 140. The instruction is then decoded by instruction decode logic 150 for input to the timing and control logic 160 for the processor. In the event that the instruction is a branch command, a branch target address will also be loaded into data/address register 128.
The timing and control logic 160 generates the timing and control signals which drive the other functional blocks of the processor. For instance, the timing and control logic 160 will select the contents of one or more registers in register file 180 for output as operands to arithmetic logic unit (ALU) 170 for processing. The timing and control logic 160 will also drive the NEXT PC logic 120 to select the next program count value to load into program count register 130.
The timing and control logic 160 generates the timing and control signals responsive to the instruction decoded by instruction decode logic 150 and the state of condition flags N, Z and C generated by ALU 170. The nonzero flag N is set by ALU 170 when it detects a nonzero value in an accumulator of the ALU. Similarly, the zero flag Z is set by ALU 170 when it detects a zero value in the accumulator. The carry flag C is active when the operation performed by the ALU 170 results in a carry-out condition.
The timing and control logic 160 integrates the condition flags N, Z and C with the information from the instruction decode logic 150 in order to determine the state of the branch signal. For instance, a branch-on-zero-condition instruction would cause the timing and control logic 160 to generate an active branch signal if the Z flag is active. The branch signal would then cause the NEXT PC logic 120 to load the branch target address value from the data/address register 128 (which would have been loaded with the target address along with the loading of the branch command into the instruction register 140) into the program count register 130 so that execution flow proceeds to the target address. If the Z flag is not active, then the branch signal remains inactive, no branch operation is performed, and the NEXT PC logic 120 increments the program count value to obtain the next instruction in the execution sequence.
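The branch decision described above can be modeled in software as a short sketch. The opcode mnemonics `BZ`, `BNZ` and `BC`, the fixed instruction size, and the function names are hypothetical and serve only to illustrate how the branch signal selects between the target address and the incremented program count.

```python
INSTRUCTION_SIZE = 4  # assumed fixed instruction width in bytes

def branch_signal(opcode, flags):
    """Model of the timing and control logic: combine the opcode with N, Z and C."""
    conditions = {
        "BZ": flags["Z"],    # branch on zero
        "BNZ": flags["N"],   # branch on nonzero
        "BC": flags["C"],    # branch on carry
    }
    return conditions.get(opcode, False)  # non-branch opcodes never assert the signal

def next_pc(pc, opcode, target, flags):
    """Model of the NEXT PC logic: load the target if the branch signal is active."""
    if branch_signal(opcode, flags):
        return target
    return pc + INSTRUCTION_SIZE
```

With the Z flag active, `next_pc(0x100, "BZ", 0x200, flags)` yields 0x200. With the Z flag inactive, it yields 0x104, the next instruction in sequence.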
The processor architecture of FIG. 4 is one simplified example of a processor architecture. Other architectures exist which involve more complex NEXT PC functions, instruction decoding and branch conditions.