The present invention relates to computer system performance profiling, and more specifically, to basic block profiling based on sampling using grouping events.
Feedback-directed optimization (FDO) has proven useful in improving performance of computer application execution when FDO is incorporated into code optimization tools such as an optimizing compiler or binary level optimizer. A profiler is typically run either in an execution environment that applies representative input to exercise an application under conditions expected in real-world use, or at runtime while the application is running at a user site. The profiler can collect information such as basic block execution frequency or branch taken/not taken execution frequency, where a basic block is defined as a portion of code with only one entry point and only one exit point. The data collected from profiling (i.e., feedback information) can be used as training data that allows a code optimization tool applying FDO to make better optimization decisions.
Some optimizing compilers that apply FDO use instrumentation to collect feedback information. However, this approach has significant overhead. Another approach to collect feedback information is to use hardware event sampling, which has lower overhead as compared to adding instrumentation to the application.
A common way to estimate a basic block profile is to sample a hardware counter, e.g., using a performance monitoring unit (PMU), that increments each time an instruction retires/completes. Each time the counter overflows upon reaching a predefined threshold, the instruction address is sampled by reading a program counter. Instruction retire samples are not equally distributed within each basic block, because when multiple instructions are retired/completed together as a group, only one instruction that represents the group (for example, the first instruction in the group) is sampled.
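The uneven distribution of retire samples can be illustrated with a minimal simulation. The sketch below is a simplified model, not a description of any particular PMU: the function name `sample_retire_events`, the fixed grouping, and the rule that an overflow is attributed to the first instruction of the retiring group are all assumptions made for illustration.

```python
from collections import Counter


def sample_retire_events(groups, iterations, threshold):
    """Simulate overflow-based sampling of an instruction-retire counter.

    groups:     list of lists of instruction addresses; each inner list is
                assumed to retire together as one group (hypothetical model).
    iterations: number of times the basic block is executed.
    threshold:  counter value at which an overflow triggers a sample.
    """
    samples = Counter()
    counter = 0
    for _ in range(iterations):
        for group in groups:
            # Assumption: the counter advances by the number of
            # instructions in the group when the group retires.
            counter += len(group)
            if counter >= threshold:
                counter -= threshold
                # Assumption: the sample is attributed to the first
                # instruction of the group, not the one that overflowed.
                samples[group[0]] += 1
    return samples


if __name__ == "__main__":
    # One basic block of five instructions retiring as two groups.
    block_groups = [[0x00, 0x04, 0x08], [0x0C, 0x10]]
    result = sample_retire_events(block_groups, iterations=1000, threshold=7)
    for addr in (0x00, 0x04, 0x08, 0x0C, 0x10):
        print(hex(addr), result[addr])
```

In this model only the group leaders ever receive samples, even though every instruction in the block executes equally often.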
To address this issue, several prior art solutions calculate an estimated average sample count in the basic block. The sample counts of all observed instructions in the basic block are typically summed and normalized by the total number of instructions in the basic block. This approach can be useful in estimating how frequently a particular instruction within the basic block is executed. However, accuracy of the estimated execution frequency is reduced in processors that group instructions dynamically at run-time, as the distribution of group assignments and group sizing within a basic block can vary over time when the basic block is executed for multiple iterations.
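The averaging approach described above can be sketched as follows. This is an illustrative sketch only; the function name `estimate_block_sample_count` and the data layout (a mapping from instruction address to observed sample count) are assumptions, not part of any specific prior art tool.

```python
def estimate_block_sample_count(samples, block_addresses):
    """Estimate the average per-instruction sample count of a basic block.

    samples:         mapping of instruction address -> observed sample count
                     (addresses that were never sampled may be absent).
    block_addresses: addresses of all instructions in the basic block.
    """
    if not block_addresses:
        raise ValueError("basic block must contain at least one instruction")
    # Sum the samples of every observed instruction in the block ...
    total = sum(samples.get(addr, 0) for addr in block_addresses)
    # ... and normalize by the total number of instructions in the block.
    return total / len(block_addresses)


if __name__ == "__main__":
    # Hypothetical data: only two of five instructions were ever sampled.
    observed = {0x00: 9, 0x0C: 6}
    block = [0x00, 0x04, 0x08, 0x0C, 0x10]
    print(estimate_block_sample_count(observed, block))
```

Because the estimate divides by a fixed instruction count, it implicitly assumes that samples are spread over the block in a stable way, which is the assumption that dynamic grouping violates.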