1. Field of the Invention
The invention relates to computing systems and, more particularly, to performance monitoring and profiling of software applications.
2. Description of the Relevant Art
Modern processors typically include performance monitoring logic (PML) to measure processor performance while running application code and to help identify performance bottlenecks. Two features commonly found in PML are (1) the ability to count certain processor events, such as cache misses, branch mispredictions, etc., and (2) the ability to cause a trap when a counter reaches a particular value (such as overflowing from all ones to all zeroes). Diagnostic software typically configures these counters to measure processor performance over particular intervals. For example, one simple measurement may be a count of the number of instructions executed. By also counting the number of cycles over which a given number of instructions were executed, the instructions per cycle (IPC) over an interval can be derived. If the performance over a given interval drops unexpectedly, then the given interval is selected for more detailed analysis. By configuring performance monitor counters to measure other events during a rerun of the given interval, the reason for the low performance may be identified.
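By way of illustration only, the IPC derivation described above can be sketched in a few lines of Python. The names `IntervalCounts`, `ipc`, and `low_ipc_intervals`, and the threshold parameter, are hypothetical and chosen for this sketch. They do not correspond to any particular processor or diagnostic tool.

```python
from dataclasses import dataclass


@dataclass
class IntervalCounts:
    """Counter readings for one measurement interval (hypothetical layout)."""
    instructions: int  # instructions executed during the interval
    cycles: int        # processor cycles elapsed during the interval


def ipc(sample: IntervalCounts) -> float:
    """Derive instructions per cycle from the two counts."""
    if sample.cycles == 0:
        raise ValueError("interval has no cycles")
    return sample.instructions / sample.cycles


def low_ipc_intervals(samples, threshold):
    """Return the indices of intervals whose IPC falls below the threshold.

    The threshold is an assumption of this sketch; the text says only that
    performance "drops unexpectedly" in the selected interval.
    """
    return [i for i, s in enumerate(samples) if ipc(s) < threshold]
```

In such a sketch, the intervals returned by `low_ipc_intervals` would be the candidates for a rerun with the counters configured to measure other events.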
For example, if the IPC is found to be too low, the counters may be configured to count branch mispredictions, cache misses, or TLB misses which commonly result in low IPC. If the number of cache misses is suspiciously high, then that may point towards the cause of the reduced performance. In this manner diagnostic software can narrow the possible causes of performance issues. Once a primary cause has been identified, more in-depth analysis can be performed.
Consider the case of an unexpectedly high data cache miss rate. The next step is generally to find which processes, programs, subroutines or functions, and program statements are responsible for the majority of the cache misses. In order to further isolate the cause, periodic sampling may be used. To perform such sampling, software configures a counter to count data cache misses and a trapping mechanism to “trap” (e.g., generate a software exception) when the counter overflows. Software then sets the counter to a predetermined value based upon how frequently it wants the counter to overflow, and thus the “sample” to be taken. For example, if software wants to sample once every 3000 cache misses, then the counter would be programmed with −3000. When the 3000th data cache miss occurs, hardware would direct a trap to a software trap handler, which would then capture the PC (program counter) of the instruction that caused the counter to wrap (e.g., caused the 3000th data cache miss). Software can record the PC, reload the counter with −3000, and return to the running program. Diagnostic software can then identify which program statement was executing when the counter wrapped. By choosing an appropriate sample interval, software can build a “miss profile” which isolates the performance issue. For example, it may be that a load of a particular array element in a loop is responsible for most of the misses. Then the program may be recoded such that the access pattern is different, or the data is prefetched, for example by judiciously inserting data prefetch instructions.
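The sampling loop described above may be modeled in software as follows. This is a minimal simulation under stated assumptions, not a description of any actual hardware: the counter is assumed to be 32 bits wide, the stream of miss-causing PCs is supplied as a plain list, and the names `preload_value` and `sample_miss_pcs` are hypothetical.

```python
from collections import Counter

COUNTER_BITS = 32                      # assumed counter width
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def preload_value(interval: int) -> int:
    """Two's-complement encoding of -interval in a COUNTER_BITS-wide counter."""
    return (-interval) & COUNTER_MASK


def sample_miss_pcs(miss_pcs, interval):
    """Build a miss profile by sampling every `interval`-th miss.

    miss_pcs: PCs of the instructions that caused data cache misses, in order.
    Returns a Counter mapping sampled PC -> number of samples taken there.
    """
    profile = Counter()
    counter = preload_value(interval)
    for pc in miss_pcs:
        counter = (counter + 1) & COUNTER_MASK   # hardware counts the miss
        if counter == 0:                         # wrapped: trap is taken
            profile[pc] += 1                     # handler records the PC
            counter = preload_value(interval)    # handler reloads the counter
    return profile
```

With an interval of 3000, the counter is preloaded with 2^32 − 3000, wraps to zero on the 3000th miss, and the model records the PC of that miss before reloading.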
In this mode of operation, then, there are two important properties. First, the instruction which was executing when the trap was taken should be as closely related as possible to the instruction which caused the performance event that caused the counter to wrap (ideally it would be the same instruction). The further the instruction executing at the time of the trap is from the event-causing instruction, the more difficult it is to associate the performance-related event with a program statement or other information identifying the instruction that caused it. Second, event counting should be reasonably accurate (ideally it would be perfectly accurate). In other words, if N cache misses occurred, the counter would register N. In particular, the counter should not be “polluted” by events that did not occur, nor should it overcount or undercount events which did occur. Both properties involve trade-offs among implementation difficulty, chip area, and power. It is difficult to build PML which satisfies both properties, uses a small amount of area, is simple to implement, and is simple to verify.
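The first property can be illustrated with a small model in which the trap is taken some number of instructions after the event-causing instruction. The delay parameter, called `skid` here, the trace format, and the function name `captured_pcs` are all assumptions made for this sketch only.

```python
def captured_pcs(trace, interval, skid):
    """Return the PCs a trap handler would capture.

    trace:    list of (pc, caused_event) tuples in execution order.
    interval: a trap is taken on every `interval`-th event.
    skid:     hypothetical number of instructions that execute between the
              event-causing instruction and the trap (0 is the ideal case).
    """
    captured = []
    events = 0
    for i, (pc, caused_event) in enumerate(trace):
        if caused_event:
            events += 1
            if events % interval == 0:
                # The handler sees whichever instruction is executing when
                # the trap is finally taken, clamped to the end of the trace.
                j = min(i + skid, len(trace) - 1)
                captured.append(trace[j][0])
    return captured
```

With a delay of zero the captured PC is that of the event-causing instruction. With a nonzero delay the model attributes the event to a later, unrelated instruction, which is the difficulty described above.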
Accordingly, an effective method and mechanism for precisely determining identifiers of instructions causing performance-related events is desired.