The performance of software systems must be tuned and adjusted in order to deliver a high-performing computing system in a dynamic computing environment. The trend toward larger numbers of cores significantly increases the complexity of performance tuning by introducing numerous new concurrency-related performance challenges, such as utilization, synchronization, and sharing issues. Performance profiling is tedious, requiring repeated manual cycles of running, measuring, and analyzing.
A prerequisite to performance tuning is the collection of performance profile data. Complex software stacks require performance data from multiple layers of the stack (i.e., hardware, operating system (OS), middleware, and application). It is generally not known prior to execution which performance events are relevant for a particular application, for instance, which events will cause a later observed performance issue such as a bottleneck.
The process of data collection has both performance and productivity challenges. Collecting all possible performance events from all layers exhaustively is prohibitive in terms of both collection overhead (time) and performance trace volume (space). Productivity challenges arise because, for example, presenting the programmer with an excessively large volume of performance data is overwhelming and further complicates the already difficult task of performance tuning.
Accordingly, it is desirable to have a solution to make data collection both more efficient, for example, by reducing runtime collection overhead, and more productive, for example, by reducing information volume to be presented to the programmer, without the loss of relevant performance bottleneck information.
While system software benchmarking platforms exist that automatically run repetitive tasks until results stabilize and automatically highlight outlying results, eliminating manual intervention in the data collection process, those platforms are still based on exhaustive data collection.
U.S. Pat. No. 5,892,947 provides a test support tool system and method; however, that patent requires the system design as input. U.S. Pat. No. 7,103,877 only provides selective instrumentation. U.S. Pat. No. 6,971,091 generates compilation plans for optimization. U.S. Pat. No. 6,374,369 discloses using finite state machines for analyzing software performance. These patents do not guide the entire profiling process.
Most modern processors provide hardware registers to record certain performance events, such as instructions retired/completed, cycles consumed, L1 and L2 cache misses, translation lookaside buffer (TLB) misses, and bus or memory requests. The number of registers provided for recording such events is usually limited; for example, on the Pentium 4™, only 2 registers are provided, which means only two metrics can be collected simultaneously. The challenge is to make the best use of the existing counters while obtaining the most comprehensive performance view. Existing performance profiling technology addresses this problem by time-interpolation, which collects different metrics at different points in time. Time-interpolation is usually implemented either by multiplexing the available hardware performance counters or by collecting performance events across subsequent executions. The set of possible performance events is usually huge, so with fixed time-interpolation across all events, one may not see enough detail on some event classes, or one may be provided with too much useless information on others.
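The multiplexing form of time-interpolation described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the mechanism of any particular processor or profiling tool: the function name `multiplex_counters`, the event names, and the per-slice counts are all hypothetical, and the scaling rule (observed count multiplied by total slices over observed slices) is one common way to extrapolate, assumed here for illustration.

```python
def multiplex_counters(event_stream, events, num_counters):
    """Simulate fixed round-robin multiplexing of hardware counters.

    event_stream: list of dicts, one per time slice, mapping each event
                  name to the true count in that slice (hypothetical data).
    events:       all event names we would like to measure.
    num_counters: number of events that can be counted simultaneously.

    Returns an estimated total per event, extrapolated from the slices
    in which that event was actually being counted.
    """
    # Partition the events into groups that fit the available counters.
    groups = [events[i:i + num_counters]
              for i in range(0, len(events), num_counters)]

    observed = {e: 0 for e in events}     # counts seen while active
    slices_seen = {e: 0 for e in events}  # slices in which e was active

    for t, slice_counts in enumerate(event_stream):
        active = groups[t % len(groups)]  # fixed rotation across groups
        for e in active:
            observed[e] += slice_counts[e]
            slices_seen[e] += 1

    total_slices = len(event_stream)
    # Scale each observed count by the fraction of time it was measured.
    return {
        e: (observed[e] * total_slices / slices_seen[e]
            if slices_seen[e] else None)
        for e in events
    }


# Hypothetical events and workloads, with only two counters available.
EVENTS = ["cycles", "instructions", "l1_miss", "tlb_miss"]

steady = [{"cycles": 100, "instructions": 80, "l1_miss": 5, "tlb_miss": 1}
          for _ in range(4)]

# A burst of L1 misses confined to the first time slice.
bursty = [dict(s) for s in steady]
for s in bursty:
    s["l1_miss"] = 0
bursty[0]["l1_miss"] = 50

steady_estimates = multiplex_counters(steady, EVENTS, num_counters=2)
bursty_estimates = multiplex_counters(bursty, EVENTS, num_counters=2)
```

With the steady workload, the extrapolated totals match the true totals. With the bursty workload, the L1 miss burst falls in a slice where the cache-miss group is not being counted, so the estimate for `l1_miss` is zero even though 50 misses occurred. This is the loss of detail that a fixed rotation across all events can cause.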
The problem remains that enabling too many instrumentation points at the same time induces unacceptable levels of perturbation in the target system and generates unmanageable data volumes, while enabling too few instrumentation points fails to reveal useful information.
It would be desirable to circumvent the limitation on the number of hardware performance counters, control the volume of information collected, and provide a more efficient and intelligent profiling process, for example, one that automatically navigates through the massive set of data streams or volumes of data and leads problem solvers to the real bottleneck or problem in the system.