Due to the complex nature of supercomputer architectures, tremendous effort must be expended to tune and optimize an algorithm or program for a target platform. Performance analysis and optimization are crucial to fully utilizing these high performance computing (HPC) systems, especially considering that modern HPC software generally comprises millions of lines of code. With the processing speed of these systems now measured in teraflops (one trillion floating point operations per second), it is essential to identify performance bottlenecks quickly and accurately when deploying such an application. Without intelligent tools, it is virtually impossible to tune an application within a reasonable timeframe when the target architecture is a massively parallel supercomputer such as the Blue Gene/L, jointly designed by IBM® and the National Nuclear Security Administration's Lawrence Livermore National Laboratory, with more than 65,000 processors. Profiling is the most commonly used and effective approach for performance debugging of scientific codes.
One standard approach to profiling is the GNU profiler, or “gprof,” as described in “gprof: a call-graph execution profiler” by S. L. Graham et al., and its many variations. However, gprof has several limitations. First, it cannot automatically refine bottleneck regions with more detailed and diversified metrics beyond a time metric in order to reveal the cause of a bottleneck. Second, the profiling provides little or no differentiation as to where tuning effort should be focused, so uniform effort is usually spent across the entire range rather than zeroing in on an area of interest. Third, it lacks interaction with expert users. Last, but not least, gprof usually requires access to the source codes or debugging information for collecting performance metrics. This is often impossible, especially when the sources are proprietary, and re-compilation may take a prohibitively long time.
After the introduction of gprof, a few other profiling tools emerged, for example, tprof, HPROF, and jprof. Typically these tools operate by providing, after one run, a profile of the events that occurred. There is no further refinement customized for code regions that have the potential for large performance gains after tuning. Furthermore, the sampling is uniform across the entire address space being studied, and the call chain is established only for immediate parents and children. The biggest drawback of these tools is that they treat all code regions in the same way: the granularity of profiling is often either too fine, wasting time on uninteresting regions, or too coarse to be useful at all.
Liblit et al. (see B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan, “Bug Isolation Via Remote Program Sampling,” in ACM SIGPLAN PLDI 2003) presented a sampling framework for evaluating predicates in an instrumented code. The sampling is done uniformly at random: every predicate has a probability p of being sampled. With a naive implementation, a coin toss is needed at each predicate to decide whether it is sampled in this execution, and that coin tossing significantly increases execution overhead. Liblit et al. therefore proposed a counting-down technique to reduce the cost of coin tossing.
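The counting-down idea can be illustrated with a minimal sketch (class and method names here are illustrative, not from the cited paper): rather than tossing a Bernoulli(p) coin at every predicate, draw a geometrically distributed countdown once and merely decrement a counter at each predicate site; a site is sampled exactly when the countdown reaches zero, which reproduces the coin-toss distribution while replacing most random draws with a cheap decrement.

```python
import math
import random

class CountdownSampler:
    """Sketch of countdown-based sampling: the number of trials until
    the first success of a Bernoulli(p) process is geometrically
    distributed, so one random draw replaces a run of coin tosses."""

    def __init__(self, p, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.countdown = self._next_countdown()

    def _next_countdown(self):
        # Inverse-CDF draw from a geometric distribution on {1, 2, ...}.
        u = 1.0 - self.rng.random()  # u in (0, 1]
        return int(math.log(u) / math.log(1.0 - self.p)) + 1

    def should_sample(self):
        # Called once per predicate execution: usually just a decrement.
        self.countdown -= 1
        if self.countdown == 0:
            self.countdown = self._next_countdown()
            return True
        return False

sampler = CountdownSampler(p=0.1, seed=42)
hits = sum(sampler.should_sample() for _ in range(100000))
```

Over 100,000 predicate executions with p = 0.1, roughly one in ten executions is sampled, but only about one random-number draw per ten executions is performed.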
At a higher level, Liblit's research is mainly concerned with bug detection and isolation. Liblit's approach relies on the fact that there are typically many instances of the software running; sampling can be sparse at any single instance, yet the aggregate samples from a large population are sufficient.
DePauw et al., in “Drive-By Analysis of Running Programs,” 23rd International Conference on Software Engineering, ICSE 2001, proposed tracing only details associated with a particular task to reduce the amount of tracing data. The “drive-by” analysis uses directed burst tracing, where a burst is a set of trace execution information gathered during an interval of time and associated with a specific task in the program. The analyzed programs must exhibit repetitive behavior, because the solution relies on a direct-request-analyze-direct cycle, and the tool user has to direct the analysis to interesting regions.
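A minimal sketch of the general burst idea (not DePauw et al.'s task-directed variant; all names are illustrative): rather than recording every event, full trace detail is gathered only during short bursts of a fixed number of events, and events falling between bursts are discarded.

```python
class BurstTracer:
    """Sketch of burst tracing: record `burst_len` consecutive events,
    then skip `gap_len` events, and repeat. Each recorded window is
    kept as a separate burst."""

    def __init__(self, burst_len, gap_len):
        self.burst_len = burst_len
        self.gap_len = gap_len
        self.position = 0       # total events seen so far
        self.bursts = []        # list of recorded bursts
        self.current = None

    def on_event(self, event):
        period = self.burst_len + self.gap_len
        if self.position % period == 0:
            # Start of a new burst window.
            self.current = []
            self.bursts.append(self.current)
        if self.position % period < self.burst_len:
            self.current.append(event)
        self.position += 1

tracer = BurstTracer(burst_len=3, gap_len=7)
for i in range(20):
    tracer.on_event(i)
# tracer.bursts now holds two bursts: events 0-2 and events 10-12.
```

The trade-off is the one the paper exploits: for repetitive programs, a few bursts capture representative behavior at a small fraction of the full tracing cost.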
Arnold et al., in “On-line Profiling and Feedback Directed Optimization of Java,” Rutgers University, 2002, use sampling to reduce the cost of instrumentation, so that the high-overhead instrumented code runs only a few times. For a method F, two versions exist: one instrumented (called the duplicate) and one original. The duplicate takes a long time to complete, so it is desirable to limit its use. With Arnold et al.'s method, at regular sample intervals, execution moves into the duplicate in a fine-grained, controlled manner.
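The duplicated-code scheme can be sketched as follows (a simplified illustration under assumed names, not the authors' Java implementation): each invocation runs a cheap check against a global countdown; most invocations take the original, uninstrumented path, and only every Nth invocation is diverted into the instrumented duplicate.

```python
class SamplingProfiler:
    """Sketch of sampled instrumentation: the "checking" path costs one
    decrement-and-test per call; the instrumented "duplicate" path,
    which records profile data, runs once per sample_interval calls."""

    def __init__(self, sample_interval):
        self.sample_interval = sample_interval
        self.counter = sample_interval
        self.counts = {}  # method name -> sampled invocation count

    def _instrumented(self, name, body, *args):
        # Duplicate version: record profile data, then run the body.
        self.counts[name] = self.counts.get(name, 0) + 1
        return body(*args)

    def call(self, name, body, *args):
        # Checking version: cheap test at method entry.
        self.counter -= 1
        if self.counter == 0:
            self.counter = self.sample_interval
            return self._instrumented(name, body, *args)
        return body(*args)

profiler = SamplingProfiler(sample_interval=100)
for _ in range(1000):
    profiler.call("F", lambda: None)
# profiler.counts["F"] records 10 sampled invocations out of 1000.
```

The design point is that the per-call overhead on the common path is a single counter update, while the expensive instrumented version still sees a statistically representative subset of invocations.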
Sampling also appears in Arnold and Sweeney's work (see “Approximating the Calling Context Tree Via Sampling,” Technical report, IBM Research, 2000). They propose using runtime call stack sampling to construct an approximate calling context tree.
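The general mechanism behind such an approximation can be sketched as follows (an illustrative sketch, not Arnold and Sweeney's implementation): each periodic sample captures the current call stack, and the stack is merged into a tree in which each path from the root reproduces one observed calling context, with a sample count credited to the innermost frame.

```python
class CCTNode:
    """One node of an approximate calling context tree built from
    periodic call-stack samples."""

    def __init__(self, name):
        self.name = name
        self.samples = 0   # samples whose innermost frame is this node
        self.children = {} # callee name -> CCTNode

def record_stack_sample(root, stack):
    # Walk the sampled stack from the outermost caller inward,
    # creating tree nodes as needed, then credit one sample to the
    # innermost frame.
    node = root
    for frame in stack:
        node = node.children.setdefault(frame, CCTNode(frame))
    node.samples += 1

root = CCTNode("<root>")
record_stack_sample(root, ["main", "solve", "kernel"])
record_stack_sample(root, ["main", "solve", "kernel"])
record_stack_sample(root, ["main", "io"])
```

Because only periodically sampled stacks are recorded, the tree is an approximation of the full calling context tree, but it preserves entire caller chains rather than only immediate parent-child pairs.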
While many solutions have been proposed for optimizing the performance of high performance computing systems, none adequately addresses the following concerns: expending uniform effort across a range rather than narrowing in on bottleneck regions with more detailed and diversified metrics; requiring access to the source codes or debugging information in order to collect performance metrics; limiting the call chain to immediate parents and children; and failing to differentiate among code regions.