Message passing is an effective programming technique for exploiting coarse-grained concurrency on distributed computers, as evidenced by the popularity of the Message Passing Interface (MPI). Nowadays, most applications for tera-scale computing environments, such as the systems at the National Science Foundation Alliance Centers, the Department of Energy Accelerated Strategic Computing Initiative (ASCI) Labs, and the computational grid, rely on MPI for inter-nodal communication. Users often use MPI for intra-nodal communication as well, because many versions of MPI provide highly optimized communication primitives that use shared memory rather than networking protocols for message transfers.
The performance of these distributed applications can be challenging to comprehend because it stems from three factors: application design, software environment, and underlying hardware. Comprehension becomes even more difficult on computer systems with hundreds, if not thousands, of processors. One strategy for optimizing the performance of these applications is to eliminate their communication inefficiencies. Communication inefficiencies arise under many scenarios. Two common scenarios are when processors stall for long periods waiting to receive a message and when the application loses the opportunity to perform computation simultaneously with communication. Traditionally, programmers infer explanations for these types of inefficiencies manually, basing their conclusions on knowledge of the source code, message tracing, visualizations, and other customized performance measurements. This manual analysis can be time-consuming, error-prone, and complicated, especially if the developer must analyze a large number of messages or if the messages are non-deterministic.
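The first inefficiency described above, a processor stalled while waiting to receive a message, can be illustrated with a minimal Python sketch. The trace record layout (`MessageEvent` and its fields) and the helper names are assumptions made for illustration; they are not taken from MPI or from any tool cited in this section. The sketch charges a receiver with stall time only for the interval in which it had already entered its receive call but the sender had not yet started sending.

```python
from dataclasses import dataclass


@dataclass
class MessageEvent:
    """One point-to-point message in a trace (hypothetical record layout)."""
    sender: int
    receiver: int
    send_start: float   # time the sender entered its send call
    recv_posted: float  # time the receiver entered its receive call
    recv_done: float    # time the receive call returned


def receiver_stall(msg: MessageEvent) -> float:
    """Time the receiver was blocked before the sender began sending."""
    return max(0.0, msg.send_start - msg.recv_posted)


def stall_by_receiver(trace):
    """Sum the stall time per receiving process."""
    totals = {}
    for msg in trace:
        totals[msg.receiver] = totals.get(msg.receiver, 0.0) + receiver_stall(msg)
    return totals


# Illustrative, made-up timestamps (not measured data).
trace = [
    MessageEvent(0, 1, send_start=5.0, recv_posted=1.0, recv_done=5.5),
    MessageEvent(1, 0, send_start=6.0, recv_posted=7.0, recv_done=7.5),
    MessageEvent(0, 1, send_start=9.0, recv_posted=8.0, recv_done=9.5),
]
print(stall_by_receiver(trace))  # {1: 5.0, 0: 0.0}
```

Even this simple accounting requires matching every send to its receive, which is part of why the manual analysis becomes impractical as the number of messages grows.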
Only recently have researchers addressed the higher-level goal of automating the task of performance analysis itself. Many performance analysis tools provide some level of support to help users locate performance problems. For example, as discussed in a publication by Meira, Jr., W., LeBlanc, T. J., and Poulos, A. entitled “Waiting Time Analysis and Performance Visualization in Carnival,” Proc. ACM SIGMETRICS Symposium on Parallel and Distributed Tools, 1996, pp. 1-10, Carnival attempts to automate the cause-and-effect inference process for performance phenomena. Carnival supports waiting time analysis, an automatic inference procedure that explains each source of waiting time in terms of its underlying causes instead of only its location. A publication by Rajamony, R. and Cox, A. L. entitled “Performance Debugging Shared Memory Parallel Programs Using Run-Time Dependence Analysis,” Performance Evaluation Review (Proc. 1997 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 97), 25(1):75-87, 1997, discusses building a performance debugger that automatically identifies code transformations that reduce synchronization and communication. The performance debugger suggests predefined code transformations to the user in terms of application source code.
A publication by Miller, B. P., Callaghan, M. D., Cargille, J. M., Hollingsworth, J. K., Irvin, R. B., Karavanic, K. L., Kunchithapadam, K., and Newhall, T., “The Paradyn Parallel Performance Measurement Tool,” IEEE Computer, 28(11): pp. 37-46, 1995, discusses a performance tool that provides an automated performance consultant, which uses the W3 Search Model to test and verify hypotheses about a target application. The W3 Search Model abstracts those characteristics of a parallel program that can affect performance. Using Paradyn's dynamic instrumentation, it tries to answer three questions about these performance hypotheses: why is the application performing poorly, where is the performance problem, and when does the problem occur? Tmon, as discussed by Ji, M., Felten, E. W., and Li, K., “Performance Measurements for Multithreaded Programs,” Proc. SIGMETRICS 98, 1998, pp. 161-170, is a performance tool for monitoring and tuning multithreaded programs. Tmon measures thread waiting time and constructs graphs to show thread dependencies and bottlenecks. It also identifies “semi-busy-waiting” points, where CPU cycles are wasted on condition checking and context switching. Quartz, as discussed in Anderson, T. E., and Lazowska, E. D., “Quartz: A Tool For Tuning Parallel Program Performance,” Proc. 1990 SIGMETRICS Conf. Measurement and Modeling Computer Systems, 1990, pp. 115-125, extends the ideas of UNIX gprof to shared memory multiprocessors and provides a new metric named “normalized processor time.” This metric directs the programmer's attention to those sections of source code that are responsible for poor parallel performance by revealing poor concurrency based on the number of idle processors.
Cray Computer™ provides two tools that help locate performance problems with applications: ATExpert™ and MPP Apprentice™. ATExpert™ analyzes application source code and predicts autotasking performance on a dedicated Cray PVP system. MPP Apprentice™ helps developers locate performance bottlenecks on the Cray T3E. MPP Apprentice™ reports, for example, time statistics summed across all compute nodes for the whole program, as well as for each loop, conditional, or other statement block. It provides the total execution time, the synchronization and communication time, the time to execute a subroutine, and the number of instructions executed. It can also offer advice on how to eliminate the performance bottlenecks. Because both MPP Apprentice™ and ATExpert™ are commercial tools, it is not possible to evaluate their underlying techniques for automation.
A variety of researchers have applied statistical techniques to performance data in an effort to reduce data volume or to automate tasks for the user. These techniques include covariance analysis, discriminant analysis, principal component analysis, and clustering. For instance, clustering, as described in Johnson, R. A., “Applied Multivariate Statistical Analysis,” Englewood Cliffs, N.J., USA: Prentice-Hall, 1982, is a well-known data analysis technique that categorizes a raw dataset in the hope of simplifying the analysis task. The primary goal of clustering is to segregate data points into clusters such that points in the same cluster are more similar to one another than to points in other clusters. As discussed in Reed, D. A., Aydt, R. A., Noe, R. J., Shields, K. A., and Schwartz, B. W., “An Overview of the Pablo Performance Analysis Environment,” Department of Computer Science, University of Illinois, 1304 West Springfield Avenue, Urbana, Ill. 61801, 1992, and Reed, D. A., Nickolayev, O. Y., and Roth, P. C., “Real-Time Statistical Clustering For Event Trace Reduction,” J. Supercomputer Applications and High-Performance Computing, 11(2): 144-59, 1997, one use of clustering for performance analysis identifies clusters of processors with similar performance metrics and then selects only one processor from each cluster from which to gather detailed performance information. As a result, clustering can reduce data volume and instrumentation perturbation. One drawback of using statistical techniques to analyze performance data is the difficulty of mapping the results of an analysis back to the source code. More importantly, statistical techniques can, on occasion, provide correct but enigmatic results that are difficult for programmers to use in improving their application's performance.
In the knowledge discovery field, Lee and colleagues, as discussed in Lee, W., Stolfo, S. J., and Mok, K. W., “Mining In A Data-Flow Environment: Experience In Network Intrusion Detection,” Proc. Fifth ACM SIGKDD International Conference Knowledge Discovery and Data Mining, 1999, pp. 114-124, have applied machine learning techniques to traces of Internet network activity to provide automated support for intrusion detection in computer networks. Their system learns normal network activity by modeling users' habits and locations as well as other activity, such as nightly backups. Their system then takes appropriate actions when activity appears suspicious.
To assist with the analysis of large trace files, numerous researchers have investigated visualization techniques, including Heath, M. T., Malony, A. D., and Rover, D. T., “The Visual Display of Parallel Performance Data,” Computer 28(11): 21-28; Stasko, J., Domingue, J., et al., “Software Visualization: Programming as a Multimedia Experience,” MIT Press: Cambridge, Mass., 1998; and Bailey, D., Barszcz, E., Barton, J., Browning, D., Carter, R., Dagum, L., Fatoohi, R., Fineberg, S., Frederickson, P., Lasinski, T., Schreiber, R., Simon, H., Venkatakrishnan, V., and Weeratunga, S., “The NAS Parallel Benchmarks (94),” NASA Ames Research Center, RNR Technical Report RNR-94-007, 1994. A host of commercial tools also provide advanced visualization capabilities for understanding communication behavior. These tools include Vampir, IBM's VT, ANL's Upshot and Jumpshot, and ORNL's Paragraph.
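The clustering-based data reduction described earlier in this section (group processors with similar performance metrics, then gather detailed data from one processor per cluster) can be sketched in Python as follows. The cited works' actual clustering algorithms are not described here, so this sketch substitutes a simple greedy "leader" clustering with a fixed distance threshold. The function names, the two-element metric vectors, and the threshold value are all assumptions chosen for illustration.

```python
import math


def cluster_processors(metrics, threshold):
    """Greedy leader clustering (a stand-in, not the cited tools' method).

    Each processor joins the first cluster whose leader lies within
    `threshold` (Euclidean distance); otherwise it starts a new cluster.
    Returns a list of (leader_rank, member_ranks) pairs.
    """
    clusters = []
    for rank, vec in sorted(metrics.items()):
        for leader, members in clusters:
            if math.dist(vec, metrics[leader]) <= threshold:
                members.append(rank)
                break
        else:
            clusters.append((rank, [rank]))
    return clusters


def representatives(clusters):
    """Pick one processor per cluster for detailed instrumentation."""
    return [leader for leader, _ in clusters]


# Made-up per-processor metrics: (fraction of time waiting, fraction computing).
metrics = {
    0: (0.10, 0.80),
    1: (0.12, 0.79),
    2: (0.55, 0.30),
    3: (0.11, 0.82),
    4: (0.57, 0.33),
}
clusters = cluster_processors(metrics, threshold=0.1)
print(clusters)                   # [(0, [0, 1, 3]), (2, [2, 4])]
print(representatives(clusters))  # [0, 2]
```

In this toy input, five processors reduce to two instrumented representatives, which illustrates how the approach lowers data volume. It also hints at the drawback noted above: the cluster labels say nothing about which source code caused the difference.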