Applications run on today's massively parallel supercomputers are often guided toward scalable performance on thousands of processors with the help of performance analysis tools. However, conventional tools for parallel performance analysis have serious problems due to the large volume of data they must handle. For example, tracing tools like Paraver collect a sequence of time-stamped events in a program. For MPI tracing, Paraver intercepts MPI calls and saves individual trace files during application execution. The individual files are then merged, and the merged trace file is displayed using a viewer, which has many display and analysis features. Due to the cost of collecting, storing, and transferring the performance data, Paraver is best suited for parallel applications at a modest scale rather than for large systems such as IBM's Blue Gene/L systems. This basic difficulty affects all tracing tools because at large processor counts the trace files become large and difficult to work with.
On the other hand, profiling tools like mpiP collect only timing summaries. mpiP collects cumulative information about MPI functions. Because it collects only cumulative information, its output is very small compared to that of MPI tracing tools, and the execution time overhead is normally small. However, the detailed time history of communication events is not available with this tool. As the above examples show, current tools either collect all the performance information, resulting in a large overhead, or collect summarized information that lacks the detail needed for analysis.
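The trade-off between the two approaches can be illustrated with a minimal sketch. It does not reflect how Paraver or mpiP are implemented; the class names `EventTracer` and `SummaryProfiler`, the record layout, and the synthetic event stream are assumptions made purely for illustration.

```python
from collections import defaultdict


class EventTracer:
    """Tracing style: keep one time-stamped record per event (hypothetical layout)."""

    def __init__(self):
        self.events = []

    def record(self, timestamp, function, seconds, nbytes):
        self.events.append((timestamp, function, seconds, nbytes))


class SummaryProfiler:
    """Profiling style: keep only cumulative totals per function name."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"calls": 0, "seconds": 0.0, "bytes": 0})

    def record(self, timestamp, function, seconds, nbytes):
        # The timestamp is discarded, so the time history cannot be recovered.
        entry = self.totals[function]
        entry["calls"] += 1
        entry["seconds"] += seconds
        entry["bytes"] += nbytes


def simulate(num_calls):
    """Feed the same synthetic event stream to both collectors."""
    tracer, profiler = EventTracer(), SummaryProfiler()
    for i in range(num_calls):
        function = "MPI_Send" if i % 2 == 0 else "MPI_Recv"
        for collector in (tracer, profiler):
            collector.record(timestamp=i * 0.001, function=function,
                             seconds=0.0005, nbytes=1024)
    return tracer, profiler


tracer, profiler = simulate(10_000)
print(len(tracer.events))    # one record per call
print(len(profiler.totals))  # one entry per distinct function
```

In this sketch the tracer's storage grows with the number of calls, while the profiler's storage depends only on the number of distinct functions, at the price of losing the order and timing of individual events.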
Conventional performance analysis tools usually record performance information that is pre-defined by the tool developers. The performance analysis tools usually collect all possible primitive performance metrics such as time and bytes. In addition to that, the tool developer also has to “imagine” derived performance metrics (e.g., bandwidth: bytes/second) based on the measured primitive performance metrics. This mechanism works well for small-scale computations with a limited number of processors. However, the lack of flexibility makes scalability a critical issue as the computation scale (e.g., number of processors) goes up.
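As a small illustration of a derived metric, the sketch below computes bandwidth from the two primitive metrics named above. The function name `bandwidth` and the sample values are hypothetical and are not taken from any particular tool.

```python
def bandwidth(nbytes, seconds):
    """Derived metric: bytes per second, computed from two primitive metrics."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return nbytes / seconds


# Primitive metrics as a tool might record them for one message (made-up values).
primitive = {"bytes": 4_000_000, "seconds": 0.5}
derived = bandwidth(primitive["bytes"], primitive["seconds"])
print(derived)
```

A tool with a fixed metric set offers only the derived metrics its developer anticipated; any other combination of the primitive metrics is unavailable to the user.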
The overhead of the information collection activities in a performance analysis tool can dominate the system resource usage (e.g., memory used for tracing). The transfer (e.g., over the network) and the storage of the collected performance information pose another challenge. These overheads can make the performance analysis tools impractical (e.g., too much memory space required for detailed tracing). Even if the system has sufficient resources to handle these issues, the extraordinarily large amount of performance information collected will be difficult to analyze.
The workflow of a typical existing MPI tracing tool is generalized in FIG. 1. In this approach, each invocation of an MPI function by the application (1001) is replaced with a call to a routine in the MPI performance tool library, which intercepts the MPI function call and does some bookkeeping in a “trace buffer” that resides in memory. The tracing tool then uses the PMPI interface to call the actual MPI library (1003). When the application execution finishes, the tracing library outputs (1002) performance/tracing data that satisfy certain pre-defined metrics (1005 and 1006). Tools following this framework lack flexibility and often scale poorly when they collect an excessive amount of tracing data.
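The interception pattern in this workflow can be sketched as follows. In a real tool the wrapper would typically be a C function carrying the name of the MPI routine and forwarding to its `PMPI_` counterpart; plain Python stand-ins are used here so that the sketch runs without an MPI installation. The names `real_send`, `traced_send`, `finalize`, and `trace_buffer` are hypothetical.

```python
import time

trace_buffer = []  # in-memory bookkeeping, playing the role of the trace buffer


def real_send(payload, dest):
    """Stand-in for the actual MPI library call (the role PMPI_Send plays in C)."""
    return len(payload)


def traced_send(payload, dest):
    """Wrapper the application calls instead of the real routine."""
    start = time.perf_counter()
    result = real_send(payload, dest)  # forward to the actual library
    elapsed = time.perf_counter() - start
    trace_buffer.append({"call": "send", "dest": dest,
                         "bytes": len(payload), "seconds": elapsed})
    return result


def finalize():
    """At the end of execution, emit the buffered records as text lines."""
    return [f'{r["call"]} dest={r["dest"]} bytes={r["bytes"]}' for r in trace_buffer]


for rank in range(3):
    traced_send(b"x" * 256, dest=rank)

output = finalize()
print(output)
```

The buffer gains one record per intercepted call regardless of whether that record is ever needed, which is where the scaling problem described above originates.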