(1) Field of the Invention
This invention relates to a computer-readable recording medium storing a system analysis program, and an apparatus and method for system analysis, for managing a parallel computing system. In particular, this invention relates to a computer-readable recording medium storing a system analysis program, and an apparatus and method for system analysis, in order to detect abnormalities occurring in an operating parallel computing system.
(2) Description of the Related Art
Parallel computing systems are widely employed in many fields, including research and development (R&D), high performance computing (HPC), and bioinformatics. A “parallel computing system” is a computing system comprising a plurality of computers connected over a network. Parallel computing systems include cluster systems and grid computing systems. Each computer of a parallel computing system is called a node.
Existing parallel computing systems include: (1) a “personal computer (PC) cluster system” (hereinafter referred to as a cluster), which comprises a high-speed network and high-performance PCs and is designed mainly to execute a single parallel program; and (2) a “grid computing system” (hereinafter referred to as a grid), which uses a plurality of computers as one virtual computer according to the computing performance and storage capacity required by the user.
In such parallel computing systems comprising a great number of computing nodes, “program profiling” (equivalent to timer-based sampling or the like, and hereinafter simply referred to as profiling) is performed to manage the operating conditions of the systems.
Profiling software measures system performance data from the start to the end of a target program, for example. Alternatively, the system performance data may be measured only during a prescribed time period while the target program runs.
FIG. 13 shows a prior art method of profiling. In FIG. 13, sampling is performed every 1 ms, and the address being accessed by the program at each sampling is recorded. An address record table 911 is a storage region that stores, for each function, the number of times an address belonging to that function has been sampled.
The address record table 911 has columns for function name, address range, and sampling count. The function name column stores the names of the functions to be executed. The address range column stores the range of memory addresses specified when each function is executed. The sampling count column stores, in association with each function, the number of times an address in its range has been sampled.
As can be seen from this example, eight samplings yield the addresses “0x05”, “0x11”, “0x13”, “0x23”, “0x11”, “0x23”, “0x23”, and “0x23”. At each sampling, the profiling function detects the address being accessed by the Central Processing Unit (CPU). The profiling function then determines, based on the address record table 911, which function and address range the detected address belongs to, and increments the sampling count of that function.
As a result, 8 ms of sampling yields a sampling count of one for a function “Func A”, a sampling count of three for a function “Func B”, a sampling count of four for a function “Func C”, and a sampling count of zero for a function “Func D”. This measurement result shows that the function “Func C” occupied the longest run time (four of the eight samples, or 50% of the total time period).
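The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation shown in FIG. 13: the address ranges assigned to each function below are assumptions chosen to be consistent with the stated counts, since the text does not give the actual ranges held in the address record table 911, and the names `ADDRESS_RECORD_TABLE`, `function_for`, and `count_samples` are hypothetical.

```python
# Hypothetical address ranges (inclusive). The actual ranges in the
# address record table 911 are not given in the text; these are assumed.
ADDRESS_RECORD_TABLE = [
    ("Func A", 0x00, 0x0F),
    ("Func B", 0x10, 0x1F),
    ("Func C", 0x20, 0x2F),
    ("Func D", 0x30, 0x3F),
]


def function_for(address):
    """Return the name of the function whose range contains the address."""
    for name, low, high in ADDRESS_RECORD_TABLE:
        if low <= address <= high:
            return name
    return None  # address falls outside every recorded range


def count_samples(addresses):
    """Increment the sampling count of the function each address belongs to."""
    counts = {name: 0 for name, _, _ in ADDRESS_RECORD_TABLE}
    for address in addresses:
        name = function_for(address)
        if name is not None:
            counts[name] += 1
    return counts


# The eight addresses sampled at 1 ms intervals in the example.
SAMPLES = [0x05, 0x11, 0x13, 0x23, 0x11, 0x23, 0x23, 0x23]

counts = count_samples(SAMPLES)
# Share of the sampled period attributed to each function.
shares = {name: n / len(SAMPLES) for name, n in counts.items()}

for name in counts:
    print(f"{name}: count={counts[name]} share={shares[name]:.0%}")
```

Under the assumed ranges, the sketch reproduces the counts stated above, with “Func C” accounting for half of the samples.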
In order to manage a parallel computing system, profiling is performed for each node in the manner shown in FIG. 13. Then a counting process is performed for each node. For example, there is a technique for obtaining performance information, such as the actual computing time of each of the processors composing a parallel computer, and displaying it as a circle graph or a radar chart (Japanese Laid-open Patent Publication No. 10-63550).
In addition, statistical analysis of data from computing nodes has been studied as a way to extract features that are important for performance evaluation (refer to Dong H. Ahn and Jeffrey S. Vetter, “Scalable analysis techniques for microprocessor performance counter metrics”, Proc. SC 2002).
However, the technique shown in Japanese Laid-open Patent Publication No. 10-63550, when applied to a computing system comprising a plurality of computers, such as a cluster or a grid, has the following two drawbacks.
1. Since a profiling result is output by taking an entire program as one measurement target, an analyst may miss very small changes in behavior that occurred over a very short time while the program ran. This is because their signs are hidden behind all the other data. Such changes may degrade system performance in parallel processing.
2. The amount of profiling data collected increases in proportion to the number of computing nodes. Analysis using profiling requires extracting bottleneck processes by comparing the data between the computing nodes. However, it is practically impossible to perform detailed profiling for thousands of nodes in order to detect very small changes in behavior.
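To make the comparison between nodes mentioned in the second drawback concrete, the following Python sketch flags nodes whose share of samples for one function deviates from the median across nodes. It is a hypothetical illustration only, not a method taken from either cited reference: the node names, the per-node counts, the 0.2 threshold, and the function names `shares` and `outlier_nodes` are all assumptions introduced here.

```python
from statistics import median


def shares(counts):
    """Convert one node's sampling counts into per-function shares."""
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


def outlier_nodes(per_node_counts, function, threshold=0.2):
    """Return nodes whose share for `function` differs from the median
    share across all nodes by more than `threshold` (an assumed cutoff)."""
    node_shares = {
        node: shares(counts).get(function, 0.0)
        for node, counts in per_node_counts.items()
    }
    typical = median(node_shares.values())
    return sorted(
        node for node, s in node_shares.items() if abs(s - typical) > threshold
    )


# Made-up whole-run sampling counts for four nodes (illustration only).
PER_NODE_COUNTS = {
    "node0": {"Func A": 1, "Func B": 3, "Func C": 4},
    "node1": {"Func A": 1, "Func B": 3, "Func C": 4},
    "node2": {"Func A": 1, "Func B": 1, "Func C": 6},
    "node3": {"Func A": 1, "Func B": 3, "Func C": 4},
}

flagged = outlier_nodes(PER_NODE_COUNTS, "Func C")
print(flagged)
```

Even this simple comparison needs the counts of every node to be collected in one place, and because it works on whole-run totals, a brief change in behavior on one node would shift its share only slightly and could stay below the threshold. These are the two drawbacks listed above.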
It should be noted that the aforementioned “Dong H. Ahn and Jeffrey S. Vetter” reference does not mention a technique for collecting profiling data. Therefore, this reference does not contribute to solving the above two problems.