The present invention relates to a performance monitoring method which captures and displays performance data in a computer system in which two or more computers are connected by a network, and to a computer system and a program storage medium therefor, and in particular to a performance monitoring method suitable for a parallel computer or a distributed system.
In a parallel computer or a distributed system, operation is far more complex than in a sequential computer, because the computers which compose the system, called nodes, operate cooperatively in parallel, and the operation of each node depends on that of other nodes, as exemplified by internode communication. In order to use such a parallel computer effectively and to draw sufficient performance from it, it is necessary to grasp accurately not only the operation of each node but also the complex operation status of the parallel computer as a whole, including the causal relations among the operations of the nodes and the balance of loads among the nodes, and to make use of that information to tune the programs being executed.
As prior art which supports the grasping of the operation status of a computer, the following two methods have chiefly been used. The first method is adopted, for example, by PerfView produced by Hewlett-Packard Co. It measures, on each node of a distributed system, performance data related to the operation status of the node, such as the operation status of the CPU, the state of use of the memory, and the communication frequency of the network. The measured performance data is stored in a storage device in the node, such as a magnetic disk storage device. The performance data stored in each node is then accumulated in one computer connected to the distributed system and is displayed, for example graphically, to aid visual grasp of the performance data.
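The store-then-accumulate flow of this first method can be sketched as follows. This is only an illustrative sketch and not PerfView's actual implementation: the function names `store_sample`, `accumulate`, and `demo`, the JSON log format, the file name `perf.log`, and the sample fields are all assumptions introduced here, and a temporary directory stands in for the storage device of each node.

```python
import json
import tempfile
from pathlib import Path


def store_sample(node_dir: Path, sample: dict) -> None:
    """Append one measured sample to the node's local log.

    The directory is a stand-in for the node's own storage device.
    """
    node_dir.mkdir(parents=True, exist_ok=True)
    with open(node_dir / "perf.log", "a", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")


def accumulate(root: Path) -> dict:
    """Gather the logs of all nodes into one mapping on a single computer."""
    merged = {}
    for node_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        log = node_dir / "perf.log"
        if log.exists():
            with open(log, encoding="utf-8") as f:
                merged[node_dir.name] = [
                    json.loads(line) for line in f if line.strip()
                ]
    return merged


def demo() -> dict:
    """Simulate two nodes storing samples, then accumulate them afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical sample values, chosen only for illustration.
        store_sample(root / "node0", {"t": 0, "cpu": 0.75, "mem_mb": 512, "msgs": 10})
        store_sample(root / "node0", {"t": 1, "cpu": 0.80, "mem_mb": 520, "msgs": 12})
        store_sample(root / "node1", {"t": 0, "cpu": 0.20, "mem_mb": 256, "msgs": 3})
        return accumulate(root)
```

In this sketch the accumulated data becomes available only after the node-local logs have been read, so the display necessarily lags behind the measurement.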
The second method is represented, for example, by Visualization Tool produced by IBM. In this method, a process which captures performance data is invoked on each node of a parallel computer, and a display process invoked on a controlling computer connected to the parallel computer through the network receives the performance data from the capturing process of each node in real time and displays it. For instance, refer to "IBM Parallel Environment for AIX Operation and Use Version 2.1.0," pp. 26-3265, 1995 (Document Number GC23-3891-00), issued by International Business Machines Corp.
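The real-time flow of this second method can be sketched in a similar way. Again this is only an illustrative sketch and not the IBM tool's implementation: threads stand in for the capturing processes on the nodes, a `queue.Queue` stands in for the network, and the names `capture`, `display`, and `run` as well as the sample fields are assumptions introduced here.

```python
import queue
import threading


def capture(node_id: int, n_samples: int, channel: queue.Queue) -> None:
    """Capturing process of one node: send each sample as soon as it is measured."""
    for t in range(n_samples):
        # Hypothetical CPU value; a real capturing process would measure it.
        channel.put({"node": node_id, "t": t, "cpu": 0.1 * (node_id + 1)})
    channel.put(None)  # sentinel: this node has finished sending


def display(n_nodes: int, channel: queue.Queue) -> list:
    """Display process: consume samples in arrival order until all nodes finish."""
    shown = []
    finished = 0
    while finished < n_nodes:
        sample = channel.get()
        if sample is None:
            finished += 1
            continue
        shown.append(sample)  # a real display process would redraw a chart here
    return shown


def run(n_nodes: int = 3, n_samples: int = 4) -> list:
    """Start one capturing thread per node and display samples as they arrive."""
    channel = queue.Queue()
    workers = [
        threading.Thread(target=capture, args=(i, n_samples, channel))
        for i in range(n_nodes)
    ]
    for w in workers:
        w.start()
    shown = display(n_nodes, channel)
    for w in workers:
        w.join()
    return shown
```

Samples from different nodes may interleave in any order at the display process, while the samples of any single node arrive in the order in which that node sent them.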