Large-scale systems that include many nodes have been implemented in recent years. A large-scale system of this type includes three types of nodes: a node that executes calculation processing as commanded by the user (referred to below as the calculation node), a node that operates as a file server or a database (DB) server for the calculation node (referred to below as the input-output (IO) node), and a node that manages the entire system (referred to below as the management node).
One of the important roles of the management node is to monitor calculation nodes and IO nodes for abnormal conditions and, if an abnormal condition occurs, to execute processing to deal with it. In a general monitoring method, the nodes to be monitored (calculation nodes and IO nodes) and the management node exchange existence-confirming messages with each other at predetermined time intervals.
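The general monitoring method described above can be illustrated with a minimal sketch. The class name HeartbeatMonitor, its methods, the node identifiers, and the timeout value are all hypothetical and are not taken from the related art. The sketch only assumes that the management node records the time at which each existence-confirming message arrives and treats a node as abnormal when no message has arrived within a timeout.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical management-node bookkeeping for existence-confirming messages."""

    timeout: float  # assumed: seconds of silence after which a node is treated as abnormal
    last_seen: dict = field(default_factory=dict)  # node id -> time of last message

    def record(self, node_id: str, now: float) -> None:
        # Called whenever an existence-confirming message arrives from node_id.
        self.last_seen[node_id] = now

    def abnormal_nodes(self, now: float) -> list:
        # A node is flagged when its last message is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("calc-0", now=0.0)
monitor.record("io-0", now=0.0)
monitor.record("calc-0", now=2.0)   # calc-0 keeps reporting, io-0 goes silent
print(monitor.abnormal_nodes(now=4.0))  # ['io-0']
```

Every monitored node contributes one entry to last_seen and one message per interval, so the work done by a single management node in this sketch grows in proportion to the number of monitored nodes.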
If, however, many nodes are to be monitored, the processing load on the management node becomes large. A possible way to reduce this load is to divide the nodes to be monitored among a plurality of management nodes. It is also effective to reduce the processing load incurred per message.
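Dividing the monitored nodes among a plurality of management nodes can be sketched as follows. The text does not state how the division is made, so the round-robin rule used here is an assumption, and the function name assign_nodes and the node identifiers are hypothetical.

```python
def assign_nodes(monitored: list, managers: list) -> dict:
    """Assign monitored nodes to management nodes in round-robin order.

    Round-robin is an assumed policy chosen for illustration only.
    """
    if not managers:
        raise ValueError("at least one management node is required")
    assignment = {manager: [] for manager in managers}
    for index, node in enumerate(monitored):
        assignment[managers[index % len(managers)]].append(node)
    return assignment


monitored = [f"node-{i}" for i in range(10)]
assignment = assign_nodes(monitored, ["mgmt-0", "mgmt-1", "mgmt-2"])
for manager, nodes in assignment.items():
    print(manager, len(nodes))
# mgmt-0 4
# mgmt-1 3
# mgmt-2 3
```

Each management node then handles roughly the total number of monitored nodes divided by the number of management nodes, but the total number of messages exchanged in the system is unchanged.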
If, however, there are many nodes to be monitored, even the technologies described above cannot be said to sufficiently reduce the processing load that detecting abnormal conditions places on the entire system. Related art is disclosed in, for example, Japanese Laid-open Patent Publication No. 2000-187598, Japanese Laid-open Patent Publication No. 10-049507, and International Publication Pamphlet No. WO2014/103078.