In recent years, the advent of cloud computing has raised the issue of how to guarantee the quality of a service that a provider delivers to a user, namely, how to maintain an SLA (Service Level Agreement). Therefore, a cluster system composed of a plurality of nodes, as shown in Patent Document 1, has a redundant configuration so that, even when a node providing a service goes down, another node takes over the service, thereby guaranteeing the quality of the service. On the other hand, for clusterware installed in such a cluster system, it is also an issue how quickly and accurately the states of the nodes (operation statuses, or whether a fault has occurred) can be grasped in order to realize a higher SLA.
Alive monitoring of nodes in a cluster system is performed in such a manner that the nodes check each other's operation states by using, as a communication path, any means that enables the nodes to exchange information, such as a LAN (Local Area Network), serial ports, or a shared disk. In a case where communication with a certain node is impossible for a given time or more, that node is judged to be in the down state.
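The timeout-based judgment described above can be illustrated with a minimal sketch. It is not taken from Patent Document 1 or from any particular clusterware; the class name `HeartbeatMonitor`, its methods, the node name and the 5-second timeout are all hypothetical, and timestamps are passed in explicitly so that the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor that tracks when each peer node was last heard from."""

    timeout: float  # seconds of silence after which a node is judged down (assumed value)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Called whenever a message from `node` arrives over the communication path.
        self.last_seen[node] = now

    def is_down(self, node: str, now: float) -> bool:
        # A node that has never been heard from is treated as down in this sketch.
        last = self.last_seen.get(node)
        if last is None:
            return True
        # "Impossible to communicate for a given time or more" maps to >= timeout.
        return now - last >= self.timeout


monitor = HeartbeatMonitor(timeout=5.0)
monitor.record_heartbeat("node-b", now=0.0)
print(monitor.is_down("node-b", now=3.0))  # silence of 3 s, below the timeout
print(monitor.is_down("node-b", now=5.0))  # silence of 5 s, timeout reached
```

Note that the monitor only observes the absence of messages; it cannot distinguish a node that has stopped from one that is merely unable to send.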
[Patent Document 1] Japanese Unexamined Patent Application Publication No. JP-A 2006-79161
However, in the abovementioned method, a LAN, serial ports, a shared disk and so on are all controlled as management resources of an OS (Operating System), and are therefore affected not only by a physical fault of the communication path but also by the operation state of the OS. For example, in a case where the OS of a specific node comes under a high load and cannot communicate with other nodes for a given time or more, the other nodes consider the specific node to be in the down state even though it is not actually down.
Further, in a case where a node goes down due to a hardware fault or the like, a certain time or more elapses before the node is judged to be in the down state, and therefore system switching cannot be executed instantly. For example, when a CPU (Central Processing Unit) fault causes a power interruption and one node goes down, a certain time or more elapses before another node judges that node to be in the down state.
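The two drawbacks described above, a false down judgment under high load and a delayed down judgment after a real fault, can be reproduced with a small simulation. This is only a sketch under stated assumptions: the function `first_down_judgment`, the 5-second timeout, the 1-second check interval and all timestamps are hypothetical values chosen for illustration.

```python
from typing import List, Optional


def first_down_judgment(
    heartbeats: List[float], timeout: float, check_times: List[float]
) -> Optional[float]:
    """Return the first check time at which the peer is judged down, or None."""
    for now in check_times:
        received = [t for t in heartbeats if t <= now]
        if not received:
            continue  # nothing heard yet; this sketch does not judge in that case
        if now - max(received) >= timeout:
            return now
    return None


# The monitoring node checks once per second from t = 0 to t = 12 (assumed).
checks = [float(t) for t in range(13)]

# Case 1: the OS of the peer is under high load between t = 2 and t = 9.
# The peer is alive and resumes sending at t = 9, yet it is judged down first.
stalled = first_down_judgment([0.0, 1.0, 2.0, 9.0, 10.0], 5.0, checks)
print("alive but stalled node judged down at t =", stalled)

# Case 2: the peer really fails at t = 2.5, just after its last heartbeat.
crash_time = 2.5
crashed = first_down_judgment([0.0, 1.0, 2.0], 5.0, checks)
print("failed node judged down at t =", crashed)
print("detection delay in seconds:", crashed - crash_time)
```

In both cases the monitoring node sees the same silence, so it reaches the same judgment at the same time. Shortening the timeout reduces the delay in the second case but makes the false judgment in the first case more likely.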
Thus, a cluster system has a problem in that the statuses of nodes cannot be grasped accurately and rapidly; consequently, the nodes cannot be switched rapidly and the reliability of the system decreases.