In a cluster system including a cluster operation node and a cluster stand-by node, a network monitor executed in the cluster operation node periodically transmits an existence confirmation message to a network relay apparatus. The network relay apparatus is connected with, for example, an intranet or the Internet, and relays job requests from other computers connected with the Internet or the like to the cluster operation node. When the transaction Local Area Network (LAN) between the network relay apparatus and the cluster operation node is functioning normally, the network relay apparatus returns a response to the existence confirmation message. However, when a communication failure occurs due to, for example, a failure of a Network Interface Card (NIC) in the cluster operation node or a failure of the network relay apparatus, no response is returned from the network relay apparatus. When no response is obtained from the network relay apparatus for a predetermined number of messages, the network monitor notifies a cluster manager executed in the cluster operation node of the occurrence of a network failure. The cluster manager in the cluster operation node then stops the business application being executed in the cluster operation node and causes the network monitor to deactivate the inherited IP address in use. Next, in response to a request from the cluster manager in the cluster operation node, the network monitor in the cluster stand-by node activates the inherited IP address, and the business application is activated in the cluster stand-by node. The subsequent business processing is thereby taken over by the cluster stand-by node.
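The monitoring and switching sequence described above can be illustrated with a minimal Python sketch. It is not taken from any actual cluster product: the names `Node`, `first_failure_probe` and `switch_nodes` are hypothetical, the probe results are supplied as a list of booleans instead of being read from a real network, and the assumption that the predetermined number of messages means consecutive unanswered messages is made only for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of a cluster node; the fields are illustrative only."""
    name: str
    app_running: bool = False
    ip_active: bool = False


def first_failure_probe(responses, max_missed):
    """Return the index of the probe at which a network failure is reported.

    `responses` holds one boolean per existence confirmation message
    (True = the relay apparatus answered). A failure is reported once
    `max_missed` consecutive messages go unanswered (an assumption of
    this sketch). Returns None when no failure is reported.
    """
    missed = 0
    for index, answered in enumerate(responses):
        missed = 0 if answered else missed + 1
        if missed >= max_missed:
            return index
    return None


def switch_nodes(operation, standby):
    """Apply the conventional switching order described in the text."""
    operation.app_running = False   # stop the business application
    operation.ip_active = False     # deactivate the inherited IP address
    standby.ip_active = True        # activate the inherited IP address
    standby.app_running = True      # activate the business application


if __name__ == "__main__":
    op = Node("operation", app_running=True, ip_active=True)
    sb = Node("stand-by")
    probes = [True, True, False, False, False]
    if first_failure_probe(probes, max_missed=3) is not None:
        switch_nodes(op, sb)
    print(op)
    print(sb)
```

In this sketch a single answered message resets the counter, so an isolated lost response does not trigger node switching.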
Incidentally, JP-A-H04-291628 discloses a technique for automatically recovering from a failure that occurs in a composite subsystem controller in a composite-subsystem-type online system. Specifically, a controller monitor that has detected a failure of the controller instructs a hot stand-by start when a stand-by job exists. When there is no stand-by job, all subsystems under the controller are stopped and the controller is then activated. The state is restored, from the latest checkpoint and the journal information obtained after that checkpoint, up to the state at which the processor in the execution system went down, and the processing then proceeds. Thus, when the failure in the composite subsystem controller has a temporary, timing-dependent cause rather than a hardware cause, the failure can be recovered automatically and the processing can proceed. However, this technique cannot handle a failure occurring in the network.
In the aforementioned conventional art, when a network failure is notified to the cluster manager in the cluster stand-by node, the cluster stand-by node is abandoned from then on, that is, it enters an inoperable state. This inoperable state is also notified to the cluster manager of the cluster operation node. If a network failure is subsequently notified to the cluster manager in the cluster operation node, it is determined that there is no switching destination node, and node switching is not carried out. In the cluster operation node, the business application is stopped and the inherited IP address is deactivated, so the business processing stops at that point. When the failure that occurred is a hardware failure, for example in the NIC of the cluster operation node or the cluster stand-by node or in the network relay apparatus, this control is appropriate.
However, when the load on the network relay apparatus becomes high due to an increase in communication traffic in the transaction LAN, the network relay apparatus may temporarily fail to respond to the existence confirmation message from the network monitor, or its response may be delayed. In addition, some network relay apparatuses assign priorities to traffic and, under high load, discard low-priority traffic and process only high-priority traffic. Such a network relay apparatus may not respond to the existence confirmation message. When congestion occurs only temporarily in the transaction LAN, communication can be expected to become possible again after some time passes. On the other hand, when the network monitors in both the cluster operation node and the cluster stand-by node detect the network failure and notify the cluster manager, one of the following occurs: (1) immediately after switching from the cluster operation node to the cluster stand-by node, the failure is also detected in the cluster stand-by node and the business application is stopped; or (2) because the failure is detected in the cluster stand-by node, the cluster stand-by node is abandoned, and the business application in the cluster operation node is stopped without node switching being carried out.
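The two outcomes (1) and (2) can be reproduced with a small Python model of the conventional control. This is only a sketch under stated assumptions: the function `simulate` and its data layout are hypothetical, each failure notification is reduced to the name of the node that reported it, and the order of the notifications is the only input.

```python
def simulate(reports):
    """Model the conventional control for a two-node cluster.

    `reports` lists, in order of arrival, the node ("operation" or
    "standby") whose network monitor reported a failure. Returns a tuple
    (active, switches): the node running the business application at the
    end (None if it is stopped) and the number of node switchings.
    """
    operable = {"operation": True, "standby": True}
    active = "operation"
    switches = 0
    for node in reports:
        if not operable[node]:
            continue                      # node was already abandoned
        operable[node] = False            # reporting node becomes inoperable
        if node != active:
            continue                      # stand-by abandoned, nothing to switch
        survivors = [n for n, ok in operable.items() if ok]
        if survivors:
            active = survivors[0]         # node switching
            switches += 1
        else:
            active = None                 # no destination: business stops
    return active, switches


if __name__ == "__main__":
    print(simulate(["operation"]))             # single failure
    print(simulate(["operation", "standby"]))  # case (1)
    print(simulate(["standby", "operation"]))  # case (2)
```

With a failure reported by only one node the business application survives on the other node, whereas in both two-report orderings the model ends with the business application stopped, which corresponds to the situation described above.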
Thus, because the resources of both the cluster operation node and the cluster stand-by node enter an abnormal state, there is a problem that, even when the network subsequently recovers, the business processing cannot be resumed unless an operator operates the cluster manager in each node from a management console to reactivate the business application. In addition, before the business application is reactivated, it is necessary to collect data to investigate why the business processing stopped and to carry out checks to confirm that the business processing can be restarted without any problem, which takes time and labor.
Thus, when a network failure occurs temporarily due to an increase in communication traffic in the network, there is no guarantee that the business processing continues even if it is handed over from the cluster operation node to the cluster stand-by node by node switching. In addition, when a maintenance operation such as a firmware update of the network relay apparatus is carried out, an operation mistake in which the network relay apparatus is rebooted without first stopping the monitoring by the network monitor also causes both the cluster operation node and the cluster stand-by node to stop, and there is a problem that the restart takes time and labor.
Normally, the node switching control performed when a fatal error occurs in an application operating in a cluster system depends on that cluster system. When the error occurs, the network monitor merely notifies the cluster manager of the error, and whether or not the business processing can continue in other nodes is not considered. In addition, although a normal cluster system provides an interface (commands, an Application Program Interface (API) or the like) for referring to the state of the application in each node and judging whether or not a node is already in the inoperable state, it is impossible to judge correctly whether or not the business processing can continue in other nodes when the error is detected almost simultaneously in each node.