The cluster service architecture provided by Microsoft Corporation of the U.S., i.e. Microsoft Cluster Service (MSCS), is directed to a solution for total fault-tolerant management of platform resources. It can manage not only the fault-tolerant capability of application programs but also disk drives, printers and other Microsoft software systems, such as SQL Server 2000 and Exchange Server 2000. However, when the node detection mechanism of MSCS is applied in a relatively complicated cluster, all of the nodes therein send periodic heartbeats to notify the other nodes that “I am alive!”, thus resulting in a heavier network burden.
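To make the network burden concrete, the following is a minimal sketch that counts the heartbeat messages sent per period. It assumes each node sends one unicast heartbeat to every other node. That assumption and the function name `all_to_all_heartbeats` are illustrative only and are not taken from MSCS documentation.

```python
def all_to_all_heartbeats(nodes: int) -> int:
    """Heartbeat messages per period when every node notifies every other node.

    Assumption (not from the source): one unicast message per ordered pair
    of distinct nodes, so the count is n * (n - 1).
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1)


if __name__ == "__main__":
    # Show how the per-period message count grows with cluster size.
    for n in (2, 4, 8, 16):
        print(n, all_to_all_heartbeats(n))
```

Under this assumption, a 4-node cluster exchanges 12 heartbeats per period and a 16-node cluster exchanges 240, so the traffic grows quadratically with cluster size.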
With regard to existing patents, U.S. Pat. No. 6,636,982 entitled “Apparatus and method for detecting the reset of a node in a cluster computer system” provides a scheme for adding load-balancing cluster nodes to a cluster environment. Among the nodes activated in the original cluster environment, one node acts as a master node (usually the node activated first). When a new node is to be added to the cluster environment, the master node determines whether the new node is allowed to be added; this determination includes verifying that the network connection of the new node is consistent with its configuration, etc. If the new node is permitted to be added, the master node commands the new node to provide service together with the existing nodes. This prior patent mainly provides a verification scheme for cluster nodes to ensure that new nodes are added to the cluster environment correctly. However, it fails to provide a heartbeat communication method among the nodes in the updated cluster environment after the new nodes are added.
U.S. Pat. No. 6,502,203 entitled “Method and apparatus for cluster system operation” provides the concept of using a secondary channel, wherein nodes in a normal cluster environment issue heartbeats via a primary channel. When a node in the cluster is detected to be abnormal, a heartbeat is sent via the secondary channel for further validation. If the results on the two channels are the same, the node detected as lost is confirmed to be abnormal. The main purpose of this prior patent is to prevent abnormal cluster operation due to heartbeat loss by using multiple communication channels. However, the method of this prior patent resolves the problem only superficially, not fundamentally, and does not provide any substantial improvement on the conventional heartbeat communication method, which is relatively complicated.
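The two-channel validation described above can be sketched as follows. This is a simplified illustration and not the implementation of the cited patent: the `Probe` type, the function `confirm_failure`, and the idea of modelling each channel as a callable are hypothetical names introduced here.

```python
from typing import Callable

# Hypothetical probe type: takes a node name and returns True if the node
# answered a heartbeat on that channel.
Probe = Callable[[str], bool]


def confirm_failure(node: str, primary: Probe, secondary: Probe) -> bool:
    """Return True only if the node is silent on both channels."""
    if primary(node):
        # The node answered on the primary channel, so nothing to validate.
        return False
    # The primary channel reports a loss; validate it on the secondary channel.
    return not secondary(node)


if __name__ == "__main__":
    silent = lambda name: False
    alive = lambda name: True
    print(confirm_failure("node-a", silent, silent))  # loss on both channels
    print(confirm_failure("node-a", silent, alive))   # primary channel fault only
```

A node is declared abnormal only when both channels agree, which guards against a fault in the primary channel itself. In normal operation every node still sends heartbeats on the primary channel, so the heartbeat traffic is not reduced.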
U.S. Pat. No. 5,502,812 entitled “Method and system for automatic fault detection and recovery in a data processing system” adds one or more backup elements for each member in a data-processing system, and uses the signal sent by a watchdog circuit to check whether the member in execution is abnormal. If a fault occurs, the tasks undertaken by that member are transferred to the backup elements for continued execution. This prior patent mainly provides a redundancy mechanism for a single-unit hardware environment. However, it does not support distributed structures, and provides merely 1:1 backup support rather than 1:N backup support.
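As a rough illustration of watchdog-driven 1:1 failover, the sketch below models a generic software watchdog. It is not the circuit of the cited patent. The class `WatchdogFailover`, its methods, and the explicit `now` timestamps are assumptions made so that the example is deterministic.

```python
class WatchdogFailover:
    """Generic 1:1 failover driven by a watchdog timeout (illustrative only)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout      # longest allowed gap between kicks, in seconds
        self.last_kick = 0.0        # time of the most recent "alive" signal
        self.active = "primary"     # which element currently executes the tasks

    def kick(self, now: float) -> None:
        """Record an 'alive' signal from the executing member."""
        self.last_kick = now

    def check(self, now: float) -> str:
        """Transfer the tasks to the backup once the primary misses its deadline."""
        if self.active == "primary" and now - self.last_kick > self.timeout:
            self.active = "backup"
        return self.active


if __name__ == "__main__":
    w = WatchdogFailover(timeout=2.0)
    w.kick(now=1.0)
    print(w.check(now=2.5))  # within the deadline
    print(w.check(now=3.5))  # deadline missed, tasks move to the backup
```

Because each member is paired with its own backup, the sketch has nowhere to transfer the tasks if the backup also fails, which is the 1:1 limitation noted above.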
U.S. Pat. No. 6,212,649 entitled “System and method for providing highly-reliable coordination of intelligent agents in a distributed computing” discloses an intelligent agent that detects whether information transmitted in a distributed system is correct. If a fault occurs, the sending agent is asked to re-send the information, thereby improving system reliability. However, if the receiving agent itself suffers an error such as a system crash, it cannot return to normal operation even if the information is re-sent. Further, a distributed object system built by applying this prior patent also lacks a recovery function for faulty programs. Thus, when programs in the system have errors, users cannot freely select other normal services in the system to replace the faulty programs.
Hence, there is an urgent need to develop a method for providing a fault-tolerant application cluster service that simplifies the detection process, achieves better fault-tolerant efficiency for application programs, reduces the network burden, and overcomes the shortcomings of the conventional art.