The present invention relates to a high availability (HA) computer system, in which one of two server computers carries out a process as a master server computer and the other server computer takes over the process when a fault occurs in the master server computer. More particularly, the present invention relates to a method for determining the server computer which executed the process most recently when a server computer is restored from a fault.
Various kinds of cluster type fault tolerant computer systems have been developed. Generally, a cluster type fault tolerant computer system is constructed by connecting a plurality of server computers (hereinafter referred to as servers), for example, two servers, through a network or the like. A feature of this type of computer system is that even if a fault occurs in one server, another server takes over the process (service) halted by the fault in order to maintain the availability of the entire system. For this reason, this type of computer system is called an HA (high availability) computer system.
Some HA computer systems include a shared storage unit such as a shared disk drive. In such a computer system, generally, while one server is carrying out the process, the shared storage unit holds the information necessary for handing over the process from that server to the other server. Consequently, when faults occur in both servers and then both servers, or either one of them, are restored from the faults, any restored server can easily take over the process by using the aforementioned information stored in the shared storage unit.
However, some HA computer systems do not have a shared storage unit. In such a computer system, generally, while one server is carrying out the process, that server sends the information necessary for taking over the process to the other server so that the process can be handed over between the servers. Consequently, if a fault occurs in the server which is carrying out the process so that it becomes incapable of continuing the process, the other server is capable of taking over that process by using the information received from the one server up to then. That is, the process can be handed over from the one server to the other.
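The hand-over without a shared storage unit can be illustrated with a minimal Python sketch. It is not taken from any actual system: the class `Server`, its fields, and the function `fail_over` are hypothetical names introduced only for illustration, and the network transfer is modeled as a direct in-memory copy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Server:
    """Hypothetical server model; all names here are illustrative assumptions."""
    name: str
    alive: bool = True
    takeover_info: dict = field(default_factory=dict)  # state held locally

    def execute(self, key: str, value: str, peer: "Server") -> None:
        """Carry out one step of the process and forward the takeover info."""
        self.takeover_info[key] = value
        if peer.alive:
            # No shared storage unit: the state must be sent to the other server.
            peer.takeover_info[key] = value


def fail_over(master: Server, slave: Server) -> Optional[Server]:
    """Mark the master as faulty and return the server that continues, if any."""
    master.alive = False
    return slave if slave.alive else None


a, b = Server("A"), Server("B")
a.execute("request-1", "done", peer=b)  # A is the master, B the slave
new_master = fail_over(a, b)            # a fault occurs in A
print(new_master.name, new_master.takeover_info)
```

In this sketch the slave can continue only because it received the information while it was running. If the slave had been faulty during `execute`, its copy would be missing that step.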
However, it is not easy to hand over the process from one server to the other when faults occur in both of the servers at the same time. The reason is that, for example, when both of the servers are restored from the faults, it must be determined which server should continue the process. Further, when only one of them is restored from the fault, it must be determined whether or not the restored server should take over the process. The conventional technology for this determination will be described below.
Suppose that faults occur in the two servers and that both of the servers are subsequently restored from the faults. For the server which was not carrying out the process up to then (slave server) to take over the process, the slave server needs to have been given the information for taking over the process by the server which was carrying out the process up to then (master server). However, if the slave server was already in a fault state before the fault occurred in the master server which carried out the process most recently, the master server has not sent the information for handing over the process to the slave server. In this case, the slave server cannot take over the process. Thus, when both servers are restored from the faults, it is necessary to determine (select) the server which carried out the process most recently as the server which should take over the process.
Conversely, if the information for taking over the process was sent from the master server to the slave server before the fault occurred in the master server which carried out the process most recently, it appears that either server is capable of continuing the process when both of the servers are restored from the faults. However, if the operation which the master server was carrying out just before the fault occurred was the sending of information necessary for taking over the process to the slave server, the fault may have occurred before the sending of that information was completed. Considering this possibility, it is necessary to select the server which carried out the process most recently in this case also. Further, if only one of the two servers is restored from the fault, the condition which allows that server to take over the process is generally that the server executed the process most recently. The reason why this condition is employed is the same as when both of the servers are restored from the faults.
For the reasons described above, conventionally, one of the following two methods has been employed in order to determine the server which carried out the process most recently.
(1) Method in which taking-over of the process is limited to once
In advance, one of the two servers is set as a primary server while the other is set as a secondary server. Operation is first started with the primary server as the master and the secondary server as the slave. Here, the master carries out a process requested by a client (client computer) and sends the information necessary for the taking-over to the slave. The slave receives the information for the taking-over sent from the master and stores it in its local external storage unit such as a disk drive unit. If the secondary server takes over the process because a fault occurs in the primary server, the primary server is not used as a slave even after it is restored from the fault. That is, the taking-over of the process is limited to once. In this case, if the secondary server stores, in its own external storage unit, whether or not it has carried out the process, it is possible to determine which server carried out the process most recently. However, because this conventional method limits the taking-over of the process to once, it is not capable of achieving automatic operation in which the process is continued as long as possible even if a fault occurs in one or both of the servers, or one or both of the servers are restored from the fault, at any time.
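The determination used by method (1) can be sketched in Python as follows. This is only an illustration under stated assumptions: the file name `secondary_flag.json`, the JSON layout, and the function names are hypothetical, and a temporary directory stands in for the secondary server's local external storage unit.

```python
import json
import os
import tempfile


def record_takeover(flag_path):
    """Persist, on the secondary's local storage, that it took over the process."""
    with open(flag_path, "w") as f:
        json.dump({"took_over": True}, f)


def secondary_took_over(flag_path):
    """Read the persisted flag; a missing file means no takeover has happened."""
    if not os.path.exists(flag_path):
        return False
    with open(flag_path) as f:
        return bool(json.load(f).get("took_over", False))


def most_recent_executor(flag_path):
    """With at most one takeover, the flag alone identifies the latest executor."""
    return "secondary" if secondary_took_over(flag_path) else "primary"


workdir = tempfile.mkdtemp()  # stands in for the secondary's disk drive unit
flag = os.path.join(workdir, "secondary_flag.json")
before = most_recent_executor(flag)  # no takeover recorded yet
record_takeover(flag)                # the primary fails, the secondary takes over
after = most_recent_executor(flag)
print(before, after)
```

A single boolean is sufficient here only because the process can never return to the primary server. This is the restriction that prevents continued automatic operation.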
(2) Method which uses time information
In this method, the clocks (time) of the two servers are set in advance. When a server starts the process, it stores the current time in its own external storage unit. Consequently, when both of the servers are restored from the faults, they exchange the time information stored in their external storage units through a network, and the server which has the newer time information can be determined to be the server which carried out the process most recently. This method rests on the assumption that time has a global meaning, that is, that the clocks of the respective servers are always synchronized with each other. However, actual clocks are not always synchronized, and therefore this method has a problem in its determination accuracy. Further, if only one server is restored from a fault, that server is not capable of determining whether it carried out the process most recently, because it is not capable of sending time information to or receiving it from the other server.
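The two weaknesses of method (2) can be made concrete with a small Python sketch. The function names, the server labels "A" and "B", and the numeric times and clock offset are all hypothetical values chosen for illustration. Clocks are modeled simply as a true time plus a per-server offset.

```python
def newer_by_timestamp(stamp_a, stamp_b):
    """Pick the server whose stored start time is later; needs both stamps."""
    if stamp_a is None or stamp_b is None:
        return None  # only one server restored: no comparison is possible
    return "A" if stamp_a > stamp_b else "B"


def local_stamp(true_time, clock_offset):
    """Time as recorded by a server whose clock deviates by clock_offset."""
    return true_time + clock_offset


# A starts the process at true time 100. B takes over at true time 105,
# but B's clock runs 10 seconds behind A's clock.
stamp_a = local_stamp(100.0, 0.0)    # A records 100.0
stamp_b = local_stamp(105.0, -10.0)  # B records 95.0
chosen = newer_by_timestamp(stamp_a, stamp_b)
print(chosen)  # prints "A", although B carried out the process most recently

# If only A is restored, B's stamp is unavailable and nothing can be decided.
alone = newer_by_timestamp(stamp_a, None)
print(alone)  # prints "None"
```

Under these assumed values, a clock difference larger than the interval between the two start times is enough to reverse the result.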
As described above, the conventional HA computer system in which one of two servers carries out a process and, if a fault occurs in that server, the other server is capable of taking over the process, has no shared storage unit. It therefore employs either the “method in which the taking-over of the process is limited to once” or the “method using time information” as the method for determining the server which carried out the process most recently. However, the method in which the taking-over of the process is limited to once has the problem that automatic operation is disabled because the taking-over of the process can be conducted only once. On the other hand, the “method using time information” has a problem in determination accuracy because it depends on time. Further, there is also the problem that if only one server is restored from a fault, that server is not capable of determining whether or not it is the server which carried out the process most recently.