In recent years, computer systems have become an essential part of the infrastructure that supports daily life. Such systems are required to operate 24 hours a day, without interruption, so as to provide their services continuously. For example, an online banking system relies on database processing as an essential operation. The database system must not be halted, so that continuous updating remains possible.
In a computer system requiring such high reliability, which must not be halted even for a moment, the system currently performing tasks (the active computer system) typically has a backup computer system, which stands by as a replacement to take over the tasks in the event that the active system experiences a failure. The failover procedure, from the detection of a failure in the active system up to its replacement by the backup (stand-by) system, is provided by a cluster program. In order to take over the processing, any data used by the application and the operating system has to be carried over. For example, in the database system described above, the information on the volume in which the data to be processed is stored must be carried over.
However, fault detection in the cluster program is designed to prevent tasks from being taken over because of a temporary error or a mistakenly detected failure, so the tasks are taken over by the stand-by system only when the failure has been detected repeatedly a given number of times. As a result, there is a delay between the first detection of a failure and the actual take-over of the system processing. If the failure that has occurred in the active system is destructive, for example because of a bug in an application or a runaway condition of the system, then some of the data will be destroyed during this period, and the destroyed data will not be recoverable. Therefore, there is a problem in that the stand-by system cannot efficiently take over the processing of the data.
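The delay described above can be illustrated with a minimal sketch. It assumes a detector that polls the active system at a fixed interval and decides on failover only after a given number of consecutive failed checks. The names `FailureDetector` and `failover_delay`, the polling model, and the example values are hypothetical and are not taken from any particular cluster program.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureDetector:
    """Hypothetical detector: fail over only after `threshold` consecutive failed checks."""
    threshold: int
    consecutive_failures: int = 0

    def observe(self, healthy: bool) -> bool:
        """Record one health check; return True when failover should start."""
        if healthy:
            # A successful check clears the count, so transient errors are ignored.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


def failover_delay(checks: List[bool], threshold: int, interval_s: float) -> Optional[float]:
    """Seconds from the first failed check of the decisive run to the failover decision.

    Returns None if the threshold is never reached.
    """
    detector = FailureDetector(threshold)
    first_failure_index: Optional[int] = None
    for i, healthy in enumerate(checks):
        if healthy:
            first_failure_index = None
        elif first_failure_index is None:
            first_failure_index = i
        if detector.observe(healthy):
            return (i - first_failure_index) * interval_s
    return None


# Assumed example: one transient error, then a persistent failure,
# with a threshold of 3 checks at 5-second intervals.
history = [True, False, True, False, False, False]
delay = failover_delay(history, threshold=3, interval_s=5.0)
```

Under these assumed values the active system keeps running for two further check intervals after the persistent failure is first observed. This is the window in which a destructive failure can continue to damage data before the take-over begins.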
A database system stores the data it processes in a volume (VOL) created on a disk. To protect this data, a technique in which the disk drive device creates a replica of the VOL has been widely used. For example, JP-A No. 134456/2003 discloses a system in which two computer systems each have a disk drive device, the two devices being interconnected so as to replicate a VOL in either disk. In this system, the primary volume, which is the source of the data, is connected to the active computer system, and the secondary volume, which is the destination of the data, is connected to the backup system independently, so that a disk-to-disk communication procedure is required. This system has a problem in that the disk replication is not properly completed if the disk-to-disk communication fails or behaves abnormally, not to mention the case of a bug in an application or a runaway condition.
U.S. Pat. No. 6,401,178 discloses a system in which data replication is performed within a disk drive. When this is applied to a VOL, a replica of the VOL is created within the disk, so that no disk-to-disk communication error can arise. The VOL replication procedure consists of a pair configuration and a pair split, performed on a pair of VOLs made up of the primary VOL, which is the source of the data, and the secondary VOL, which is the destination. A pair configuration is a means for rapidly generating a secondary VOL as a replica of the primary VOL by synchronizing all data, including the physical volume identifier (PVID) and the volume information. Thus, when a pair is configured, the PVIDs of the primary and secondary VOLs are the same, and the two VOLs are treated as one single VOL by the superior computer system. Pair splitting, on the other hand, is a process that rewrites the PVID of the secondary VOL of the paired VOLs to a PVID different from that of the primary VOL. The paired VOLs, which are seen as one single VOL from the superior computer system, then appear as two separate VOLs in the pair split status. These two means make it possible to generate a replica of a primary VOL and to operate the replica thus created, the secondary VOL, from the computer system.
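The relationship between pair configuration, pair split, and the PVID can be sketched as a small in-memory model. This is only an illustration of the behavior described above, not the interface of any disk drive device. The `Volume` class, the function names, and the use of a random identifier as the new PVID are all assumptions.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Volume:
    """Hypothetical VOL: a PVID plus a block map standing in for the stored data."""
    pvid: str
    blocks: Dict[int, bytes] = field(default_factory=dict)


def configure_pair(primary: Volume, secondary: Volume) -> None:
    """Pair configuration: synchronize all data, including the PVID, to the secondary."""
    secondary.blocks = dict(primary.blocks)
    secondary.pvid = primary.pvid


def split_pair(primary: Volume, secondary: Volume) -> None:
    """Pair split: rewrite the secondary's PVID so that it differs from the primary's."""
    new_pvid = uuid.uuid4().hex  # assumed way of choosing a fresh identifier
    while new_pvid == primary.pvid:
        new_pvid = uuid.uuid4().hex
    secondary.pvid = new_pvid


def visible_volumes(*vols: Volume) -> int:
    """Number of distinct VOLs a computer system would see, distinguishing them by PVID."""
    return len({v.pvid for v in vols})


primary = Volume(pvid="pvid-primary", blocks={0: b"record"})
secondary = Volume(pvid="pvid-unused")

configure_pair(primary, secondary)
seen_while_paired = visible_volumes(primary, secondary)

split_pair(primary, secondary)
seen_after_split = visible_volumes(primary, secondary)
```

In this model the pair appears as one VOL while configured and as two VOLs after the split, and the secondary retains the data that was synchronized before the split.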
A method can be devised in which the data to be taken over is protected by shadowing it from the primary VOL to the secondary VOL using this volume replication function, so that recovery can be made from a non-corrupted state. However, such a method has problems. First, while data replication is enabled, corrupted data may be copied to the destination, so that the data in the secondary VOL, which is the very data to be protected, may also be corrupted. Second, if the data capacity is huge, as in a database system, the backup process may take a few hours, which makes frequent backup operations difficult. In addition, recovering the data status at the time of the failure from the backup data may involve a great deal of time and effort.
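Both problems can be made concrete with a short sketch. The first part models, under simplifying assumptions, a pair in which every write to the primary VOL is mirrored to the secondary VOL while replication is enabled. The second part is a rough estimate of the time for one full copy. The class `MirroredPair`, the function `full_backup_hours`, and all numeric values are hypothetical.

```python
from typing import Dict


class MirroredPair:
    """Hypothetical model of a primary/secondary VOL pair with replication enabled."""

    def __init__(self) -> None:
        self.primary: Dict[int, bytes] = {}
        self.secondary: Dict[int, bytes] = {}
        self.synchronized = True

    def write(self, block: int, data: bytes) -> None:
        """Write to the primary; mirror to the secondary while synchronized."""
        self.primary[block] = data
        if self.synchronized:
            # The replication function cannot tell valid data from corrupted data.
            self.secondary[block] = data


def full_backup_hours(capacity_gb: float, throughput_mb_per_s: float) -> float:
    """Rough duration of one full copy, ignoring all overheads (1 GB = 1000 MB)."""
    return capacity_gb * 1000.0 / throughput_mb_per_s / 3600.0


pair = MirroredPair()
pair.write(0, b"valid record")
pair.write(0, b"\x00\x00 corrupted")      # e.g. a write issued by a runaway application
secondary_is_corrupted = pair.secondary[0] == b"\x00\x00 corrupted"

# Assumed figures only: 2000 GB copied at a sustained 100 MB/s.
hours = full_backup_hours(capacity_gb=2000, throughput_mb_per_s=100)
```

In the first part the secondary VOL ends up holding the corrupted block, because the mirror faithfully repeats whatever is written to the primary. In the second part the assumed figures give a copy time of several hours, which is consistent with the difficulty of frequent full backups noted above.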
The prior methods described above have the following problems. Consider active and backup computer systems that share primary and secondary VOLs subject to replication between them (pair configuration/pair split). When a system failure occurs and the stand-by system takes over the processing of the active system, if the failure in the active computer system involves data corruption, the data to be taken over is itself at risk of corruption, so that the failover fails. That is, in the failover system, the stand-by system takes over the processing and protects the data required for the failover only upon detecting the occurrence of a failure in the active computer system. If the failure involves data corruption, the failover therefore takes effect only after the data has been corrupted. Once the data has been corrupted, valid data has to be recovered from the backup. However, as the amount of data continues to increase, the interval between complete backups is set longer, so that the time taken to recover the valid data as of immediately before the failure becomes enormously long.
As can be seen from the foregoing, the methods employed heretofore have a problem in that, when a cluster system is used for the purpose of increased reliability, an enormous amount of time is required to recover the data together with a valid system in the case of a failure that involves data corruption.