1. The Field of the Invention
The present invention relates to data storage associated with computers and data processing systems. Specifically, the present invention relates to methods used to recover from a computer failure in a system having a plurality of computer systems, each with its own mass storage device.
2. The Prior State of the Art
Computer networks have greatly enhanced mankind's ability to process and exchange data. Unfortunately, on occasion, computers partially or completely lose the ability to function properly in what is termed a "crash" or "failure". Computer failures may have numerous causes such as power loss, computer component damage, computer component disconnect, software failure, or interrupt conflict. Such computer failures can be quite costly as computers have become an integral part of most business operations. In some instances, computers have become such an integral part of business that when the computers crash, business operation cannot be conducted.
Almost all larger businesses rely on computer networks to store, manipulate, and display information that is constantly subject to change. The success or failure of an important transaction may turn on the availability of information which is both accurate and current. In certain cases, the credibility of the service provider, or its very existence, depends on the reliability of the information maintained on a computer network. Accordingly, businesses worldwide recognize the commercial value of their data and are seeking reliable, cost-effective ways to protect the information stored on their computer networks. In the United States, federal banking regulations also require that banks take steps to protect critical data.
One system for protecting this critical data is a data mirroring system. Specifically, the mass memory of a secondary backup computer system is made to mirror the mass memory of the primary computer system. Write requests executed in the primary mass memory device are also transmitted to the backup computer system for execution in the backup mass memory device. Thus, under ideal circumstances, if the primary computer system crashes, the backup computer system may begin operation and be connected to the user through the network. The user then has access to the same files through the backup computer system on the backup mass memory device as the user had through the primary computer system.
However, the primary computer system might crash after a write request is executed on the primary mass memory device, but before the request is fully transmitted to the backup computer system. In this case, a write request has been executed on the primary mass memory device without being executed on the backup mass memory device. Thus, synchronization between the primary and backup mass memory devices is lost. In other words, the primary and backup mass memory devices are not perfectly mirrored, but are slightly different at the time of the crash.
To illustrate the impact of this loss in synchronization, assume that the primary and backup mass memory devices store identical bank account balances. A customer then deposits money into an account and, shortly thereafter, changes his mind and withdraws the same money from the account. The primary computer system crashes just after the account balance in the primary mass memory device is altered to reflect the deposit, but before the write request reflecting the deposit is transferred to the backup computer system. Thus, the account balance in the backup mass memory device does not reflect the deposit. When the customer makes the withdrawal, the backup computer system is in operation, so the account balance in the backup mass memory device is altered to reflect the withdrawal. When the primary computer system is brought back into operation, the account balance from the backup mass memory device is written over the account balance in the primary mass memory device. Thus, the final account balance reflects the withdrawal, but does not reflect the deposit.
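The bank account scenario above can be traced with a short simulation. This is only an illustrative sketch: the `Store` class, the account name, the starting balance of 100, and the amount of 50 are all hypothetical values chosen for the example and do not come from the description.

```python
class Store:
    """A toy mass memory device: maps an account name to a balance."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def apply(self, account, delta):
        # Execute a write request that adjusts the balance by delta.
        self.balances[account] += delta


def run_scenario():
    # Both devices start out perfectly mirrored (hypothetical balance).
    primary = Store({"acct": 100})
    backup = Store({"acct": 100})

    # Deposit of 50: executed on the primary, which then crashes
    # before the write request is transmitted to the backup.
    primary.apply("acct", +50)

    # The backup takes over and executes the withdrawal of 50.
    backup.apply("acct", -50)

    # The primary returns and a complete remirror overwrites its
    # contents with the contents of the backup.
    primary.balances = dict(backup.balances)
    return primary.balances["acct"]


final_balance = run_scenario()
print(final_balance)
```

In this sketch the customer deposited and withdrew the same amount, so the balance should have returned to its starting value, yet the remirrored balance is lower by the amount of the lost deposit.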
Another disadvantage of this system is that when the primary computer system is brought back into operation, the entire backup mass storage device is copied back to the primary mass storage device in what is termed a "remirror". Copying such large amounts of data can take a significant amount of time and can disrupt transactional operations.
Therefore, a backup computer system and method are desired that do not result in the above-described loss of synchronization, and that do not require a complete remirror.
In accordance with the present invention, a method and system are provided in which data from a primary computer system is mirrored in a secondary backup computer system. This system maintains complete synchronization between the primary and backup memory devices even should the primary computer system fail after a write request was executed in the memory of the primary computer system, but before the request is fully transmitted to the backup computer system.
For each write request, a copy of the request is written into a delay buffer associated with the primary computer system, and a copy is transmitted to the backup computer system. After the write request has been fully transmitted to the backup computer system, the backup computer system informs the primary computer system (e.g., by sending an acknowledgement signal) that the request has been received at the backup computer system. The write request in the delay buffer of the primary computer system is executed only after the primary computer system receives the acknowledgement signal indicating that the backup computer system also received a copy of the write request. Thus, if the primary computer system fails before a copy of the write request is transmitted to the backup computer system, the primary computer system will not have executed the write request since the write request was left unexecuted in the delay buffer. Therefore, synchronization is not lost between the primary and backup computer systems.
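The delay buffer behavior described above may be sketched as follows. This is a minimal single-process illustration, not the actual implementation: the `Primary` and `Backup` classes, the `crash_before_transmit` flag, and the use of a return value to stand in for the acknowledgement signal are assumptions made for the example. The sketch also assumes that the backup executes a request as soon as it has been fully received, which the passage above does not specify.

```python
from collections import deque


class Backup:
    def __init__(self):
        self.storage = {}

    def receive(self, request):
        # Assumption: the backup executes the request on receipt.
        key, value = request
        self.storage[key] = value
        return True  # stands in for the acknowledgement signal


class Primary:
    def __init__(self, backup):
        self.storage = {}
        self.delay_buffer = deque()  # requests held unexecuted
        self.backup = backup

    def write(self, key, value, crash_before_transmit=False):
        request = (key, value)
        self.delay_buffer.append(request)
        if crash_before_transmit:
            # Simulated failure: nothing is sent and nothing is executed.
            return False
        acked = self.backup.receive(request)
        if acked:
            # Execute on the primary only after the acknowledgement.
            self.delay_buffer.remove(request)
            self.storage[key] = value
        return acked


backup = Backup()
primary = Primary(backup)
primary.write("balance", 100)
primary.write("balance", 150, crash_before_transmit=True)
print(primary.storage, backup.storage)
```

After the simulated failure, the second request remains unexecuted in the delay buffer, and the two storage dictionaries still hold identical contents.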
Another advantage of this invention is that complete remirroring (i.e., recopying) of data from the backup computer system to the primary computer system is not needed when the primary computer system is brought back into operation after a failure. Both the primary and backup computer systems have a memory queue to which a copy of the write request is forwarded. When the primary computer system determines that the write request has been executed in the memory device of the backup computer system, the primary computer system deletes that request from its memory queue. Likewise, when the backup computer system determines that the primary computer system has executed the write request, the backup computer system deletes the write request from its memory queue. Thus, the memory queue includes write requests which have been generated, but which are not confirmed to have been executed by the opposite computer system.
Should the opposite computer system experience a failure, the memory queue will accumulate all the write requests that need to be executed within the failed computer system to once again mirror the memory of the operational computer system. Only the write requests in the memory queue, rather than the entire memory, are forwarded to the failed computer system once it becomes operational. Thus, complete remirroring is avoided.
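The memory queue mechanism described above may be illustrated with the following sketch. It is a simplified model under stated assumptions: the `Node` class and the `submit` and `resync` functions are hypothetical names, confirmation of execution by the opposite system is modeled as a direct function call rather than a network message, and the delay buffer is omitted for brevity.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.storage = {}
        # Requests not yet confirmed as executed by the opposite system.
        self.memory_queue = []
        self.up = True

    def execute(self, request):
        key, value = request
        self.storage[key] = value


def submit(request, local, peer):
    """Execute a write locally and forward it to the peer if it is up."""
    local.memory_queue.append(request)
    local.execute(request)
    if peer.up:
        peer.execute(request)
        # Peer execution is confirmed, so the request leaves the queue.
        local.memory_queue.remove(request)


def resync(local, peer):
    """Replay only the queued requests to a recovered peer."""
    peer.up = True
    replayed = 0
    while local.memory_queue:
        peer.execute(local.memory_queue.pop(0))
        replayed += 1
    return replayed


primary, backup = Node("primary"), Node("backup")
submit(("a", 1), primary, backup)   # both systems operational
primary.up = False                  # the primary fails
submit(("b", 2), backup, primary)   # queued on the backup
submit(("a", 3), backup, primary)   # queued on the backup
replayed = resync(backup, primary)  # the primary returns
print(replayed, primary.storage == backup.storage)
```

Only the two requests issued during the outage are forwarded to the recovered system. The entry written before the failure is never recopied, which is the sense in which complete remirroring is avoided.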
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other objects and features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.