The present invention relates generally to data processing storage systems comprising a local storage facility and two or more remote storage facilities that mirror at least certain of the data retained by the local storage facility. More particularly, the invention relates to a method, and apparatus implementing that method, to synchronize the data at surviving storage facilities in the event of failure of one of them.
The use of data processing over the years by commercial, military, governmental and other endeavors has resulted in tremendous amounts of data being stored, much of it virtually priceless because of its importance. Businesses, for example, risk collapse should their data be lost. For this reason alone, local data is backed up to one or more copies, which are retained for use should the original data be corrupted or lost. The more important the data, the more elaborate the methods of backup. For example, one approach to protecting sensitive or valuable data is to store backup copies of that data at one or more sites that are geographically remote from the local storage facility. Each remote storage facility maintains a mirror image of the data held by the local storage facility, and changes (e.g., writes, deletions, etc.) to the local data image of the local storage facility are transferred to and also effected at each of the remote storage facilities so that the mirroring of the local data image is maintained. An example of a remote storage system for mirroring data at a local storage system is shown by U.S. Pat. No. 5,933,653.
Updates sent to the remote storage facilities are often queued and sent as a group to keep the overhead of remote copying operations at a minimum. Also, the transmission medium often used is an Internet connection or similar. For these reasons, the data images mirroring the local data will, at times, not be the same. If more than one remote storage facility is used to mirror the local data, there often will be times when the data images of the remote storage facilities differ from one another, at least until updated by the local storage facility. These interludes of different data images can be a problem if the local facility fails, leaving only the remote storage facilities. Failure of the local storage facility can leave some remote storage facilities with data images that more closely, if not exactly, mirror that of the local storage facility before failure, while others have older “stale” data images that were never completely updated by the last update operation. Thus, failure of the local storage facility may require the remote storage facilities to re-synchronize the data between them so that all have the same and latest data image before the system is restarted. There are several approaches to data synchronization.
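The divergence described above can be illustrated with a minimal sketch. The class `Remote`, the per-update sequence numbers, and the facility names are hypothetical and are introduced only for illustration; the text does not say how a facility's image is judged to be the most recent.

```python
from dataclasses import dataclass, field


@dataclass
class Remote:
    """Hypothetical remote storage facility holding a mirror image."""
    name: str
    image: dict = field(default_factory=dict)
    applied_seq: int = 0  # assumed: sequence number of the last applied update

    def apply(self, batch):
        # Apply a queued group of (sequence, key, value) updates in order.
        for seq, key, value in batch:
            self.image[key] = value
            self.applied_seq = seq


# Updates queued at the local facility and sent out as groups.
updates = [(1, "a", "v1"), (2, "b", "v1"), (3, "a", "v2"), (4, "c", "v1")]

r1, r2 = Remote("remote-1"), Remote("remote-2")
r1.apply(updates)      # remote-1 received every group
r2.apply(updates[:2])  # local facility failed before remote-2 got the rest

images_differ = r1.image != r2.image
newest = max((r1, r2), key=lambda r: r.applied_seq)
print(images_differ, newest.name)
```

Here remote-2 is left with a stale image, so the survivors must be re-synchronized before the system can restart.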
If removable media (e.g., tape, CD-R, DVD, etc.) are used at the local and remote storage facilities, such removable media can be used for synchronization. For example, a system administrator will copy data at the remote storage facility believed to have the most up-to-date data image of the local facility to a tape. Then, in order to keep the data image from changing before it is used to synchronize the other remote storage facilities, input/output (I/O) operations at the image-donating facility are halted until the tape can be circulated to update the other remote storage facilities. At each of the other remote storage facilities, the administrator copies the data from the removable media to storage at that site. Then, the system administrator re-configures the entire system so that one of the formerly remote storage facilities is now the new local storage facility, and its I/O operations are allowed to commence. This approach is efficient when the amount of data involved is small, but not so for larger systems. Larger systems will produce data that grows rapidly, requiring what could be an inordinate amount of time to copy for the entire synchronization process.
Lacking removable media, another approach would be to use any network connections between the various storage facilities to communicate data. This approach requires that one storage facility be selected to replace the former local (but now failed) storage facility. I/O operations at the selected storage facility are halted, for the same reasons stated above, and a re-synchronization copy process is initiated between the selected storage facility and the other remote storage facilities. When the re-synchronization process is complete, I/O operations are restarted at the selected storage facility, and the system proceeds as before, albeit with one less storage facility (the failed former local storage facility).
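The network-based procedure can be sketched as follows. This is an illustrative outline only: the `Facility` class, the `io_enabled` flag, and the in-memory dictionary copy that stands in for the network transfer are all assumptions, not part of any described system.

```python
class Facility:
    """Hypothetical storage facility with a data image and an I/O switch."""

    def __init__(self, name, image):
        self.name = name
        self.image = dict(image)
        self.io_enabled = True


def resync_over_network(selected, others):
    # Halt I/O at the selected facility so its image cannot change mid-copy.
    selected.io_enabled = False
    try:
        for other in others:
            # A full copy of the image; stands in for the network transfer.
            other.image = dict(selected.image)
    finally:
        # Restart I/O once the copy is done (or has failed).
        selected.io_enabled = True
    return selected  # the selected facility now acts as the local facility


donor = Facility("remote-1", {"a": "v2", "b": "v1", "c": "v1"})
stale = Facility("remote-2", {"a": "v1", "b": "v1"})
new_local = resync_over_network(donor, [stale])
print(new_local.name, stale.image == donor.image)
```

Note that the selected facility accepts no I/O for the whole duration of the copy, which is the cost this approach carries.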
A major problem with this latter approach is the time needed for the re-synchronization process, particularly for larger amounts of data. For example, copying all the data of a 100 terabyte (TB) storage over a 100 MB/s network connection will take approximately 11.57 days (100×10^12 / (100×10^6) = 10^6 sec ≈ 277 hours ≈ 11.57 days). This is the time for re-synchronization of just one storage facility. If re-synchronization is to be performed for more than one storage facility, the problem is exacerbated. Also, during the re-synchronization process, I/O operations of the storage facilities involved are halted.
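The transfer-time estimate above can be reproduced with a short calculation. The function name is hypothetical, and decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) are assumed, consistent with the figures in the text.

```python
def full_copy_seconds(data_bytes, link_bytes_per_sec):
    """Time to copy all the data once over a link of the given throughput."""
    return data_bytes / link_bytes_per_sec


TB = 10**12  # decimal terabyte, as the example's figures imply
MB = 10**6   # decimal megabyte

seconds = full_copy_seconds(100 * TB, 100 * MB)
hours = seconds / 3600
days = hours / 24
print(f"{seconds:.0f} s = {hours:.1f} h = {days:.2f} days")
# prints "1000000 s = 277.8 h = 11.57 days"
```

The estimate ignores protocol overhead and contention, so a real transfer would likely take longer.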