Information drives business. Companies today rely to an unprecedented extent on online, frequently accessed, constantly changing data to run their businesses. Unplanned events that inhibit the availability of this data can seriously damage business operations. Additionally, any permanent data loss, whether from natural disaster or any other source, will likely have serious negative consequences for the continued viability of a business. Therefore, when disaster strikes, companies must be prepared to eliminate or minimize data loss and to recover quickly with usable data.
Replication technology is primarily used for disaster recovery and data distribution. Periodic replication is one technique used to minimize data loss and improve data availability: a point-in-time copy of the data is replicated and stored at one or more remote sites or nodes. In the event of a site migration, failure of one or more physical disks storing data, or failure of a node or host data processing system associated with such a disk, the remote replicated data copy may be used instead. In addition to disaster recovery, the replicated data enables a number of other uses, such as data mining, reporting, and testing. In this manner, the replicated data copy helps ensure data integrity and availability. Additionally, periodic replication technology is frequently coupled with other high-availability techniques, such as clustering, to provide an extremely robust data storage solution.
Performing a replication operation, backup operation, or the like on a large data set may take a significant amount of time to complete simply because of the size of the data set. If the data set remains live during this time, intervening accesses to the data set have to be addressed. For example, on a large enterprise class system, there may be thousands of writes to the data set while it is being backed up or replicated. Unless these writes are handled, the resulting copy can mix old and new data, which creates data corruption hazards.
One approach to safely backing up live data is to temporarily disable write access during the backup, for example, by using a locking API provided by the file system to enforce exclusive, read-only access. Such an approach might be tolerable for low-availability systems (e.g., desktop computers and small workgroup servers, where regular downtime is acceptable). Enterprise class high-availability 24/7 systems, however, cannot bear service stoppages.
A snapshot, or checkpoint, operation is often used to avoid imposing downtime. Rather than blocking writes, a high-availability system may perform the replication or backup on a snapshot, which is essentially a read-only copy of the data set frozen at a point in time, while applications continue writing to their data. Thus the term snapshot refers to the data as it was at a particular point in the past.
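The idea of a frozen, read-only image that coexists with ongoing writes can be illustrated with a minimal copy-on-write sketch. Copy-on-write is only one common way to implement snapshots and is assumed here for illustration; the text above does not prescribe a mechanism. The names `CowVolume` and `CowSnapshot` are hypothetical, and the volume is a simple in-memory list of blocks.

```python
class CowSnapshot:
    """Read-only view of a volume as it was when the snapshot was taken."""

    def __init__(self, volume):
        self._volume = volume
        self._saved = {}  # block index -> contents at snapshot time

    def preserve(self, index, old_data):
        # Keep only the first saved copy: that is the point-in-time content.
        self._saved.setdefault(index, old_data)

    def read(self, index):
        # Blocks overwritten since the snapshot come from the saved copies;
        # untouched blocks are still shared with the live volume.
        if index in self._saved:
            return self._saved[index]
        return self._volume.read(index)


class CowVolume:
    """Toy in-memory volume of fixed-size blocks (illustrative only)."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.snapshots = []

    def snapshot(self):
        snap = CowSnapshot(self)
        self.snapshots.append(snap)
        return snap

    def read(self, index):
        return self.blocks[index]

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        # Copy-on-write: save the old contents for every snapshot first.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = bytes(data)


vol = CowVolume(num_blocks=2)
vol.write(0, b"AAAA")
snap = vol.snapshot()   # freeze the current contents
vol.write(0, b"BBBB")   # the application keeps writing
print(snap.read(0), vol.read(0))
```

In this sketch a backup or replication job would read only through `snap`, so writes that arrive during the copy never change what it sees.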
Data storage required for applications such as file systems and databases is typically allocated from one or more storage devices that are maintained as a “volume”. The volume may serve as a logical interface used by an operating system to access data stored on one or more storage media using a single instance of a file system. Thus, a volume may act as an abstraction that essentially “hides” storage allocation and (optionally) data protection/redundancy from the application. An application can store its data on multiple volumes. The content of a volume is accessed using fixed-size data units called blocks.
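As a rough illustration of how a volume can hide storage allocation behind a block interface, the following sketch concatenates several devices into one logical block address space. It assumes simple concatenation with no redundancy; the class name `ConcatVolume` and its methods are hypothetical and do not correspond to any particular volume manager.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatVolume:
    """Presents several devices as one run of fixed-size logical blocks."""

    def __init__(self, device_sizes, block_size=512):
        if not device_sizes or any(n <= 0 for n in device_sizes):
            raise ValueError("each device needs at least one block")
        self.block_size = block_size
        # Each device is modeled as an in-memory list of zero-filled blocks.
        self._devices = [[bytes(block_size)] * n for n in device_sizes]
        # _starts[i] is the first logical block number stored on device i.
        self._starts = [0] + list(accumulate(device_sizes))[:-1]
        self.num_blocks = sum(device_sizes)

    def locate(self, block):
        """Map a logical block number to (device index, block on that device)."""
        if not 0 <= block < self.num_blocks:
            raise IndexError("block number out of range")
        dev = bisect_right(self._starts, block) - 1
        return dev, block - self._starts[dev]

    def read_block(self, block):
        dev, offset = self.locate(block)
        return self._devices[dev][offset]

    def write_block(self, block, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        dev, offset = self.locate(block)
        self._devices[dev][offset] = bytes(data)


vol = ConcatVolume([3, 2], block_size=4)
vol.write_block(3, b"DATA")   # the caller never names a device
print(vol.locate(3), vol.read_block(3))
```

The application addresses blocks 0 through 4 only; which device holds a given block is decided inside `locate`.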
Applications such as file systems and databases cannot be mounted on the replica volumes while these volumes are being synchronized, because the synchronization process changes the volume blocks without the knowledge of the applications. If the data read into memory by applications becomes inconsistent with the on-disk image updated by the synchronization process, the applications will treat these volumes as corrupted. If the replica volumes are writable, the application and the synchronization process can update the same block independently, which leads to real data corruption. For this reason, applications are mounted on frozen images (i.e., snapshots) of the replica volumes.
Traditionally, the applications on the secondary site have to wait for the replica to be fully synchronized to the secondary site before using the replica. One way of implementing periodic replication is to take snapshots of the volumes periodically on the primary site and replicate these snapshots to the secondary site. Only when a snapshot is fully replicated can the applications be mounted on it. For very large replicas, the time lag can be significant enough to leave resources idle and delay services at the secondary site, possibly losing revenue opportunities. Therefore, it is highly desirable to have a method that allows applications to be mounted on the replicated snapshot volumes as quickly as possible, even if these snapshot volumes are not fully replicated. What is needed is a method to reduce the time required to make snapshot volumes available to secondary sites.
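The traditional behavior described above, where the secondary site cannot mount a replica until the whole snapshot has arrived, can be sketched as follows. This is a simplified model of one replication cycle under stated assumptions: blocks are copied one at a time in order, both sites are in-memory lists, and the names `ReplicaSnapshot` and `replicate` are hypothetical.

```python
class ReplicaSnapshot:
    """Secondary-site copy of one primary snapshot."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks
        self.complete = False

    def mount(self):
        # Traditional rule: no application access until fully synchronized.
        if not self.complete:
            raise RuntimeError("replica snapshot is not fully synchronized")
        return tuple(self.blocks)


def replicate(primary_blocks, replica):
    """Freeze the primary's blocks now, then return an iterator of copy steps.

    Each step transfers one block and yields the number copied so far.
    """
    frozen = tuple(primary_blocks)  # point-in-time image of the primary
    if len(frozen) != len(replica.blocks):
        raise ValueError("replica size does not match the snapshot")

    def copy_steps():
        for index, block in enumerate(frozen):
            replica.blocks[index] = block
            yield index + 1
        replica.complete = True

    return copy_steps()


primary = [b"a", b"b", b"c"]
replica = ReplicaSnapshot(len(primary))
steps = replicate(primary, replica)
next(steps)          # one block transferred; mount() would still raise here
primary[2] = b"z"    # a later write on the primary does not affect the copy
for _ in steps:      # finish the transfer
    pass
print(replica.mount())
```

The waiting period between the first step and the last is the time lag the paragraph above identifies; the larger the snapshot, the longer `mount` remains unavailable.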