Information drives business. Companies today rely to an unprecedented extent on online, frequently accessed, constantly changing data to run their businesses. Unplanned events that inhibit the availability of this data can seriously damage business operations. Additionally, any permanent data loss, whether from natural disaster or any other source, will likely have serious negative consequences for the continued viability of a business. Therefore, when disaster strikes, companies must be prepared to eliminate or minimize data loss and recover quickly with usable data.
Periodic replication is one technique used to minimize data loss and improve the availability of data: a replicated copy of the data is periodically distributed to and stored at one or more remote sites or nodes. In the event of a site migration, failure of one or more physical disks storing data, or failure of a node or host data processing system associated with such a disk, the remote replicated copy may be used. In this manner, the replicated copy helps ensure data integrity and availability. Periodic replication is frequently coupled with other high-availability techniques, such as clustering, to provide an extremely robust data storage solution.
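The core idea can be sketched in a few lines. The block-map model and all names below are hypothetical illustrations, not any particular product's API; a real replicator would ship changed blocks over the network rather than into a local dictionary.

```python
class PeriodicReplicator:
    """Toy periodic replication: track blocks written since the last
    cycle and push only those to a remote replica (modeled as a dict)."""

    def __init__(self, primary, replica):
        self.primary = primary    # local block map: block number -> bytes
        self.replica = replica    # stand-in for storage at a remote node
        self.dirty = set()        # blocks modified since the last cycle

    def write(self, blk, data):
        self.primary[blk] = data
        self.dirty.add(blk)

    def replicate_once(self):
        # In a real system this would run on a timer (e.g., every N minutes).
        for blk in list(self.dirty):
            self.replica[blk] = self.primary[blk]
        self.dirty.clear()
```

After a replication cycle, the replica holds a copy of every block modified since the previous cycle, so if the primary site fails, the remote copy can be used.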
Performing a replication operation, backup operation, or the like on a large data set may take a significant amount of time to complete, simply because of the data set's size. If the data set is kept live during this window, intervening accesses must be addressed: on a large enterprise-class system, for example, there may be thousands of writes to the data set while it is being backed up or replicated. This creates data corruption hazards.
One approach to safely backing up live data is to temporarily disable write access during the backup, for example by using a locking API provided by the file system to lock out writers. Such an approach might be tolerable for low-availability systems (e.g., desktop computers and small workgroup servers, where regular downtime is acceptable). Enterprise-class, high-availability 24/7 systems, however, cannot tolerate service stoppages.
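As a minimal sketch of this approach, the following uses POSIX advisory locks via Python's `fcntl` module. It assumes that all writers cooperate by taking the same lock; it is not any specific file system's locking API, and the function name is hypothetical.

```python
import fcntl
import shutil

def backup_with_write_lock(src_path, dst_path):
    """Copy src_path to dst_path while holding an exclusive advisory
    lock, so cooperating writers block until the backup completes."""
    with open(src_path, "rb") as src:
        # Writers that take the same advisory lock now wait on us.
        fcntl.flock(src, fcntl.LOCK_EX)
        try:
            shutil.copyfile(src_path, dst_path)
        finally:
            fcntl.flock(src, fcntl.LOCK_UN)
```

Because the lock is advisory, any writer that skips the locking protocol can still corrupt the backup; mandatory exclusion requires file-system support.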
To avoid imposing downtime, a high-availability system may instead perform the replication or backup on a snapshot, also called a checkpoint: essentially a read-only copy of the data set frozen at a point in time, taken while applications continue writing to their data. Thus the term snapshot refers to a copy of a set of files and directories as they were at a particular point in the past.
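These semantics can be illustrated with a toy in-memory data set. The naive full copy below is for clarity only; real snapshot implementations avoid it (typically via copy-on-write, discussed next), and all names here are hypothetical.

```python
import copy

class DataSet:
    def __init__(self):
        self.blocks = {}              # block number -> bytes

    def write(self, blk, data):
        self.blocks[blk] = data

    def snapshot(self):
        # Freeze the current state. A naive full copy is used here for
        # clarity; production systems use copy-on-write instead.
        return copy.deepcopy(self.blocks)

live = DataSet()
live.write(0, b"v1")
frozen = live.snapshot()
live.write(0, b"v2")   # applications keep writing to the live data
# frozen[0] still holds b"v1": the snapshot reflects the past point in time
```

A backup or replication job reads only `frozen`, so it sees a consistent image no matter how many writes land on `live` in the meantime.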
When a file-system checkpoint is taken, subsequent updates to the file system incur copy-on-write (COW) overheads, which are costly. COW overhead results from applications being allowed to continue writing to their data, as described above: before a region is overwritten for the first time under a checkpoint, its original contents must be copied aside to preserve the checkpoint's frozen image. These overheads may be acceptable when checkpoints are short-lived; such checkpoints are deleted after some time, so file-system performance suffers only for a short, acceptable duration. Even if a checkpoint is long-lived, only the first write to each region under that checkpoint incurs COW overhead. However, many applications and uses require creating a new checkpoint every 15-30 minutes. In such a scenario, the production file system operates under a checkpoint for virtually its entire life, and under a fresh checkpoint every 30 minutes. Thus, if a region is updated, for example, every 30 minutes, each update incurs a new COW overhead each time. For these reasons, many conventional file systems are significantly impacted when performing frequent replications.
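This repeated-overhead pattern can be sketched with a toy copy-on-write volume. The class and counter below are hypothetical illustrations, not a real file system's implementation.

```python
class COWVolume:
    def __init__(self):
        self.blocks = {}      # live data: block number -> bytes
        self.saved = None     # originals preserved for the current checkpoint
        self.cow_copies = 0   # count of copy-aside operations performed

    def take_checkpoint(self):
        # A new checkpoint starts with no regions preserved, so the next
        # write to ANY region pays the copy-aside cost again.
        self.saved = {}

    def write(self, blk, data):
        if self.saved is not None and blk not in self.saved:
            # First write to this region under the current checkpoint:
            # copy the original contents aside (the COW overhead).
            self.saved[blk] = self.blocks.get(blk)
            self.cow_copies += 1
        self.blocks[blk] = data   # later writes to blk are overhead-free

vol = COWVolume()
vol.write(7, b"v0")       # no checkpoint yet: no overhead
vol.take_checkpoint()
vol.write(7, b"v1")       # first write under this checkpoint: one COW copy
vol.write(7, b"v2")       # same region, same checkpoint: no copy
vol.take_checkpoint()     # e.g., 30 minutes later
vol.write(7, b"v3")       # same region, new checkpoint: another COW copy
```

Under a single long-lived checkpoint, block 7 pays the copy once; with a fresh checkpoint every interval, it pays the copy every interval, which is exactly why frequent replication hurts conventional file systems.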