Applications access data that is stored on a persistent storage device, such as a disk drive or an array of disk drives. Redundant data, such as a duplicate copy of the application data or parity associated with the application data, is often maintained in order to provide improved availability and/or performance. For example, a set of parity information that can be used to recover application data within a volume is often maintained as part of the volume. The parity information can be maintained according to one of several different Redundant Array of Independent Disks (RAID) techniques. For example, RAID 5 arrays compute parity over an application-specific block size, called an interleave or stripe unit, which is a fixed-size data region that is accessed contiguously. All stripe units in the same stripe (i.e., all stripe units at the same depth or altitude on each drive) are used to compute a respective parity value. RAID 5 rotates the storage of the parity values across all drives in the set.
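As an illustrative sketch only, the parity relationship described above can be expressed with a bytewise exclusive-OR (XOR), which is the operation commonly used for RAID 5 parity. The function names `xor_blocks` and `parity_drive`, the four-byte stripe units, and the particular rotation order are assumptions chosen for illustration and are not taken from any specific implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_drive(stripe_index, num_drives):
    """Pick the drive holding parity for a stripe.

    One possible rotation (an assumption): parity starts on the last
    drive and moves back one drive per stripe, wrapping around.
    """
    return (num_drives - 1 - stripe_index) % num_drives


# One stripe across a hypothetical 4-drive set:
# three data stripe units plus one parity stripe unit.
data_units = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]
parity = xor_blocks(data_units)

# Simulate losing the second stripe unit and rebuilding it
# from the surviving stripe units and the parity value.
rebuilt = xor_blocks([data_units[0], data_units[2], parity])

# Parity placement for the first five stripes of the 4-drive set.
placement = [parity_drive(s, 4) for s in range(5)]
```

Because XOR is its own inverse, any single missing stripe unit in a stripe can be recomputed from the remaining stripe units and the parity value, which is why one parity value per stripe suffices to tolerate the loss of one drive.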
Another example of redundant data is a duplicate copy. Duplicate copies of the same data are often stored in the form of mirrors (e.g., according to RAID 1 techniques). While multiple copies of the data are available, an application's accesses to the data can be interleaved across multiple copies, providing increased access performance. If one copy of the data is damaged, an application can continue to access the data in one of the other copies and at the same time recover from the loss of redundancy.
A duplicate copy of the data can also be maintained at a remote site, which is typically geographically separated from the location of the data, in order to protect against data loss due to a disaster at one site. Such a duplicate copy is referred to as a replica. If the data at one site is corrupted or lost, the copy of the data at the other site can be used. Hence, redundant data (e.g., in the form of a duplicate copy) can be located at the same site as the primary data or at a different site from the primary data.
When redundant data is created, the redundant data needs to be synchronized with the original application data in the volume. This process is called initial synchronization. The redundant data is considered to be synchronized with the original when the redundant data provides either a full copy of a valid state of the original data volume or information (such as parity) that can be used to recover the valid state of the original data volume at a given point in time. Often, redundant data is created after an application has already begun using the original volume. The redundant data can be synchronized by accessing data from a backup or other point-in-time copy of the original volume, or by accessing data directly from the original volume itself.
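The following minimal sketch illustrates initial synchronization of a duplicate copy from a point-in-time copy. Volumes are modeled as in-memory lists of blocks, and the names `initial_sync_copy` and `is_synchronized` are hypothetical helpers introduced only for this illustration.

```python
def initial_sync_copy(source_blocks):
    """Build a duplicate copy by reading every block of the source."""
    return [bytes(block) for block in source_blocks]


def is_synchronized(copy_blocks, reference_blocks):
    """A duplicate copy is synchronized if it matches the reference state."""
    return copy_blocks == reference_blocks


# The original volume is already in use by an application.
original = [b"aaaa", b"bbbb", b"cccc"]

# A point-in-time copy (e.g., a backup) captures a valid state of the volume.
snapshot = list(original)

# The application keeps writing to the original after the snapshot is taken.
original[1] = b"BBBB"

# Initial synchronization from the point-in-time copy.
replica = initial_sync_copy(snapshot)

matches_snapshot = is_synchronized(replica, snapshot)
matches_current = is_synchronized(replica, original)
```

In this sketch the replica reflects a valid state of the original volume as of the time the snapshot was taken, not the current contents of the volume. Writes made after that point would still have to be applied by whatever process maintains synchronization afterwards.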
After the initial synchronization, a process operates to maintain synchronization between the redundant data and the original. For example, if the redundant data is a replica (i.e., a duplicate copy maintained at a remote location), a replication process tracks application writes to the original and routinely applies these application writes to the replica. Similarly, if the redundant data is a mirror, a process ensures that a write to the original does not complete until the write has also been applied to the mirror. If the redundant copy is a set of parity information (e.g., a parity column in RAID 5), a process ensures that a write to the original does not complete until an appropriate parity value within the set of parity information has been recomputed.
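The three maintenance behaviors described above can be sketched as follows. This is a simplified, single-threaded model under stated assumptions: the class names `MirroredVolume`, `ReplicatedVolume`, and `ParityStripe` are hypothetical, volumes are in-memory lists, and the parity update uses the common read-modify-write identity (new parity = old parity XOR old data XOR new data) rather than any method confirmed by the text.

```python
from functools import reduce


def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class MirroredVolume:
    """A write completes only after it is applied to original and mirror."""

    def __init__(self, num_blocks, block_size=4):
        self.original = [bytes(block_size)] * num_blocks
        self.mirror = list(self.original)

    def write(self, index, data):
        self.original[index] = data
        self.mirror[index] = data
        return True  # completion is reported only after both updates


class ReplicatedVolume:
    """Writes are tracked and routinely applied to the remote replica."""

    def __init__(self, num_blocks, block_size=4):
        self.original = [bytes(block_size)] * num_blocks
        self.replica = list(self.original)
        self.pending = []  # tracked writes not yet applied to the replica

    def write(self, index, data):
        self.original[index] = data
        self.pending.append((index, data))

    def apply_pending(self):
        for index, data in self.pending:
            self.replica[index] = data
        self.pending.clear()


class ParityStripe:
    """A write completes only after the stripe's parity is recomputed."""

    def __init__(self, units):
        self.units = list(units)
        self.parity = reduce(xor, self.units)

    def write(self, index, data):
        old = self.units[index]
        # Read-modify-write: remove the old data from parity, add the new.
        self.parity = xor(xor(self.parity, old), data)
        self.units[index] = data
        return True


mirrored = MirroredVolume(num_blocks=2)
mirrored.write(0, b"data")

replicated = ReplicatedVolume(num_blocks=2)
replicated.write(1, b"data")
lagging = replicated.replica != replicated.original
replicated.apply_pending()

stripe = ParityStripe([b"\x01\x01", b"\x02\x02", b"\x04\x04"])
stripe.write(1, b"\x0f\x0f")
```

The sketch highlights the difference in timing: the mirror and the parity are brought up to date before the write is reported complete, whereas the replica may briefly lag the original until the tracked writes are applied.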
The initial synchronization process typically consumes a large amount of time and/or resources. For example, when a replica is created at a remote site, a tape backup of the original is transported to the remote site and then copied to the replica. Due to transportation delays, it may take several days before the replica is initially synchronized with respect to the primary. Alternatively, if the data is copied to the replica via a network, the initial synchronization can consume an enormous amount of network capacity and time. Initial synchronization of RAID volumes requires additional I/O and CPU cycles in order to calculate parity values. It is desirable to be able to reduce the amount of time and/or resources required to initially synchronize redundant data with an original volume.