Applications access data that is stored on a persistent storage device, such as a disk drive or an array of disk drives. Redundant data is often maintained in order to provide improved availability and/or performance. For example, multiple identical copies (also called mirrors or plexes) of the same data are often maintained. While multiple copies of the data are available, an application's accesses to the data can be interleaved across multiple copies, providing increased access performance. If one copy of the data is corrupted, an application can continue to access the data in one of the other copies. As another example, parity information can be calculated for selected subsets of the data. If a subset of the data is corrupted, the remaining data is used, in conjunction with the parity information, to reconstruct the corrupted data.
A number of Redundant Array of Independent Disks (RAID) levels have been defined, each offering a unique set of performance and data-protection characteristics. RAID techniques are implemented on both physical storage devices and logical storage devices (referred to herein as volumes). Most RAID levels, such as RAID 1-6, maintain redundant data, either in the form of mirrors or parity. For example, RAID 1 provides one or more mirrored copies. Among the RAID configurations that use parity, RAID 2 uses a complex Hamming code calculation to generate the parity data, and consequently RAID 2 is not typically found in commercial implementations. RAID levels 3, 4, and 5 are, by contrast, often implemented. Each of RAID levels 3, 4, and 5 uses an exclusive-or (XOR) calculation to generate parity data. RAID 3 distributes bytes across multiple disks and calculates parity from related groups (referred to as stripes) of bytes. RAID 4 and RAID 5 arrays compute parity on an application-specific block size, called an interleave or stripe unit, which is a fixed-size data region that is accessed contiguously. All stripe units in the same stripe (i.e., all stripe units at the same depth or altitude on each drive) are used to compute the parity. RAID 4 stores parity on a single disk in the array, while RAID 5 removes a possible bottleneck on the parity drive by rotating parity across all drives in the set.
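By way of illustration only, the XOR parity calculation and the difference in parity placement between RAID 4 and RAID 5 can be sketched as follows. The function names, the two-byte stripe units, and the particular rotation order used for RAID 5 are assumptions made for this example; actual implementations may choose a different rotation scheme.

```python
from functools import reduce
from operator import xor


def xor_parity(stripe_units):
    """XOR equal-length stripe units byte by byte to form a parity unit."""
    return bytes(reduce(xor, column) for column in zip(*stripe_units))


def reconstruct(surviving_units, parity):
    """Rebuild one lost stripe unit from the surviving units and the parity."""
    return xor_parity(list(surviving_units) + [parity])


def parity_disk(stripe_index, num_disks, rotate):
    """Index of the disk holding parity for a given stripe.

    rotate=False keeps parity on one disk (RAID 4 style).
    rotate=True moves parity to a different disk per stripe (RAID 5 style);
    the rotation order here is an assumed, illustrative choice.
    """
    if not rotate:
        return num_disks - 1
    return (num_disks - 1 - stripe_index) % num_disks


# One stripe made of three hypothetical two-byte stripe units.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(stripe)

# Simulate losing the middle unit and rebuilding it from the rest.
rebuilt = reconstruct([stripe[0], stripe[2]], parity)
```

In this sketch, `rebuilt` equals the lost unit `stripe[1]`, because XOR-ing the parity with the surviving units cancels their contribution. With `rotate=True`, successive stripes place parity on different disks, so parity updates are not all directed at a single drive.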
For redundant data to protect against the failure of the original data, the redundant data must be synchronized with the application data when the redundant data is created. When an application can access any one of several mirrors, the data stored in the mirrors must also be kept synchronized so that a read request can be satisfied from any one of the mirrors. Consistency between mirrors is maintained by having each write operation write data to all mirrors (usually concurrently), and by allowing the write operation to complete only when all of the mirrors have been updated with the new data.
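The write rule described above can be sketched, under simplifying assumptions, with an in-memory model. The class name `MirroredVolume`, the fixed block size, and the round-robin read policy are hypothetical choices for this example, and the mirrors are updated one after another here rather than concurrently.

```python
from itertools import count


class MirroredVolume:
    """Toy mirrored volume: every mirror is a list of fixed-size blocks."""

    def __init__(self, num_mirrors, num_blocks, block_size=4):
        self.block_size = block_size
        self.mirrors = [
            [bytes(block_size)] * num_blocks for _ in range(num_mirrors)
        ]
        self._read_counter = count()

    def write(self, block, data):
        """Update every mirror; report completion only after all are updated."""
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        for mirror in self.mirrors:
            mirror[block] = data
        return True  # reached only once no mirror holds the old data

    def read(self, block):
        """Serve the read from the next mirror in round-robin order."""
        index = next(self._read_counter) % len(self.mirrors)
        return self.mirrors[index][block]


volume = MirroredVolume(num_mirrors=3, num_blocks=8)
volume.write(2, b"abcd")
reads = [volume.read(2) for _ in range(3)]  # one read per mirror
```

Because `write` does not return until every mirror holds the new block, each of the three reads, served by a different mirror, returns the same data.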
Redundant data must be synchronized with the original data in a volume, even if some regions of the volume have never been written by the application. For a mirrored volume, this synchronization can be provided by copying the content of one mirror to the rest of the mirrors when the mirrored volume is created or when a new mirror is added to the volume. For a volume that includes parity, the synchronization process involves computing one or more parity values from the application data within the volume. This synchronization process puts a heavy I/O load on the system, since the volume must be both read and written. If parity has to be calculated, additional CPU and memory resources are consumed as well. Another option is to initialize the volume by writing zeros to the entire volume. While this option is less I/O intensive (because volume contents are not read), the entire volume must still be written. Accordingly, techniques are desired to reduce the amount of computing resources and effort needed to initialize a volume that includes redundant data.
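The relative I/O cost of the initialization options described above can be illustrated with a small in-memory model. The `CountingDisk` class and the function names are hypothetical and exist only for this sketch. Zero-filling leaves a parity volume consistent because the XOR of all-zero stripe units is itself all zeros.

```python
from functools import reduce
from operator import xor


class CountingDisk:
    """In-memory disk that counts block reads and block writes."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.reads = 0
        self.writes = 0

    def read(self, index):
        self.reads += 1
        return self.blocks[index]

    def write(self, index, data):
        self.writes += 1
        self.blocks[index] = data


def sync_mirrors(source, targets):
    """Copy every block of one mirror to all other mirrors."""
    for index in range(len(source.blocks)):
        data = source.read(index)
        for target in targets:
            target.write(index, data)


def sync_parity(data_disks, parity):
    """Read every stripe and write its XOR parity to the parity disk."""
    for index in range(len(parity.blocks)):
        units = [disk.read(index) for disk in data_disks]
        parity.write(index, bytes(reduce(xor, col) for col in zip(*units)))


def zero_fill(disks, block_size):
    """Write zeros everywhere; no block is read."""
    zeros = bytes(block_size)
    for disk in disks:
        for index in range(len(disk.blocks)):
            disk.write(index, zeros)


# Two data disks with arbitrary, never-written contents plus a parity disk.
data = [CountingDisk([b"\x0f\xf0", b"\x01\x02"]),
        CountingDisk([b"\xaa\x55", b"\x03\x04"])]
par = CountingDisk([b"\x00\x00", b"\xff\xff"])
sync_parity(data, par)
parity_sync_io = (sum(d.reads for d in data), par.writes)

# Same-sized volume initialized by zero-filling instead.
zeroed = [CountingDisk([b"\x11\x11", b"\x22\x22"]) for _ in range(3)]
zero_fill(zeroed, block_size=2)
zero_fill_io = (sum(d.reads for d in zeroed), sum(d.writes for d in zeroed))
```

In this model, parity synchronization reads every data block and writes every parity block, whereas zero-filling performs no reads but still writes every block on every disk, which is the cost the passage above identifies.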