A typical operating system includes a file system. The file system provides a mechanism for the storage and retrieval of files and a hierarchical directory structure for the naming of multiple files. More specifically, the file system stores information provided by the user (i.e., data) and information describing the characteristics of the data (i.e., metadata). The file system also provides extensive programming interfaces to enable the creation and deletion of files, reading and writing of files, performing seeks within a file, creating and deleting directories, managing directory contents, etc. In addition, the file system provides management interfaces to create and delete file systems. File systems are typically controlled and restricted by operating system parameters. For example, most operating systems limit the maximum number of file names that can be handled within their file system. Some operating systems also limit the size of files that can be managed under a file system.
An application, which may reside on the local system (i.e., computer) or may be located on a remote system, uses files as an abstraction to address data. Conventionally, this data is stored on a storage device, such as a disk.
Data stored as files in a file system may be replicated using one or more replication schemes. Replication schemes are typically used to enable recovery of data in the event of file system failures, data corruption, etc. Data replication ensures continuous availability and protection of data stored on disk. The following is a non-exclusive list of common replication schemes: redundant arrays of independent disks (RAID) schemes, 2-way mirroring, 3-way mirroring, etc. Typically, the level of granularity available for replication of data is a file.
There are many RAID schemes currently available. One common RAID scheme is RAID-5. In general, RAID-5 is used to replicate data across multiple physical disks organized in an array. More specifically, the physical disks in the data storage system are typically segmented into blocks of data space. A block may comprise any appropriate number of bytes of data (e.g., 512 bytes, 1024 bytes, etc.). In RAID-5, data to be stored is divided into data blocks and the resulting data blocks are XORed to obtain a parity block. The parity block corresponds to a block that is used to recover part of the data in the event that one of the aforementioned data blocks is corrupted or the disk, upon which the data block is stored, fails. The data blocks and the parity block are then written to the multiple disks by striping the data blocks across the multiple disks.
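The parity computation described above can be illustrated with a minimal Python sketch. The helper name xor_blocks, the four-byte block size, and the sample block contents are assumptions chosen for illustration only; an actual RAID-5 implementation operates on much larger blocks at the device level.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the byte-wise XOR of equally sized blocks (hypothetical helper)."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same size")
    # zip(*blocks) groups the i-th byte of every block into one column.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Four illustrative data blocks and the parity block derived from them.
data_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(*data_blocks)

# Simulate the loss of the third data block and rebuild it by XORing
# the parity block with the surviving data blocks.
survivors = data_blocks[:2] + data_blocks[3:]
rebuilt = xor_blocks(parity, *survivors)
```

Because XOR is its own inverse, XORing the parity block with the surviving data blocks yields the missing block, which is why a single parity block is sufficient to recover from the loss of any one block in the set.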
The following is a brief example illustrating the operation of RAID-5. Initially, a request is received to write data to the disk. Assuming that there are five disks in the system, the data to be written is divided into data blocks. Further, one parity block is created for each set of four data blocks. The four data blocks and the parity block correspond to a stripe. Once all the parity blocks have been created, the data blocks and the corresponding parity blocks are written to disk, in stripes, where each stripe spans all five disks and includes four data blocks and one parity block.
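The five-disk example can be sketched in Python as follows. The function name build_stripes, the four-byte block size, and the zero padding of a short final stripe are illustrative assumptions. The sketch also rotates the position of the parity block from stripe to stripe, as RAID-5 layouts commonly do; that rotation is a detail not stated in the example above.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks (hypothetical helper)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def build_stripes(data: bytes, block_size: int = 4, num_disks: int = 5):
    """Split data into stripes of (num_disks - 1) data blocks plus one parity block.

    Returns a list of stripes; stripe[d] is the block destined for disk d.
    """
    data_per_stripe = num_disks - 1
    stripe_bytes = block_size * data_per_stripe
    # Assumption: pad the tail with zero bytes so the last stripe is full.
    stripe_count = -(-len(data) // stripe_bytes)  # ceiling division
    padded = data.ljust(stripe_count * stripe_bytes, b"\0")

    stripes = []
    for s in range(stripe_count):
        chunk = padded[s * stripe_bytes:(s + 1) * stripe_bytes]
        blocks = [chunk[i:i + block_size] for i in range(0, stripe_bytes, block_size)]
        parity = xor_blocks(blocks)
        # Assumption: rotate the parity block one disk to the left per stripe.
        parity_disk = (num_disks - 1 - s) % num_disks
        stripe = list(blocks)
        stripe.insert(parity_disk, parity)
        stripes.append(stripe)
    return stripes


stripes = build_stripes(b"0123456789abcdefghijklmnopqrstuv")
```

With 32 bytes of input and four-byte blocks, the sketch produces two stripes, each holding four data blocks and one parity block, one block per disk.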
In the event that an entire stripe is not written to the disks (i.e., one or more data blocks or the corresponding parity block is not written to disk), then the parity block of the stripe will be inconsistent with the data blocks in the stripe. As a result, the data blocks in the stripe cannot be recovered using the parity block. The aforementioned issue, commonly known as a “write-hole,” has been addressed using hardware-based solutions.
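The inconsistency described above can be reproduced with a small simulation. The sketch below is illustrative only: the function name simulate_write_hole, the block contents, and the choice of which write is interrupted and which disk later fails are all assumptions.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks (hypothetical helper)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def simulate_write_hole():
    # A consistent stripe on five disks: four data blocks plus parity on disk 4.
    old = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
    disks = old + [xor_blocks(old)]

    # A full-stripe overwrite is requested with these new data blocks.
    new = [b"ZZZZ", b"YYYY", b"WWWW", b"VVVV"]

    # Simulated crash: only the first new data block reaches disk. The other
    # data blocks and the parity block still hold their old contents.
    disks[0] = new[0]

    # Later, disk 2 fails and is rebuilt from the remaining four disks.
    rebuilt = xor_blocks([disks[0], disks[1], disks[3], disks[4]])
    return rebuilt, old[2], new[2]


rebuilt, old_block, new_block = simulate_write_hole()
```

The rebuilt block matches neither the old nor the new contents of disk 2, because the parity block on disk still describes the old stripe while one of the data blocks used in the reconstruction is already new.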
Continuing with the discussion of RAID schemes, to modify data already written to disk using RAID-5 replication, the old data block that is to be modified is XORed with the new data block to obtain a delta block. The delta block is subsequently XORed with the old parity block to obtain a new parity block. Then, the new data block and the new parity block are written to disk. The aforementioned sequence generates two read operations (i.e., one read operation to read the old data block and one read operation to read the old parity block) and two write operations (i.e., one write operation to write the new data block and one write operation to write the new parity block).
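The modification sequence can be sketched in Python under the standard RAID-5 small-write identity, in which the new parity block equals the old parity block XORed with both the old and the new data block. The names xor_pair and small_write, and the list used to stand in for the disks, are illustrative assumptions.

```python
def xor_pair(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(disks, data_index, parity_index, new_data):
    """Update one data block and its parity block in place.

    `disks` is a list holding one block per disk for a single stripe.
    """
    old_data = disks[data_index]        # read 1: old data block
    old_parity = disks[parity_index]    # read 2: old parity block

    delta = xor_pair(old_data, new_data)        # what changed in the data
    new_parity = xor_pair(old_parity, delta)    # fold the change into parity

    disks[data_index] = new_data        # write 1: new data block
    disks[parity_index] = new_parity    # write 2: new parity block


# One stripe: four data blocks followed by their parity block.
stripe = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
parity = stripe[0]
for block in stripe[1:]:
    parity = xor_pair(parity, block)
disks = stripe + [parity]

small_write(disks, data_index=1, parity_index=4, new_data=b"QRST")
```

Only the modified data block and the parity block are read and written; the other three data blocks in the stripe are never touched, which is what keeps the cost at two reads and two writes.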
In some instances, because the delta blocks are generated using only one of the data blocks striped across the multiple disks rather than all of the data blocks that correspond to the parity block, if one of the two write operations fails and either the new data block or the new parity block does not get written to a disk, then the modified data is not recoverable.