A storage server is a special-purpose processing system used to store and retrieve data on behalf of one or more client processing systems (“clients”). A storage server can be used for many different purposes, such as to provide multiple users with access to shared data or to back up mission-critical data.
One example of a storage server is a file server. A file server operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical storage based disks or tapes. The mass storage devices may be organized into one or more volumes of Redundant Array of Inexpensive Disks (RAID), sometimes known as “Redundant Array of Independent Disks.” Another example of a storage server is a device which provides clients with block-level access to stored data, rather than file-level access, or a device which provides clients with both file-level access and block-level access.
In a large-scale storage system, it is inevitable that data will become corrupted from time to time. Consequently, virtually all modern storage servers implement techniques for protecting the stored data. Currently, these techniques involve calculating parity data and storing the parity data in various locations. For example, each parity datum may be an exclusive-OR (XOR) of data blocks. The data blocks may be stored in a “stripe” spread across multiple disks in an array. In a single parity scheme, e.g., RAID-4 or RAID-5, each stripe of data is protected by a single parity datum. Accordingly, each data block in the stripe is protected by a single parity datum. In a dual parity scheme, e.g., RAID-6 or RAID Double Parity (RAID-DP), a technique invented by Network Appliance Inc. of Sunnyvale, Calif., each data block is protected by two parity data. The second parity datum may be a mirror of the first parity datum or an XOR of a different set of data blocks, for example.
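As an illustration of the single parity case described above, the following is a minimal sketch, assuming equal-sized blocks held in memory as byte strings. The names `xor_blocks` and `stripe_parity` are hypothetical and do not correspond to any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_parity(data_blocks):
    """Parity datum for one stripe: the XOR of all its data blocks."""
    return reduce(xor_blocks, data_blocks)


# A toy stripe of three 2-byte data blocks, one per (hypothetical) disk.
stripe = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = stripe_parity(stripe)

# XOR-ing every data block together with the parity yields all zeros,
# which is the property that later makes reconstruction possible.
check = reduce(xor_blocks, stripe + [parity])
```

In this toy stripe the parity works out to the two bytes 0x69 0x69, and `check` is all zeros.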
The parity protection schemes described above provide some data protection yet have several disadvantages. For example, in the above schemes, Write operations have significant overhead. In RAID-5 schemes, small writes often require two reads and two writes. Under one RAID-5 scheme, an existing parity block is read. An existing data block where the new data is to be written is also read. The existing parity block and existing data block are XOR'ed, and the result is XOR'ed with the new data to arrive at a new parity. The new data and the new parity are written to the storage devices. Thus, two reads (one of the existing parity block and one of the existing data block) and two writes (one of the new data and one of the new parity) are required. This process is sometimes referred to as “Read Modify Write.” While the two Read operations may be done in parallel, as can the two Write operations, modifying a block of data in a RAID-5 system may still take substantially longer than in a system that does not require a preliminary Read operation. In some systems, the preliminary Read operation requires the system to wait for the storage devices (e.g., disk drives) to rotate back to a previous position before performing the Write operation. The rotational latency alone can amount to about 50% of the time required for a data modification operation. Further, two disk storage units are involved for the duration of each data modification operation, limiting the performance of the system as a whole.
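The Read Modify Write sequence above can be sketched as follows. This is an illustrative model only: `CountingDisk` and `small_write` are hypothetical names, each "disk" holds a single in-memory block, and the counters exist solely to make the two reads and two writes visible.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class CountingDisk:
    """Hypothetical one-block 'disk' that counts its reads and writes."""

    def __init__(self, block: bytes):
        self.block = block
        self.reads = 0
        self.writes = 0

    def read(self) -> bytes:
        self.reads += 1
        return self.block

    def write(self, block: bytes) -> None:
        self.writes += 1
        self.block = block


def small_write(data_disk, parity_disk, new_data: bytes) -> None:
    """Read Modify Write: update one data block and the stripe's parity."""
    old_parity = parity_disk.read()            # read 1: existing parity
    old_data = data_disk.read()                # read 2: existing data
    # XOR-ing old parity with old data removes the old block's contribution;
    # XOR-ing the result with the new data adds the new block's contribution.
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    data_disk.write(new_data)                  # write 1: new data
    parity_disk.write(new_parity)              # write 2: new parity


data_disks = [CountingDisk(b"\x01\x02"),
              CountingDisk(b"\x10\x20"),
              CountingDisk(b"\x0f\x0f")]
initial_parity = xor_blocks(
    xor_blocks(data_disks[0].block, data_disks[1].block), data_disks[2].block)
parity_disk = CountingDisk(initial_parity)

small_write(data_disks[1], parity_disk, b"\xaa\xbb")

# Recompute parity from scratch to confirm the incremental update is correct.
expected_parity = xor_blocks(
    xor_blocks(data_disks[0].block, data_disks[1].block), data_disks[2].block)
total_reads = sum(d.reads for d in data_disks) + parity_disk.reads
total_writes = sum(d.writes for d in data_disks) + parity_disk.writes
```

After the call, the incrementally updated parity matches a full recomputation, and the counters show two reads and two writes across only two of the four devices.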
To avoid reading before writing under a RAID-5 scheme, an entire stripe has to be written, including the recalculated parity. This process is sometimes referred to as “Full Stripe Write.” Other write operations may read data blocks not being written (e.g. “Partial Stripe Write”). Yet others may eliminate some reads and writes of parity but still require one read and write for each data drive and one read and write for the parity drive (e.g. “Combined Read Modify Write” which writes to more than one data block in the stripe).
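The difference between a Full Stripe Write and a Partial Stripe Write can be sketched in terms of how the new parity is obtained. The sketch below is an assumption-laden model, not a description of any specific product: `full_stripe_write` and `partial_stripe_write` are hypothetical names, blocks are in-memory byte strings, and each function returns the new parity together with the number of preliminary reads it would need.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_write(new_blocks):
    """All data blocks are replaced, so parity comes from the new data alone.

    Returns (new_parity, preliminary_reads).
    """
    return reduce(xor_blocks, new_blocks), 0


def partial_stripe_write(current_stripe, updates):
    """Only some blocks are replaced; the untouched blocks are read instead.

    current_stripe: list of blocks as currently stored.
    updates: dict mapping block index -> new block contents.
    Returns (new_parity, preliminary_reads).
    """
    untouched = [blk for i, blk in enumerate(current_stripe)
                 if i not in updates]
    new_parity = reduce(xor_blocks, list(updates.values()) + untouched)
    return new_parity, len(untouched)


stripe = [b"\x01", b"\x02", b"\x04", b"\x08"]

# Replace two of the four blocks: the other two must be read first.
partial_parity, partial_reads = partial_stripe_write(
    stripe, {0: b"\x10", 1: b"\x20"})

# Replace the whole stripe with the same final contents: no reads needed.
full_parity, full_reads = full_stripe_write(
    [b"\x10", b"\x20", b"\x04", b"\x08"])
```

Both paths arrive at the same parity for the same final stripe contents; they differ only in how many blocks must be read beforehand.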
Additionally, under the RAID schemes described above, an array of storage devices may operate in a “degraded mode.” Operation in this mode may occur when a system operates with a failed storage device. The system ignores the failure and continues to read and write to the remaining devices in the array if possible (e.g. when only one device has failed in a single drive protection scheme). Performance suffers because, when a block is read from the failed drive, all other blocks in the stripe must be read. These other blocks are used to reconstruct the faulty or missing block. All blocks on the failed drive must be reconstructed using parity data and the other data.
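Reconstruction in degraded mode, as described above, can be illustrated for the single parity case with a short sketch. The function name `reconstruct_block` is hypothetical, and the example assumes exactly one missing block per stripe with all other blocks readable.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct_block(surviving_blocks, parity: bytes) -> bytes:
    """Rebuild the single missing block of a stripe.

    Every surviving data block in the stripe, plus the parity block,
    must be read; their XOR equals the missing block.
    """
    return reduce(xor_blocks, surviving_blocks, parity)


# Build a healthy stripe and its parity.
stripe = [b"\xde\xad", b"\xbe\xef", b"\x12\x34"]
parity = reduce(xor_blocks, stripe)

# Simulate the failure of the device holding block 1.
failed_index = 1
surviving = [blk for i, blk in enumerate(stripe) if i != failed_index]

recovered = reconstruct_block(surviving, parity)
blocks_read = len(surviving) + 1  # all surviving data blocks plus parity
```

Serving a single block from the failed device here costs three reads instead of one, which is the performance penalty the paragraph describes.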
In the schemes above, because Writes to a failed device or member (e.g., a disk drive) require access to all other devices in the array, data may be lost if a media failure occurs during operation in degraded mode. For example, if a media error occurs in a RAID-5 array operating with a failed device, the data blocks in both the failed device and the device having the media error cannot be recovered. Encountering media errors while reconstructing a failed device is a common problem with RAID-5 arrays. Therefore, although data is still available in degraded mode, the array is vulnerable to failure if one of the remaining devices fails.
Therefore, what is needed is an improved technique for protecting data in a storage system.