Various forms of network storage systems are known today. These forms include network attached storage (NAS), storage area networks (SANs), and others. Network storage systems are commonly used for a variety of purposes, such as providing multiple users with access to shared data, backing up critical data (e.g., by data mirroring), etc.
A network storage system can include at least one storage server, which is a processing system configured to store and retrieve data on behalf of one or more storage client processing systems (“clients”). In the context of NAS, a storage server may be a file server, which is sometimes called a “filer”. A filer operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical disks or tapes. The mass storage devices may be organized into one or more volumes of a Redundant Array of Inexpensive Disks (RAID). Filers are made by Network Appliance, Inc. of Sunnyvale, Calif. (NetApp®).
In a SAN context, the storage server provides clients with block-level access to stored data, rather than file-level access. Some storage servers are capable of providing clients with both file-level access and block-level access, such as certain Filers made by NetApp.
In a large scale storage system, it is inevitable that data will become corrupted or stored incorrectly from time to time. Consequently, virtually all modern storage servers implement various techniques for detecting and correcting errors in data. RAID schemes, for example, include built-in techniques to detect and, in some cases, to correct corrupted data. Error detection and correction is often performed by using a combination of checksums and parity. Error correction can also be performed at a lower level, such as at the disk level.
In file servers and other storage systems, occasionally a write operation executed by the server may fail to be committed to the physical storage media, without any error being detected. The write is, therefore, “lost”. This type of the fault is typically caused by faulty hardware in a disk drive or in a disk drive adapter dropping the write silently without reporting any error. It is desirable for a storage server to be able to detect and correct such “lost writes” any time data are read.
While modern storage servers employ various error detection and correction techniques, these approaches are inadequate for purposes of detecting this type of error. For example, in at least one well-known class of file server, files sent to the file server for storage are first broken up into 4 KByte blocks, which are then formed into groups that are stored in a “stripe” spread across multiple disks in a RAID array. Just before each block is stored to disk, a checksum is computed for that block, which can be used when that block is subsequently read to determine if there is an error in the block. In one known implementation, the checksum is included in a 64-Byte identity/signature structure that is appended to the end of the block when the block is stored. The identity/signature structure also contains: a volume block number (VBN) which identifies the logical block number where the data are stored (since RAID aggregates multiple physical drives as one logical drive); a disk block number (DBN) which identifies the physical block number within the disk in which the block is stored; and an embedded checksum for the identity/signature structure itself. This error detection technique is sometimes referred to as “block-appended checksum”.
Block-appended checksum can detect corruption due to bit flips, partial writes, sector shifts and block shifts. However, it cannot detect corruption due to a lost block write, because all of the information included in the identity/signature structure will appear to be valid even in the case of a lost write.
Parity in single parity schemes such as RAID-3, RAID-4 or RAID-5 can be used to determine whether there is a corrupted block in a stripe due to a lost write. This can be done by comparing the stored and computed values of parity, and if they do not match, the data may be corrupt. However, in the case of single parity schemes, while a single bad block can be reconstructed from the parity and remaining data blocks, there is not enough information to determine which disk contains the corrupted block in the stripe. Consequently, the corrupted data block cannot be recovered using parity.
Mirroring data to two disks (such as with RAID 1, RAID 10, RAID 0-1, etc.) also does not allow correction of lost writes. A mismatch between two copies of data without checksum errors may indicate lost writes, but it is not possible to determine which data are correct.
Another technique, which is referred to herein as “RAID-DP”, uses diagonal parity in addition to row parity, and is described in U.S. patent application Publication no. 2003/0126523, which is assigned to the assignee of the present application. RAID-DP allows two bad blocks in a parity group to be reconstructed when their positions are known.
It is desirable to be able to detect and correct an error in any block anytime there is a read of that block. However, checking parity in both RAID-4 and RAID-DP is “expensive” in terms of computing resources, and therefore is normally only done when operating in a “degraded mode”, i.e., when an error has been detected, or when scrubbing parity (normally, the parity information is simply updated when a write is done). Hence, using parity to detect a bad block on file system reads is not a practical solution, because it can cause potentially severe performance degradation due to accessing the other disks in the group for parity computation.
Read-after-write is another known mechanism to detect data corruption. In that approach, a data block is read back immediately after writing it and is compared to the data that was written. If the data read back is not the same as the data that was written, then this indicates the write did not make it to the storage block. Read-after-write can reliably detect corrupted block due to lost writes, however, it also has a severe performance impact, because every write operation is followed by a read operation.