Various forms of network storage systems are known today. These forms include network attached storage (NAS), storage area networks (SANs), and others. Network storage systems are commonly used for a variety of purposes, such as providing multiple users with access to shared data, backing up critical data (e.g., by data mirroring), etc.
A network storage system includes at least one storage server, which is a processing system configured to store and retrieve data on behalf of one or more client processing systems (“clients”). In the context of NAS, a storage server may be a file server, which is sometimes called a “filer”. A filer operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical disks or tapes. The mass storage devices may be organized into one or more volumes of a Redundant Array of Inexpensive Disks (RAID). Enterprise-level filers are made by Network Appliance, Inc. of Sunnyvale, Calif. (NetApp®).
In a SAN context, the storage server provides clients with block-level access to stored data, rather than file-level access. Some storage servers are capable of providing clients with both file-level access and block-level access, such as certain filers made by NetApp.
In a large scale storage system, it is inevitable that data will become corrupted or stored incorrectly from time to time. Consequently, virtually all modern storage servers implement various techniques for detecting and correcting errors in data. RAID schemes, for example, include built-in techniques to detect and, in some cases, to correct corrupted data. Error detection and correction is often performed by using a combination of checksums and parity. Error correction can also be performed at a lower level, such as at the disk level.
In file servers and other storage systems, a write operation executed by the server may occasionally fail to be committed to the physical storage media without any error being detected. The write is, therefore, “lost”. This type of fault is typically caused by faulty hardware in a disk drive or in a disk drive adapter dropping the write silently without reporting any error. It is desirable for a storage server to be able to detect and correct such “lost writes” any time data is read.
While modern storage servers employ various error detection and correction techniques, these approaches are inadequate for purposes of detecting this type of error. For example, in at least one well-known class of file server, files sent to the file server for storage are first broken up into 4 KByte blocks, which are then formed into groups that are stored in a “stripe” spread across multiple disks in a RAID array. Just before each block is stored to disk, a checksum is computed for that block, which can be used when that block is subsequently read to determine if there is an error in the block. In one known implementation, the checksum is included in a 64 Byte metadata field that is appended to the end of the block when the block is stored. The metadata field also contains: a volume block number (VBN) which identifies the logical block number where the data is stored (since RAID aggregates multiple physical drives as one logical drive); a disk block number (DBN) which identifies the physical block number within the disk in which the block is stored; and an embedded checksum for the metadata field itself. This error detection technique is referred to as “block-appended checksum”.
Block-appended checksum can detect corruption due to bit flips, partial writes, sector shifts and block shifts. However, it cannot detect corruption due to a lost block write, because all of the information included in the metadata field will appear to be valid even in the case of a lost write.
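The behavior described above can be illustrated with a minimal sketch. The field order, the padding, the use of CRC-32, and the names `seal` and `verify` are assumptions made for illustration only; the text specifies only the 4 KByte block, the 64 Byte metadata field, and the fields it contains.

```python
import struct
import zlib

BLOCK_SIZE = 4096   # 4 KByte data block, as in the text
META_SIZE = 64      # 64 Byte appended metadata field, as in the text

# Hypothetical layout: data checksum, VBN, DBN, zero padding (60 bytes),
# followed by a 4-byte embedded checksum over those 60 bytes.
_META_BODY = struct.Struct("<IQQ40x")

def seal(data: bytes, vbn: int, dbn: int) -> bytes:
    """Return the data block with its metadata field appended."""
    assert len(data) == BLOCK_SIZE
    body = _META_BODY.pack(zlib.crc32(data), vbn, dbn)
    return data + body + struct.pack("<I", zlib.crc32(body))

def verify(stored: bytes, vbn: int, dbn: int) -> bool:
    """Check a block read from location (vbn, dbn) against its metadata."""
    data, meta = stored[:BLOCK_SIZE], stored[BLOCK_SIZE:]
    body, (meta_crc,) = meta[:60], struct.unpack("<I", meta[60:])
    if zlib.crc32(body) != meta_crc:
        return False                      # metadata field itself is damaged
    data_crc, stored_vbn, stored_dbn = _META_BODY.unpack(body)
    return data_crc == zlib.crc32(data) and (stored_vbn, stored_dbn) == (vbn, dbn)

# Old contents of the block at VBN 7 / DBN 3.
old = seal(b"A" * BLOCK_SIZE, vbn=7, dbn=3)
disk = {3: old}

# Bit flip: one bit of the stored data changes.
flipped = bytearray(old)
flipped[10] ^= 0x01
bit_flip_detected = not verify(bytes(flipped), 7, 3)

# Block shift: a block written for DBN 4 is returned for a read of DBN 3.
misplaced = seal(b"B" * BLOCK_SIZE, vbn=8, dbn=4)
block_shift_detected = not verify(misplaced, 7, 3)

# Lost write: new data for DBN 3 is sealed, but the drive silently drops it,
# so disk[3] still holds the old block with its old, self-consistent metadata.
new = seal(b"C" * BLOCK_SIZE, vbn=7, dbn=3)
lost_write_detected = not verify(disk[3], 7, 3)

print(bit_flip_detected, block_shift_detected, lost_write_detected)
```

In this sketch the stale block passes verification because its checksum, VBN, and DBN are all correct for the data that was previously stored there; nothing in the metadata field records that newer data should have replaced it.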
Parity in single parity schemes such as RAID-4 or RAID-5 can be used to determine whether there is a corrupted block in a stripe due to a lost write. This can be done by comparing the stored and computed values of parity; if they do not match, the data may be corrupt. However, in the case of single parity schemes, while a single bad block can be reconstructed from the parity and the remaining data blocks, there is not enough information to determine which disk contains the corrupted block in the stripe. Consequently, the corrupted data block cannot be recovered using parity.
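The limitation of single parity can be seen in a small sketch, assuming XOR parity over a stripe of three 8-byte data blocks (the block size, the values, and the helper name `xor_blocks` are illustrative only). The parity mismatch reveals that something in the stripe is wrong, but reconstructing any one of the disks from the others yields a stripe that is consistent with parity, so the mismatch alone does not identify the stale block.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A stripe of three data blocks plus one parity block.
data = [bytes([1] * 8), bytes([2] * 8), bytes([4] * 8)]
parity = xor_blocks(data)

# The block on disk 1 is overwritten. Parity is updated for the new data,
# but the data write itself is silently lost, so data[1] stays stale.
new_block = bytes([9] * 8)
parity = xor_blocks([parity, data[1], new_block])

# Comparing stored and computed parity shows the stripe is inconsistent.
mismatch = xor_blocks(data) != parity

# Treat each disk in turn as the bad one and rebuild it from the rest.
rebuilt = []
consistent = []
for i in range(len(data)):
    others = data[:i] + data[i + 1:]
    candidate = xor_blocks(others + [parity])
    rebuilt.append(candidate)
    trial = data[:i] + [candidate] + data[i + 1:]
    consistent.append(xor_blocks(trial) == parity)

print(mismatch, consistent)
```

Only the reconstruction of disk 1 recovers the intended data in this sketch, yet all three trial stripes satisfy the parity equation, which is why an additional source of information is needed to locate the lost write.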
Another technique, which is referred to herein as RAID Double Parity (RAID-DP), is described in U.S. Patent Application Publication no. 2003/0126523. RAID-DP allows two bad blocks in a parity group to be reconstructed when their positions are known.
It is desirable to be able to detect and correct an error in any block any time there is a read of that block. However, checking parity in both RAID-4 and RAID-DP is “expensive” in terms of computing resources, and therefore is normally only done when operating in a “degraded mode”, i.e., when an error has been detected, or when scrubbing parity (normally, the parity information is simply updated when a write is done). Hence, using parity to detect a bad block on file system reads is not a practical solution, because it can cause potentially severe performance degradation due to parity computation.
Read-after-write is another known mechanism to detect data corruption. In that approach, a data block is read back immediately after writing it and is compared to the data that was written. If the data read back is not the same as the data that was written, then this indicates the write did not make it to the storage block. Read-after-write can reliably detect corrupted blocks due to lost writes; however, it also has a severe performance impact, because every write operation is followed by a read operation.
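A minimal sketch of read-after-write follows, using a hypothetical in-memory `FlakyDisk` class that silently drops writes to selected block numbers. The class, its counters, and the function `write_verified` are assumptions introduced purely for illustration, and the sketch ignores drive caches.

```python
class FlakyDisk:
    """In-memory stand-in for a drive that silently drops selected writes."""

    def __init__(self, drop_writes_to=()):
        self.blocks = {}
        self.drop = set(drop_writes_to)
        self.reads = 0
        self.writes = 0

    def write(self, dbn, data):
        self.writes += 1
        if dbn in self.drop:
            return                # write is lost; no error is reported
        self.blocks[dbn] = data

    def read(self, dbn):
        self.reads += 1
        return self.blocks.get(dbn, b"")

def write_verified(disk, dbn, data):
    """Write a block, read it back immediately, and compare."""
    disk.write(dbn, data)
    return disk.read(dbn) == data

disk = FlakyDisk(drop_writes_to={5})
ok_good = write_verified(disk, 4, b"new data")   # write reaches the media
ok_lost = write_verified(disk, 5, b"new data")   # write is silently dropped

print(ok_good, ok_lost, disk.writes, disk.reads)
```

The read counter equals the write counter in this sketch, which reflects the cost noted above: every write incurs an additional read.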
Another mechanism is described in the parent of the present application, i.e., U.S. patent application Ser. No. 10/951,644, filed on Sep. 27, 2004 and entitled, “Use of Application-Level Context Information to Detect Corrupted Data in a Storage System,” of J. Kimmel et al. The described mechanism stores file system context information in block-appended checksums, for use in detecting lost writes. However, this mechanism can detect data corruption only when the data blocks are accessed through the file system. When block reads are initiated by the RAID layer, such as to compute parity, to “scrub” (verify parity on) a volume, or to reconstruct a block (e.g., from a failed disk), the RAID layer does not have the context information of the blocks. Therefore, this mechanism does not help detect lost writes on RAID-generated reads. RAID-generated reads for parity computations can propagate corruption to parity. Therefore, protection of RAID-generated reads can be crucial in making a storage server resilient to lost writes.