Disk drives are designed to store and retrieve data. With increasing capacities and higher densities, disk drives are becoming less reliable in performing these functions.
Three disk behaviors contribute to corruption of data stored on a disk drive. During a write, the disk arm and head must align with high precision on the track that comprises the physical block in order to deposit the new “bits” of write data. Two tracking errors can occur during a write: either the head is misaligned so badly that the data is written to a completely unintended track, or the head is misaligned so that the data falls in a gap between two adjacent tracks.
In the former case, called a Far Off-track Write, two physical blocks are placed in error because the target block is not overwritten and so comprises stale data and the overwritten block has lost the data that should be there. In the latter case, called a Near Off-track Write, one block is placed in error because the target block is not overwritten.
A second type of error that also occurs during a write happens when the bits are not changed on the disk, for example, if the preamp signal is too weak to change the magnetic setting of the bits on the platter. In this case, the data remaining on the platter is stale (i.e., the data is not up-to-date with the write commands issued to the drive). These errors are called dropped writes because the bits are not recorded on the platter.
Both of the above-mentioned types of write errors are called “Undetected Write Errors” because the disk either drops the write data or places it in the wrong location and does not itself detect the problem. In the literature, the terms “dropped write” or “phantom write” are sometimes used to describe some or all of these situations.
A third type of error is a misaligned head placement when reading data. In this case, the disk may read the data bits from a completely unintended track (i.e., Far Off-track Read) or from a gap between two tracks (i.e., Near Off-track Read) and return incorrect data to the user or application. Both of these errors are typically transient and are corrected when a subsequent read occurs to the same track. In addition, if the read tracks correctly but on the unintended target of a Far Off-track Write, incorrect data will be returned to the user or application.
In all the above scenarios, the drive typically does not detect a problem and returns a successful status notice to the user, host or application. Other error scenarios may also occur where the disk returns a success status while the user or application gets incorrect data. Such write or read errors can be referred to as Undetected Disk Error (UDE). Because a disk drive cannot independently detect UDEs, other methods need to be provided to detect such errors. Two main solution classes are available in the related art for verifying the accuracy of data read or written to disk drives.
The first class is the file system or the application layer. For example, some file systems and many database systems use checksums on data chunks (e.g., 4 KB chunks), which are stored separately from the data chunks themselves. The checksums are read along with the data chunks; checksums are then recomputed from the read data chunks and compared with the stored checksums. If a recomputed checksum matches the stored one, then the read data chunk is assumed to be correct.
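The checksum comparison described above can be sketched as follows. This is a minimal illustration, not any particular file system's implementation: the use of CRC32 and the names `write_chunks` and `verify_chunk` are assumptions made for the example, and two in-memory lists stand in for the separately stored data and checksum regions.

```python
import zlib

CHUNK_SIZE = 4096  # 4 KB chunks, matching the example in the text


def write_chunks(data: bytes):
    """Split data into chunks and compute one checksum per chunk.

    Returns (chunks, checksums). The two lists are hypothetical stand-ins
    for a data region and a separately stored checksum region.
    """
    chunks = [bytearray(data[i:i + CHUNK_SIZE])
              for i in range(0, len(data), CHUNK_SIZE)]
    checksums = [zlib.crc32(chunk) for chunk in chunks]
    return chunks, checksums


def verify_chunk(chunks, checksums, index: int) -> bool:
    """Recompute the checksum of a read chunk and compare it with the stored one."""
    return zlib.crc32(chunks[index]) == checksums[index]


# 12288 bytes of sample data give exactly three 4 KB chunks.
chunks, checksums = write_chunks(bytes(range(256)) * 48)
ok_before = all(verify_chunk(chunks, checksums, i) for i in range(len(chunks)))

# Simulate an undetected disk error: chunk 1 silently returns different bytes.
chunks[1][0] ^= 0xFF
ok_after = [verify_chunk(chunks, checksums, i) for i in range(len(chunks))]
```

Here `ok_before` is `True` and `ok_after` is `[True, False, True]`: the mismatch flags chunk 1, but nothing in this layer can rebuild the correct contents.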
The above method has two fundamental limitations. First, it typically cannot recover from detected errors unless it is also integrated with some additional data redundancy such as a redundant array of independent disk drives (RAID). Second, the file system or application layer is not always the source of every disk read, and so checking may not occur as often as necessary.
For example, when the source of a disk read is not the file system or application layer, an underlying (and logically separate) layer in a RAID architecture may perform reads in the context of an application write (e.g., in a read-modify-write scenario). The application layer does not validate these types of reads. In such a case, the read may extract incorrect data from the disk and then use this incorrect data to update the RAID redundancy data. Thus, an error that goes undetected by the application may propagate errors in the underlying RAID layer, compounding the problem created by the drive.
RAID is a disk subsystem that is used to increase performance and/or provide fault tolerance. RAID architecture comprises a plurality of disk drives and a disk controller (also known as an array controller). RAID improves performance by disk striping, which interleaves bytes or groups of bytes across multiple drives, so more than one disk is reading and writing simultaneously. Fault tolerance is also achieved in a RAID architecture by way of implementing mirroring or parity.
U.S. Pat. No. 7,020,805, “Efficient Mechanisms for Detecting Phantom Write Errors”, US Patent Application 2006/0200497, “Detection and Recovery of Dropped Writes in Storage Devices”, and published paper “A Client-based Transaction System to Maintain Data Integrity”, by William Paxton, in Proceedings of the seventh ACM symposium on Operating systems principles, 1979, pp 18-23 provide examples of such systems.
A second class of methods to detect UDEs is implemented in the storage system itself, at a layer that is closer to the hardware layer so that every disk read and write that occurs in the system is monitored, whether the read or write is generated by the application layers or by the storage system layer itself. This class, however, cannot detect errors that occur in system layers that are higher than the storage system (e.g., in the network or internal host busses). It is desirable to have a method that not only detects a problem but is also capable of locating where the error occurs and, further, of correcting the error if possible.
There are a number of subclasses of methods that can be used within the storage system for detection, and possibly location and correction, of UDEs. The first is based on parity scrubbing. RAID systems that protect against disk failures (such as RAID1 or RAID5) may use a method called “parity scrub” to detect these sorts of errors. For example, in a RAID5 system, the process involves reading the data and the respective redundancy data (i.e., parity data), recomputing the parity value and comparing the computed parity value with the parity value read from disk.
If the two parity values do not match, then an error has occurred. Unfortunately, RAID5 does not provide a means to locate or correct an error detected in the above manner. More importantly, these parity scrubs may not detect errors that have been masked by other operations that were applied to data between the occurrence of a UDE and the parity scrub operation.
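A parity scrub of a single RAID5 stripe can be sketched as below. The byte-wise XOR parity is standard for RAID5; the function names and the tiny four-byte strips are assumptions chosen to keep the example short.

```python
from functools import reduce


def xor_parity(strips):
    """XOR equal-length byte strips together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def parity_scrub(data_strips, stored_parity) -> bool:
    """Return True when the recomputed parity matches the parity read from disk."""
    return xor_parity(data_strips) == stored_parity


# Four data strips and the parity computed over them.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40",
        b"\xaa\xbb\xcc\xdd", b"\x0f\x0e\x0d\x0c"]
parity = xor_parity(data)
clean = parity_scrub(data, parity)

# Simulate a UDE: the strip at index 2 holds bytes other than those the
# parity was computed over.
corrupted = list(data)
corrupted[2] = b"\x00\x00\x00\x00"
detected = not parity_scrub(corrupted, parity)
```

Both `clean` and `detected` are `True`. The scrub reports only that the stripe is inconsistent; with a single parity value, any one of the five strips could be the bad one.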
For example, a UDE may occur during a write to a first disk in a RAID5 array that comprises four data disks and one parity disk. Subsequently, a write may be issued to the array for the second, third and fourth disks. Typically, an array will promote this operation to a full write by reading the data from the first disk, computing parity and writing out the new data to second, third and fourth disks and to the parity disk. After this operation, the data on the first disk is still incorrect, but the parity is now consistent with all the data (i.e., the parity now comprises the bad data on the first disk). As a result, a subsequent parity scrub will not detect the bad data.
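The masking scenario above can be traced step by step. This is a sketch under simplifying assumptions: four-byte strips, a dropped write as the UDE, and hypothetical variable names; the disks are modeled as a Python list.

```python
from functools import reduce


def xor_parity(strips):
    """XOR equal-length byte strips together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Contents of the four data disks before the failing write.
on_disk = [b"old0", b"old1", b"old2", b"old3"]

# A write of b"new0" to the first disk is dropped (UDE): the disk keeps
# b"old0", but parity is updated as if the write had succeeded.
parity = xor_parity([b"new0", b"old1", b"old2", b"old3"])
scrub_before = xor_parity(on_disk) == parity  # a scrub now would flag the stripe

# A later write to the second, third and fourth disks is promoted to a full
# stripe write: the first disk is read (returning stale data) and parity is
# recomputed over everything.
on_disk[1:] = [b"new1", b"new2", b"new3"]
parity = xor_parity(on_disk)
scrub_after = xor_parity(on_disk) == parity  # the stripe now looks consistent
```

`scrub_before` is `False` and `scrub_after` is `True`, while `on_disk[0]` is still `b"old0"`: the stale data is now covered by consistent parity and no later scrub can see it.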
Another example of error propagation occurs when, subsequent to a UDE, a successful and correct write (e.g., using a read-modify-write methodology) occurs to the same location. Such an operation will leave the parity corrupted with the effects of the bad data. In effect, the bad data moves from the disk with the UDE to the parity disk. Such migration effects can occur whenever the bad data is read from the disk in order to perform any write operation to the stripe.
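The migration of bad data into the parity can be illustrated with the usual read-modify-write update, in which the new parity is the old parity XORed with the old and new data. The sketch below assumes a dropped write as the original UDE; strip contents and names are hypothetical.

```python
from functools import reduce


def xor_bytes(*strips):
    """XOR equal-length byte strips together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


data = [b"old0", b"dat1", b"dat2", b"dat3"]

# UDE: a write of b"new0" to disk 0 is dropped, but parity is updated as if
# the write had landed.
parity = xor_bytes(b"new0", *data[1:])

# Later, a correct read-modify-write of b"nxt0" to the same location.
old_read = data[0]                                 # stale b"old0" comes back
parity = xor_bytes(parity, old_read, b"nxt0")      # new parity from old parity
data[0] = b"nxt0"                                  # this write succeeds

correct_parity = xor_bytes(*data)
parity_is_bad = parity != correct_parity
residue = xor_bytes(parity, correct_parity)        # what the parity is off by
```

After the second write every data disk holds correct contents, yet `parity_is_bad` is `True` and `residue` equals `xor_bytes(b"new0", b"old0")`: the difference between the dropped data and the stale data now lives in the parity strip.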
Similar and even more complicated scenarios occur even with higher fault tolerant RAID algorithms such as RAID6. RAID6 is a fault tolerant data storage architecture that can recover from the loss of two storage devices. It achieves this by storing two independent redundancy values for the same set of data. In contrast, RAID5 only stores one redundancy value, the parity.
A parity scrub on a RAID6 array can detect, locate and correct a UDE (assuming no disks have actually failed) but only if no operations were performed on the stripe that may have migrated or hidden the UDE. Parity scrubs are very expensive operations and are typically done sparingly. Consequently, the assumption that no operation has migrated or hidden a UDE before the scrub rarely holds in practice.
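To illustrate how two independent redundancy values allow a scrub to locate a single bad data strip, the sketch below assumes the common P+Q construction over GF(2^8) (reduction polynomial 0x11D, generator 2). The text does not say which RAID6 code is used, so this choice, and all function names, are assumptions. If strip z is off by an error pattern e, the P syndrome equals e and the Q syndrome equals g^z times e, so z falls out of the ratio of the two.

```python
# Exponent and logarithm tables for GF(2^8), generator 2, polynomial 0x11D.
GF_EXP = [0] * 510
GF_LOG = [0] * 256
_x = 1
for _i in range(255):
    GF_EXP[_i] = _x
    GF_LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    GF_EXP[_i] = GF_EXP[_i - 255]


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements using the log/exp tables."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def compute_pq(strips):
    """P is the XOR of all strips; Q weights strip i by g**i before XORing."""
    length = len(strips[0])
    p, q = bytearray(length), bytearray(length)
    for i, strip in enumerate(strips):
        for j, byte in enumerate(strip):
            p[j] ^= byte
            q[j] ^= gf_mul(GF_EXP[i], byte)
    return bytes(p), bytes(q)


def locate_and_correct(strips, p, q):
    """Repair a single bad data strip in place and return its index.

    Returns None if the stripe is consistent. Raises ValueError when the
    syndromes do not point at exactly one data strip.
    """
    p_now, q_now = compute_pq(strips)
    sp = bytes(a ^ b for a, b in zip(p, p_now))
    sq = bytes(a ^ b for a, b in zip(q, q_now))
    if not any(sp) and not any(sq):
        return None
    candidates = set()
    for s_p, s_q in zip(sp, sq):
        if s_p == 0 and s_q == 0:
            continue  # this byte position is unaffected
        if s_p == 0 or s_q == 0:
            raise ValueError("error is in P or Q, or several strips are bad")
        candidates.add((GF_LOG[s_q] - GF_LOG[s_p]) % 255)
    if len(candidates) != 1:
        raise ValueError("syndromes do not agree on one strip")
    z = candidates.pop()
    if z >= len(strips):
        raise ValueError("computed strip index is out of range")
    strips[z] = bytes(d ^ e for d, e in zip(strips[z], sp))
    return z


strips = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
p, q = compute_pq(strips)
strips[2] = b"stal"                      # simulate a UDE on strip 2
found = locate_and_correct(strips, p, q)
```

In this run `found` is `2` and `strips[2]` is restored to `b"CCCC"`. The location step works only while P and Q still reflect the intended data; once an intervening write has recomputed them from the stale strip, both syndromes are zero and the scrub reports a consistent stripe.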
A location algorithm in the context of RAID6 (or higher fault tolerance) is disclosed in US Patent Application 2006/0248378, “Lost Writes Detection in a Redundancy Group Based on RAID with Multiple Parity.” This location algorithm must be used in conjunction with parity scrubs as an initial detection method. RAID parity scrub methods are incapable of reliably detecting and/or locating and correcting UDEs in an array.
A second subclass of methods for addressing the problem of UDEs within the storage system is based on the write cache within the system. The method described in US Patent Application 2006/0179381, “Detection and Recovery of Dropped Writes in Storage Devices” uses the cache as a holding place for data written to disk. Only after the data is re-read from the disk and verified is the data cleared from the cache. This is an expensive method due to a number of factors.
First, the discussed method requires using valuable cache space that could be used to improve read/write cache performance of the system. Second, it requires a separate read call (at some unspecified time) in order to validate the data on the disk. If that read occurs immediately after the data is written, Off-track Write Errors may not be detected because the head tracking system may not have moved.
If the read occurs when the system needs to clear the cache (e.g., to gain more cache space for another operation), then a pending operation will be delayed until the read and compare occurs. Alternatively, the read could happen at intermediate times, but it will impact system performance with the extra IOs.
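The cache-based approach discussed above can be sketched as a write cache that releases an entry only after a verifying re-read. The classes, the fault-injection flag, and the choice to rewrite from cache on a mismatch are all assumptions for illustration; the cited application is not reproduced here.

```python
class Disk:
    """Toy disk that can be told to silently drop its next write."""

    def __init__(self):
        self.blocks = {}
        self.drop_next_write = False  # fault injection for the example

    def write(self, lba, data):
        if self.drop_next_write:
            self.drop_next_write = False
            return  # reports success, but the platter is unchanged
        self.blocks[lba] = data

    def read(self, lba):
        return self.blocks.get(lba)


class VerifyingWriteCache:
    """Holds written data until a re-read from disk confirms it."""

    def __init__(self, disk):
        self.disk = disk
        self.pending = {}  # lba -> data not yet verified on disk

    def write(self, lba, data):
        self.pending[lba] = data
        self.disk.write(lba, data)

    def verify_and_evict(self, lba) -> bool:
        """Re-read the block; evict on match, otherwise rewrite from cache."""
        expected = self.pending[lba]
        if self.disk.read(lba) == expected:
            del self.pending[lba]
            return True
        self.disk.write(lba, expected)  # assumed recovery step
        return False


disk = Disk()
cache = VerifyingWriteCache(disk)
disk.drop_next_write = True
cache.write(7, b"payload")
first_check = cache.verify_and_evict(7)    # mismatch, data rewritten
second_check = cache.verify_and_evict(7)   # now verified and evicted
```

`first_check` is `False` and `second_check` is `True`. Each cached block costs an extra read before its space can be reused, which is the overhead the text describes.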
A third subclass uses some form of metadata to manage the correctness of the data. The metadata is stored in memory and possibly on separate disks or arrays from the arrays the metadata represents. For example, US Patent Application 2005/0005191 A1, “System and Method for Detecting Write Errors in a Storage Device,” discloses a method for UDE detection. A checksum and a sequence number for each block in a set of consecutive data blocks are stored in an additional data block appended immediately after the set. A second copy is stored in memory for the entire collection of blocks on the disk, and this copy is periodically flushed to disk (which necessarily is a different disk) and preferably is stored on two disks for fault tolerance.
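A rough sketch of how a checksum and sequence number kept in two places might expose a dropped write follows. It is not the method of the cited application: the read-time comparison, the CRC32 checksum, the fault-injection flag, and every class and field name are assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Meta:
    checksum: int
    seq: int


class StaleDataError(Exception):
    """Raised when on-disk data or metadata disagrees with the in-memory copy."""


class Disk:
    def __init__(self, n_blocks):
        self.blocks = [b""] * n_blocks
        # Stand-in for the metadata block appended after the data blocks.
        self.meta = [Meta(zlib.crc32(b""), 0) for _ in range(n_blocks)]
        self.drop_next_write = False  # fault injection for the example


class Controller:
    def __init__(self, disk):
        self.disk = disk
        # Second copy of the metadata, held in memory.
        self.mem = [Meta(m.checksum, m.seq) for m in disk.meta]

    def write(self, i, data):
        new = Meta(zlib.crc32(data), self.mem[i].seq + 1)
        self.mem[i] = new
        if self.disk.drop_next_write:
            self.disk.drop_next_write = False
            return  # disk reports success, nothing reaches the platter
        self.disk.blocks[i] = data
        self.disk.meta[i] = Meta(new.checksum, new.seq)

    def read(self, i):
        data, meta = self.disk.blocks[i], self.disk.meta[i]
        if zlib.crc32(data) != meta.checksum or meta != self.mem[i]:
            raise StaleDataError(f"block {i} failed verification")
        return data


disk = Disk(4)
ctl = Controller(disk)
ctl.write(0, b"v1")
first = ctl.read(0)

disk.drop_next_write = True
ctl.write(0, b"v2")          # silently dropped
try:
    ctl.read(0)
    detected = False
except StaleDataError:
    detected = True
```

`first` is `b"v1"` and `detected` is `True`. Note that the stale block and its on-disk checksum still agree with each other; it is the separately held sequence number and checksum that reveal the missing update, at the price of the extra metadata storage and IOs.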
A related scheme is found in U.S. Pat. No. 6,934,904, “Data Integrity Error Handling in a Redundant Storage Array” where only checksums are used, but no particular rule is defined for the storage of the primary checksum. US Patent Application 2003/0145279, “Method for using CRC as Metadata to Protect Against Drive Anomaly Errors in a Storage Array” discloses a similar checksum algorithm for detection together with a location algorithm.
The above schemes suffer from the problems of high disk overhead and the additional IOs required to manage and preserve the checksum/sequence number data. Other examples of the third subclass are disclosed in U.S. Pat. No. 7,051,155, “Method and System for Striping Data to Accommodate Integrity Metadata.”
The fourth subclass of storage based UDE detectors is similar to the third subclass in that the fourth subclass also uses some form of metadata to verify correctness of data read from disk. However, in the fourth subclass, the metadata is kept within the array and is collocated with the data or the parity in the array. For example, U.S. Pat. No. 7,051,155, “Method and System for Striping Data to Accommodate Integrity Metadata” discloses an embodiment where one copy of the stripe metadata is stored within the stripe.
The above scheme provides a significant performance advantage when the system performs a read-modify-write to update data in the stripe. The method described in US Patent Application US2004/0123032, “Method for Storing Integrity Metadata in Redundant Data Layouts” uses extra sectors adjacent to the sectors of the parity strip(s) to store the metadata for the data chunks in the stripe. This method includes use of a generation number on the metadata, stored in NVRAM in order to verify the contents of the metadata.
Other examples of the fourth subclass include the methods applicable to RAID5 arrays that are described in U.S. Pat. No. 4,761,785, “Parity Spreading to Enhance Storage Access;” US Patent Application 2006/0109792 A1, “Apparatus and Method to Check Data Integrity When Handling Data;” and U.S. Pat. No. 7,051,155, “Method and System for Striping Data to Accommodate Integrity Metadata.”
In some disk storage systems, metadata is stored in non-volatile random access memory (NVRAM) or on rotating disks. The former has significant cost and board layout issues to accommodate the total volume of metadata that must be stored and managed, as well as the means to maintain the memory in non-volatile state. Furthermore, such memory takes a lot of motherboard real estate and this can be problematic.
Particularly, in fault tolerant storage systems, with at least two coordinated controllers, the NVRAM must be shared between the two controllers in a reliable manner. This introduces complex shared memory protocols that are difficult to implement and/or have performance penalties. Rotating disks, on the other hand, have significant performance penalties and reliability issues. That is, a rotating disk has very high latency compared to memory, so accessing (e.g., reading or writing) the metadata can have a significant performance impact on the overall system.
Additionally, rotating disks have a fairly low reliability record compared to memory, yet vital metadata needs to be stored at least as reliably as the data it represents. For example, when data is stored in a RAID6 array, wherein two disk losses may be tolerated, the metadata should also be stored in a manner that can survive two disk losses.
Unfortunately, the above requirements impose significant additional costs and performance impacts. Moreover, the above-mentioned classes and subclasses for detecting and correcting UDEs are, in many circumstances, either inefficient or ineffective in uncovering sufficient details about a read or write error to help locate and fix a problem. Thus, data recovery methods and systems are needed that can overcome the aforementioned shortcomings.