1. Field of the Invention
This invention relates to error detection in storage systems.
2. Description of the Related Art
Many storage arrays provide protection against data loss by storing redundant data. Such redundant data may include parity information (e.g., in systems using striping) or additional copies of data (e.g., in systems providing mirroring). A storage system's ability to reconstruct lost data may depend on how many failures occur before the attempted reconstruction. For example, some RAID (Redundant Array of Independent/Inexpensive Disks) systems may only be able to tolerate a single disk failure or error. Once a single disk fails or loses data through an error, such systems are said to be operating in a degraded mode: if additional disks fail before the lost data on the failed or erroneous disk has been reconstructed, it may no longer be possible to reconstruct the lost data. The longer a storage array operates in a degraded mode, the more likely it is that an additional failure will occur. As a result, it is desirable to detect and repair disk failures or other anomalies promptly so that a storage array spends as little time as possible operating in a degraded mode.
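By way of illustration only, the single-failure tolerance described above can be sketched for a stripe protected by one XOR parity block. The layout (three data blocks plus one parity block) and the names xor_blocks and reconstruct are assumptions made for this example and are not drawn from any particular storage array.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(stripe, parity):
    """Rebuild the one missing block (marked None) from the survivors and parity."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        # A single parity block carries enough information for one lost block only.
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_blocks(survivors + [parity])
    return rebuilt

# Hypothetical stripe: three data blocks and their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One disk lost: the stripe is degraded but still recoverable.
degraded = [data[0], None, data[2]]
restored = reconstruct(degraded, parity)
```

In this sketch, restored equals the original data, whereas calling reconstruct on a stripe with two missing blocks raises ValueError, which corresponds to a second failure occurring while the array is still in a degraded mode.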
Errors that may cause a storage system to operate in a degraded mode include transmission errors, total disk failures, and disk errors. Transmission and disk errors may cause less data vulnerability or data loss than total disk failures, but they may be more difficult to detect. For example, disk drives may occasionally corrupt data, and this corruption may not be detected by the storage system until the data is read from the disk. The corruption may occur for various reasons. For example, bugs in a disk drive controller's firmware may cause bits in a sector to be modified or may cause blocks to be written to the wrong address. Such bugs may cause storage drives to write the wrong data, to write the correct data to the wrong place, or to not write any data at all. Another source of errors may be a drive's write cache. Many disk drives use write caches to quickly accept write requests so that the host or array controller can continue with other commands. The data is later copied from the write cache to the disk media. However, write cache errors may cause some acknowledged writes to never reach the disk media. The end result of such bugs or errors is that the data at a given block may be corrupted or stale. Errors such as drive errors and transmission errors may be “silent” in the sense that no error messages are generated when such errors occur.
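As a hedged illustration of how such silent errors might surface only when data is read, the following sketch keeps a CRC-32 checksum, computed over a block's address and contents, apart from the block itself. This is one generic technique and is not asserted to be the mechanism of any particular system. The class CheckedStore, its lose_write flag, and the in-memory dictionaries standing in for disk media are all assumptions made for the example.

```python
import zlib

def checksum(address, data):
    """CRC-32 over the block address and its contents."""
    return zlib.crc32(address.to_bytes(8, "big") + data)

class CheckedStore:
    """Toy block store; dictionaries stand in for disk media (hypothetical)."""

    def __init__(self):
        self.media = {}      # block contents as actually held by the "drive"
        self.checksums = {}  # checksums kept separately from the data blocks

    def write(self, address, data, lose_write=False):
        self.checksums[address] = checksum(address, data)
        if not lose_write:
            self.media[address] = data
        # With lose_write=True the write is acknowledged without reaching
        # the media, mimicking a write cache error; nothing is reported.

    def read(self, address):
        data = self.media.get(address, b"")
        if checksum(address, data) != self.checksums.get(address):
            raise IOError(f"block {address}: corrupted or stale data")
        return data

store = CheckedStore()
store.write(7, b"old")
first_read = store.read(7)                 # verifies cleanly
store.write(7, b"new", lose_write=True)    # silently lost write
try:
    store.read(7)
    stale_detected = False
except IOError:
    stale_detected = True                  # stale block found only on read
```

Including the address in the checksum lets the same check flag correct data written to the wrong place, and keeping the checksum apart from the block lets it flag a stale block whose contents are internally consistent. In both cases the error goes unnoticed until the block is read.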
In general, it is desirable to detect errors soon after they occur so that a storage system is not operating in a degraded mode for an extended time. However, error detection mechanisms are often expensive to implement (e.g., if they require a user to purchase additional or more expensive hardware and/or software) and/or have a detrimental impact on storage system performance. Thus, it may be desirable to allow users to select whether to purchase the error detection mechanism independently of the overall system and/or to allow users to independently enable and disable the error detection mechanism.