An established technology that provides large disk capacity with very high reliability is a redundant array of independent disk drives (RAID). RAID uses multiple physically separate disk drives that act as a single drive and are all accessed through a single array controller. The physically separate disk drives may also appear as multiple virtual drives. There is typically more than one controller in a system for redundancy, although normally only one controller at a time can access data in a volume. For data reliability, a parity block is derived from related data blocks of the various disk drives through exclusive-or (XOR) operations, permitting the data of a failed disk drive to be rebuilt by processing the data of the other drives along with the parity block. Data may be stored as a sector, a segment, a stripe, or a volume, across the disk drives of a RAID system.
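The XOR relationship between data blocks and the parity block can be illustrated with a minimal sketch. The helper names (`xor_blocks`, `rebuild_block`) and the four-byte blocks are illustrative assumptions, not part of any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_block(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Three data drives, each contributing one (tiny, hypothetical) block to a stripe.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate the loss of the second drive and rebuild its block.
rebuilt = rebuild_block([data[0], data[2]], parity)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block. This only works when the failed drive is known, which is why identifying the drive in error matters for the anomaly cases discussed below.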
In extremely rare instances, a disk drive may return incorrect data without indicating that an error has occurred. These types of errors can occur both when writing data to, and reading data from, media. For example, the drive may write data to the wrong physical address. Similarly, the drive may read data from the wrong physical address. There are other types of drive anomaly errors, but they all share a common characteristic: the initiator of any read or write request should assume that part or all of the data may be incorrect even though there is no error indication from the drive.
In a high availability storage array, it is desirable to detect and, if possible, recover from drive anomaly errors. In general, data recovery is possible through RAID parity schemes as long as the storage array is able to identify the drive in error.
It is relatively common to provide data path protection through the use of CRC alone, but such a scheme does not recover from drive anomalies such as dropped or mis-directed write operations. Data path protection schemes typically format the drives to a larger sector size (such as 520 bytes), and store the CRC for each 512 bytes of user data in the larger sector.
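The following sketch shows why a CRC alone cannot catch a dropped write. It assumes a 520-byte sector laid out as 512 bytes of user data followed by a 4-byte CRC-32 and 4 unused bytes; that trailer layout and the function names are assumptions made for illustration only.

```python
import struct
import zlib

USER_BYTES = 512
SECTOR_BYTES = 520  # 512 bytes of user data plus an 8-byte trailer (assumed layout)

def format_sector(user_data):
    """Append a CRC-32 of the user data, padded to fill the 8-byte trailer."""
    if len(user_data) != USER_BYTES:
        raise ValueError("user data must be exactly 512 bytes")
    return user_data + struct.pack("<I4x", zlib.crc32(user_data))

def check_sector(sector):
    """Return True when the stored CRC matches the user data in the sector."""
    user_data, trailer = sector[:USER_BYTES], sector[USER_BYTES:]
    (stored_crc,) = struct.unpack("<I4x", trailer)
    return zlib.crc32(user_data) == stored_crc

old = format_sector(b"\x01" * USER_BYTES)
new = format_sector(b"\x02" * USER_BYTES)

# Dropped write: the drive reports success for `new` but the media still holds `old`.
disk = old
stale_but_valid = check_sector(disk)

# A corrupted byte, by contrast, is caught by the CRC.
corrupted = bytes([old[0] ^ 0xFF]) + old[1:]
corruption_detected = not check_sector(corrupted)
```

The stale sector passes the check because its data and CRC are still self-consistent; nothing in the sector records that a newer write was expected.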
One approach to solving the problem is to store write sequence tracking metadata interleaved with user data. During write operations, the sequence information is stored as metadata on two separate drives, and it is this duplication that provides the anomaly protection. The data and metadata are interleaved for performance reasons. Interleaving allows the data and associated metadata on each drive to be either written or read with a single I/O operation. If the sequence information on the data drive differs from that on the parity drive, the sequence information can be used to determine which drive is in error. If the data drive is in error, the data is extracted from the parity drive via reconstruction techniques.
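A minimal sketch of the comparison step follows. It assumes that sequence numbers only increase, that wraparound can be ignored, and that the drive holding the lower number is therefore the one that missed a write; the function name `locate_stale_drive` is hypothetical.

```python
def locate_stale_drive(data_seq, parity_seq):
    """Compare the two stored copies of a block's write sequence number.

    Returns None when the copies agree, otherwise the name of the drive
    assumed to have missed a write (the one holding the lower number).
    """
    if data_seq == parity_seq:
        return None
    return "data" if data_seq < parity_seq else "parity"

consistent = locate_stale_drive(7, 7)       # both drives saw write 7
data_dropped = locate_stale_drive(6, 7)     # data drive is still at write 6
parity_dropped = locate_stale_drive(7, 6)   # parity drive is still at write 6
```

When the data drive is the stale one, the controller would then reconstruct the block from the parity and the remaining data drives rather than trust the data drive's contents.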
With this approach, writes are tracked at two levels of granularity. The first level is when the scope of a write operation is limited to an individual drive plus the associated parity drive. In this case, the level of granularity is a data block such as the cache block size used to manage the storage controller's data cache. A data block may be as large as a segment (i.e., the amount of data from one drive of a stripe) or as small as a sector (i.e., a logical block forming part of a segment). Each data block within a data stripe has its own separate revision number. The revision numbers of all data blocks are stored on the associated parity drive.
In this case, a data block is the unit of data for which metadata is managed (one or more sectors). The size is chosen based on a trade-off between disk capacity utilization and performance. The easiest way to manage metadata is by placing all metadata (write sequence information plus CRC) for a given data block in a single sector. The smaller the data block, the more disk space is used for metadata. On the other hand, as the data block size increases, it increases the likelihood that the host I/O size will be smaller than the data block size, which means, for example, that extra data will have to be read and discarded on read operations. The controller's data cache block size is likely to vary based on the same considerations, so it is convenient to link the metadata data block management size to the cache block size.
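The trade-off can be made concrete with a small calculation. The sketch assumes one metadata sector per data block, reads that start on a data block boundary, and reads that must cover whole data blocks; both function names are illustrative.

```python
def metadata_overhead(sectors_per_block):
    """Fraction of disk space used by metadata, assuming one metadata sector per data block."""
    return 1 / (sectors_per_block + 1)

def discarded_sectors(host_io_sectors, sectors_per_block):
    """Sectors read and thrown away when a block-aligned read must cover whole data blocks."""
    blocks_read = -(-host_io_sectors // sectors_per_block)  # ceiling division
    return blocks_read * sectors_per_block - host_io_sectors

small_block_overhead = metadata_overhead(1)   # 1 metadata sector per 1 data sector
large_block_overhead = metadata_overhead(7)   # 1 metadata sector per 7 data sectors
wasted_on_small_read = discarded_sectors(1, 8)
wasted_on_aligned_read = discarded_sectors(8, 8)
```

Under these assumptions, one-sector data blocks spend half the disk on metadata, while larger blocks reduce the overhead but force a one-sector host read to fetch and discard the rest of the block.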
The second level of granularity is when all data blocks within a stripe are written. Each storage controller maintains a monotonically increasing value that is used to track full stripe writes on a storage controller basis. Tracking full stripe writes separately allows the controller to avoid having to perform a read-modify-write function on all of the associated data block revision numbers. When a full stripe write occurs, all data block revision numbers are initialized to a known value.
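The two levels of tracking can be sketched as follows. The class names, the choice of 0 as the known initial value, and the in-memory representation of the parity drive's metadata are all assumptions made for illustration.

```python
INITIAL_REVISION = 0  # assumed "known value" written on a full stripe write

class ParityMetadata:
    """Write-tracking metadata held on the parity drive for one stripe (hypothetical layout)."""
    def __init__(self, block_count):
        self.stripe_write_id = 0
        self.revisions = [INITIAL_REVISION] * block_count

class Controller:
    """Tracks full stripe writes with a per-controller, monotonically increasing counter."""
    def __init__(self):
        self.full_stripe_counter = 0

    def partial_write(self, meta, block_index):
        # Level 1: only one data block changes, so bump just its revision number.
        meta.revisions[block_index] += 1
        return meta.revisions[block_index]

    def full_stripe_write(self, meta):
        # Level 2: every block is rewritten, so there is no need to read the old
        # revision numbers; stamp the stripe and reset them all to the known value.
        self.full_stripe_counter += 1
        meta.stripe_write_id = self.full_stripe_counter
        meta.revisions = [INITIAL_REVISION] * len(meta.revisions)

controller = Controller()
meta = ParityMetadata(block_count=4)
controller.partial_write(meta, 2)
controller.partial_write(meta, 2)
after_partial = list(meta.revisions)
controller.full_stripe_write(meta)
after_full = list(meta.revisions)
```

The partial write path modifies a single revision number in place, whereas the full stripe path overwrites the whole set without first reading it, which is the read-modify-write the text says is avoided.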
In order to provide complete data integrity protection, the write sequence tracking scheme must be implemented in conjunction with CRC or another form of error detection and correction code that provides data integrity assurance at a byte level, to protect against drive anomaly errors in which the majority of data in the sector or sectors is correct. The CRC information can be stored as metadata along with the write sequence tracking information.
The write sequence tracking mechanism has the limitation that drive anomaly errors are either unrecoverable or undetectable for certain failure modes. The problem arises from the fact that the data block revision numbers protect the data drives, but not the parity drive. Since data block revision numbers for all data blocks across the stripe are stored in the same sector on the parity drive for space considerations, a dropped write to the parity drive may not be detected prior to the parity drive being updated for another data block in the stripe.
In addition, the write sequence tracking mechanism is relatively complex to implement. The metadata has self-identification features that must be managed. In a system with redundant storage controllers, the sequence number management must also account for transfer of volume ownership between the controllers.
Finally, there is a certain lack of flexibility in how the metadata is managed since it is typically linked to the cache block size used to manage the storage controller's data cache.
Therefore, it would be desirable to provide a system and method for recovering from disk drive anomalies which offers simplicity, symmetrical protection for data and parity drives, and flexible management for the metadata.