Data corruption is a major problem in large-scale data storage systems and in data transmission systems. In the short term, corrupted data can cause applications to return erroneous results or to fail outright. Over the long term, the corrupted data may be replicated through multiple systems. In many instances, the correct data may be recoverable if the corruption is detected and its cause is determined.
Data corruption may occur due to anomalies in the input/output (I/O) datapath, including errors introduced by servers, networks, device interconnects, and storage systems. Three classes of data errors are particularly difficult to detect: bit corruption, misdirected I/Os, and phantom I/Os. Bit corruption occurs when bits in a data block are erroneously changed or lost in the datapath. A misdirected I/O occurs when the storage system reads or writes the wrong block. Phantom I/Os take two forms. A phantom write occurs when the storage system acknowledges writing a data block but the block is not actually written, leaving old data on the storage device. A phantom read occurs when the storage system sends erroneous data through the datapath in response to a read command, typically due to errors in the storage system controller or in the storage device itself.
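The write-side error classes described above can be illustrated with a minimal sketch. The `ToyDisk` class, its `fault` parameter, and the `demo_faults` function are hypothetical names introduced here purely for illustration; they do not model any real storage system. The sketch shows that in each case the write is acknowledged as successful while the device contents differ from what was intended.

```python
class ToyDisk:
    """Hypothetical in-memory block device; illustration only."""

    def __init__(self, num_blocks, block_size=8):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write(self, lba, data, fault=None):
        if fault == "phantom":
            # Phantom write: acknowledged, but nothing reaches the device.
            return True
        if fault == "misdirected":
            # Misdirected write: data lands in the wrong block.
            lba = (lba + 1) % len(self.blocks)
        if fault == "bitflip":
            # Bit corruption: one bit of the first byte is changed in transit.
            data = bytes([data[0] ^ 0x01]) + data[1:]
        self.blocks[lba] = data
        return True  # every path reports success to the caller

    def read(self, lba):
        return self.blocks[lba]


def demo_faults():
    """Write old data to block 2, then overwrite it under each fault."""
    results = {}
    for fault in ("bitflip", "misdirected", "phantom"):
        disk = ToyDisk(4)
        disk.write(2, b"OLDDATA0")
        acked = disk.write(2, b"NEWDATA1", fault=fault)
        # Record the acknowledgement, the target block, and its neighbor.
        results[fault] = (acked, disk.read(2), disk.read(3))
    return results
```

In this sketch, the phantom and misdirected writes both leave the old data in block 2, and the misdirected write additionally overwrites block 3, yet the caller receives the same acknowledgement in every case.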
Integrity metadata, such as checksums and replicated data, may be used to detect the three classes of errors. Typically, however, each component in the datapath associates its own form of integrity metadata with the data only after the component receives the data. Such metadata therefore covers only a portion of the datapath, and data errors that occur before the data reaches a component may not be detected. Even if data corruption is detected, the source of the corruption can be difficult or impossible to identify because of the discontinuity of coverage of the metadata.
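The coverage gap described above can be sketched as follows. The example assumes a CRC-32 checksum (via Python's `zlib.crc32`) as the integrity metadata and a single-bit error injected upstream of the receiving component; the function names are hypothetical. The second function, which computes the checksum where the data originates, is a contrast case added for illustration only and is not a method described in the text.

```python
import zlib


def attach_checksum(data: bytes):
    """Return the data together with its CRC-32 checksum."""
    return data, zlib.crc32(data)


def verify(data: bytes, checksum: int) -> bool:
    """Check that the data still matches its checksum."""
    return zlib.crc32(data) == checksum


def corrupt(data: bytes) -> bytes:
    """Simulate bit corruption by flipping one bit of the first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]


def per_component_path(payload: bytes):
    """Checksum is attached only when the component receives the data."""
    received = corrupt(payload)               # error occurs upstream
    stored, crc = attach_checksum(received)   # checksum covers corrupted data
    return verify(stored, crc), stored == payload


def origin_checksum_path(payload: bytes):
    """Contrast case (assumption): checksum attached before transmission."""
    data, crc = attach_checksum(payload)
    received = corrupt(data)                  # same upstream error
    return verify(received, crc), received == payload
```

With these assumptions, `per_component_path(b"payload!")` returns `(True, False)`: verification passes even though the stored data no longer matches the original, because the checksum was computed after the corruption occurred. The contrast case returns `(False, False)`, since the checksum was computed before the error was introduced.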