Typical large-scale data storage systems today include one or more dedicated computers and software systems to manage data. A primary concern of such data storage systems is data corruption and recovery. Silent data corruption occurs when the data storage system returns erroneous data and does not realize that the data is wrong. Silent data corruption may result from hardware failures, such as a malfunctioning data bus or corruption of the magnetic storage media, that may cause a data bit to be inverted or lost. Silent data corruption may also result from a variety of other causes; in general, the more complex the data storage system, the more possible causes of silent data corruption.
Silent data corruption is particularly problematic. For example, when an application requests data and receives the wrong data, the application may crash. Additionally, the application may pass the corrupted data along to other applications. If left undetected, these errors may have disastrous consequences (e.g., irreparable, undetected, long-term data corruption).
The problem of detecting silent data corruption is addressed by creating integrity metadata (data pertaining to data) for each data block. Integrity metadata may include a block address to verify the location of the data block, or a checksum to verify the contents of a data block.
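As a rough illustration of the two kinds of integrity metadata mentioned above, the following Python sketch pairs a data block with a record holding its expected block address and a checksum of its contents. The names IntegrityMetadata, make_metadata, and verify_block, the 512-byte block size, and the choice of CRC-32 are assumptions made for this example only; no particular layout or algorithm is prescribed here.

```python
import zlib
from dataclasses import dataclass

BLOCK_SIZE = 512  # assumed block size in bytes, for illustration


@dataclass(frozen=True)
class IntegrityMetadata:
    """Hypothetical per-block metadata record (illustrative only)."""
    block_address: int  # where the block is supposed to be located
    checksum: int       # CRC-32 of the block contents (assumed algorithm)


def make_metadata(block_address: int, data: bytes) -> IntegrityMetadata:
    """Build the metadata record at the time the block is written."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data block must be exactly BLOCK_SIZE bytes")
    return IntegrityMetadata(block_address, zlib.crc32(data))


def verify_block(requested_address: int, data: bytes,
                 meta: IntegrityMetadata) -> str:
    """Classify a block that was read back from storage.

    Returns "ok", "wrong-location", or "wrong-contents".
    """
    # The address check catches a block returned from the wrong location.
    if meta.block_address != requested_address:
        return "wrong-location"
    # The checksum check catches contents that changed after being written.
    if zlib.crc32(data) != meta.checksum:
        return "wrong-contents"
    return "ok"
```

In this sketch the address check and the checksum check are kept separate so that a caller can distinguish a misplaced block from a block whose contents were altered.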
A checksum is a numerical value derived through a mathematical computation on the data in a data block. When data is stored, a numerical value is computed and associated with the stored data. When the data is subsequently read, the same computation is applied to the data. If an identical checksum results, the data is assumed to be uncorrupted. Checksum algorithms are designed to minimize the probability that the checksum and its associated data will be corrupted in the same way. The strength of a checksum depends on how likely it is that a data block experiencing a typical type of error will not result in a data block with an identical checksum.
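The store-then-verify cycle and the notion of checksum strength can be sketched in Python as follows. The two algorithms shown, a simple additive checksum and CRC-32 from the standard zlib module, are chosen here only for contrast and are assumptions of this example; the helper names are hypothetical.

```python
import zlib


def additive_checksum(data: bytes) -> int:
    """Weak checksum: sum of all byte values modulo 2**16."""
    return sum(data) % 65536


def crc32_checksum(data: bytes) -> int:
    """Stronger checksum: CRC-32 as computed by zlib."""
    return zlib.crc32(data)


def is_intact(data: bytes, stored_checksum: int, checksum_fn) -> bool:
    """Recompute the checksum on read and compare it with the stored value."""
    return checksum_fn(data) == stored_checksum


# "Write" a 512-byte block and record both checksums alongside it.
original = bytes(range(256)) * 2
stored_sum = additive_checksum(original)
stored_crc = crc32_checksum(original)

# Error 1: a single inverted bit in byte 10.
flipped = bytearray(original)
flipped[10] ^= 0x04

# Error 2: two adjacent bytes exchanged (same bytes, different order).
swapped = bytearray(original)
swapped[0], swapped[1] = swapped[1], swapped[0]

# Both checksums notice the inverted bit.
bit_flip_caught_by_sum = not is_intact(bytes(flipped), stored_sum, additive_checksum)
bit_flip_caught_by_crc = not is_intact(bytes(flipped), stored_crc, crc32_checksum)

# Only CRC-32 notices the exchanged bytes; the additive sum is unchanged.
swap_caught_by_sum = not is_intact(bytes(swapped), stored_sum, additive_checksum)
swap_caught_by_crc = not is_intact(bytes(swapped), stored_crc, crc32_checksum)
```

The additive checksum is blind to reordering because addition is commutative, so for that class of error the corrupted block yields an identical checksum. This is the sense in which one checksum is weaker than another.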
The issue of where to store the integrity metadata arises. For example, a typical checksum together with other integrity metadata may require 8-16 bytes. Typical data storage systems using block-based protocols (e.g., SCSI) store data in blocks of 512 bytes in length, so that all input/output (I/O) operations take place in 512-byte blocks (sectors). One approach is simply to extend the block so that the checksum may be included. Thus, instead of data blocks of 512 bytes in length, the system uses data blocks of 520 or 528 bytes in length, depending on the size of the checksum. This approach has several drawbacks. The extended data block method requires that every component of the data storage system, from the processing system, through a number of operating system software layers and hardware components, to the storage medium, be able to accommodate the extended data block. Data storage systems are frequently composed of components from a number of manufacturers. For example, while the processing system may be designed for an extended block size, it may be using software that is designed for a 512-byte block. Additionally, for large existing data stores that use a 512-byte data block, switching to an extended block size may entail unacceptable transition costs and logistical difficulties.
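The extended-block approach and its main drawback can be illustrated with a short Python sketch. The 8-byte trailer layout (a 4-byte block address followed by a 4-byte CRC-32), the function names, and the legacy_write stand-in are all hypothetical and serve only to make the size arithmetic concrete.

```python
import struct
import zlib

SECTOR_SIZE = 512
# Hypothetical trailer: 4-byte block address + 4-byte CRC-32, big-endian.
METADATA_FORMAT = ">II"
METADATA_SIZE = struct.calcsize(METADATA_FORMAT)  # 8 bytes
EXTENDED_SIZE = SECTOR_SIZE + METADATA_SIZE       # 520 bytes


def pack_extended_block(block_address: int, data: bytes) -> bytes:
    """Append the integrity metadata to a 512-byte data block."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("data must be exactly 512 bytes")
    trailer = struct.pack(METADATA_FORMAT, block_address, zlib.crc32(data))
    return data + trailer


def unpack_extended_block(raw: bytes):
    """Split a 520-byte block and verify its checksum."""
    if len(raw) != EXTENDED_SIZE:
        raise ValueError("expected a 520-byte extended block")
    data, trailer = raw[:SECTOR_SIZE], raw[SECTOR_SIZE:]
    block_address, checksum = struct.unpack(METADATA_FORMAT, trailer)
    if zlib.crc32(data) != checksum:
        raise ValueError("checksum mismatch")
    return block_address, data


def legacy_write(raw: bytes) -> None:
    """Stand-in for a component that accepts only 512-byte sectors."""
    if len(raw) != SECTOR_SIZE:
        raise ValueError(f"unsupported block size: {len(raw)}")
```

Passing the result of pack_extended_block to legacy_write raises an error, which mirrors the drawback described above: a single component built for 512-byte blocks is enough to make the extended format unusable along that path.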