Individual blocks, or sequences or lists of blocks, on a persistent storage device (e.g., a hard disk drive) can become corrupted for various reasons. Software defects in any of the layers involved in the I/O path can corrupt blocks (e.g., a defect that causes the I/O to be directed to the wrong block). Blocks can also become corrupted for electro-mechanical reasons, such as media degradation (e.g., bit rot, where the magnetic material decays over time) or hard disk head alignment problems (e.g., resulting in data being written to the wrong blocks). In some cases, blocks can become corrupted or lost due to user error (e.g., when blocks are inadvertently overwritten or accidentally lost).
Legacy hard disk drives have the capability to remap bad sectors on disks when processing a write command to write to a given block. However, in legacy implementations, the firmware for the hard disks does not have any capability to recognize that data in a particular block on a disk has gone bad, at least not until a process reads the bad block. Some high-end storage arrays employ a technique called “disk scrubbing”, which involves periodically reading all of the blocks of the disk in an attempt to recognize bad blocks during the scrubbing process, rather than waiting until some other process experiences a read error (e.g., if/when a corrupted block is read). Some disk scrubbers have the capability to restore bad blocks to an uncorrupted state by retrieving the data from a redundant copy (e.g., from a mirror site) and writing the uncorrupted data to a good block, possibly also marking the corrupted block as a bad block so that no further data writes to the bad block are attempted.
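The scrubbing pass described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular storage array: the `Disk` class, the `scrub` function, and the externally stored CRC-32 table are all hypothetical, and a real scrubber would typically rely on the drive's own read errors rather than a separate checksum table. The sketch repairs in place and omits remapping to a different physical block.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Disk:
    """Hypothetical in-memory stand-in for a block device."""
    blocks: dict                                      # block number -> bytes
    unrecoverable: set = field(default_factory=set)   # blocks with no clean copy


def scrub(primary, mirror, checksums):
    """Read every block on `primary` and repair corrupted ones from `mirror`.

    `checksums` maps block number -> expected CRC-32 (an assumption of this
    sketch). Returns the list of block numbers that were repaired.
    """
    repaired = []
    for block_no in sorted(primary.blocks):
        expected = checksums[block_no]
        if zlib.crc32(primary.blocks[block_no]) == expected:
            continue  # block reads back clean
        good = mirror.blocks.get(block_no)
        if good is None or zlib.crc32(good) != expected:
            # The redundant copy is missing or also corrupted.
            primary.unrecoverable.add(block_no)
            continue
        primary.blocks[block_no] = good  # restore from the redundant copy
        repaired.append(block_no)
    return repaired
```

Because the pass visits every block, corruption in rarely read blocks is found during the scrub rather than at the next application read.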
However, in some situations (e.g., when the redundancy is managed/maintained by a host-based volume manager), the aforementioned disk scrubbing technique does not work. In such situations, applications are left with the responsibility of recovering from bad blocks. Yet, in many real-world situations, a latent bad block may go undetected for a long period of time, especially in a write-once scenario such as backup or archival of data. Still worse, the existence of latent, undetected, corrupted blocks in the system can lead to serious data loss when a failure causes those blocks to be restored from a redundant/archived copy of the data on the false assumption that the restored copy is uncorrupted.
Even in high availability systems, corruption recovery techniques are not triggered until after corruption has been discovered by the application. However, as noted above, in applications where the data is written once and read very infrequently, any latent corruption (e.g., physical corruption and/or logical corruption) can go undetected for a long period of time. More particularly, backup and recovery data (e.g., data needed to recover the system from a catastrophic failure) tends to be written once and subsequently read very infrequently, so the existence of corrupted blocks can render the entire system highly vulnerable to a complete outage.
Moreover, the aforementioned disk scrubbing technologies do not have the capabilities to recognize logically corrupted blocks, and legacy solutions for recovering from logical corruptions do not decrease the potential for complete data loss in the event of other failures in the system.
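To illustrate why a device-level scrub can miss logical corruption, consider a block that is misdirected to the wrong address by a defect in the I/O path. The bytes read back cleanly, so only a check that understands the block's contents can flag it. The sketch below is purely illustrative and is not drawn from this document: the header layout, `make_block`, and `check_block` are hypothetical, and labeling a checksum mismatch as "physical" is a simplification.

```python
import struct
import zlib

# Hypothetical header: 8-byte block address followed by 4-byte CRC-32 of payload.
HEADER = struct.Struct(">QI")


def make_block(address, payload):
    """Build a block that records its own intended address and payload checksum."""
    return HEADER.pack(address, zlib.crc32(payload)) + payload


def check_block(expected_address, raw):
    """Classify a block read from `expected_address` as ok, physical, or logical."""
    address, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if zlib.crc32(payload) != crc:
        return "physical"  # the stored bytes themselves are damaged
    if address != expected_address:
        return "logical"   # intact bytes that belong to a different block
    return "ok"


block = make_block(7, b"row data")
print(check_block(7, block))  # block found where it belongs
print(check_block(9, block))  # same bytes found at the wrong address
```

In the second call the payload checksum still matches, so a check limited to readability or payload integrity would report the block as healthy.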
Therefore, there is a need for an improved approach for implementing early detection of logical corruption in persistent storage devices that addresses at least these problems.