Data storage utilization is continually increasing, causing storage systems to proliferate in data centers. Hard disk drives are the primary storage media in enterprise environments. Despite the central role of hard disks in storing precious data, they are among the most vulnerable hardware components in a computer system. Monitoring and managing these systems requires increasing amounts of human resources. Information technology (IT) organizations often operate reactively, taking action only when systems reach capacity or fail, at which point performance degradation or failure has already occurred. Hard disk failures fall into one of two basic classes: predictable failures and unpredictable failures. Predictable failures result from slow processes such as mechanical wear and gradual degradation of storage surfaces. Monitoring can determine when such failures are becoming more likely. Unpredictable failures happen suddenly and without warning. They range from electronic components becoming defective to sudden mechanical failure (perhaps due to improper handling). However, a disk failure may not follow a simple fail-stop model. The fault model presented by modern disk drives is much more complex. Among the different types of errors, whole-disk failures and sector errors are the major faults that affect data safety.
One such sector error is the medium error. This error occurs when a particular disk sector cannot be read. Any data previously stored in the sector is lost. Upon detecting a sector error, the disk interface reports a status code specifying the reason why the read command failed. Sector-level errors can occur even when a sector has not been accessed for some time. Therefore, modern disk drives usually include an internal scan process, as shown in FIG. 1, which scans and checks sector reliability and accessibility in the background. Unstable sectors detected in the process are marked as pending sectors, and disk drives can try to rectify these errors through internal protection mechanisms, such as built-in error correction codes and refreshment, which rewrites a sector with the data read from that track to recover the faded data. Any sectors that are not successfully recovered are marked as uncorrectable sectors. After a number of unsuccessful retries, disk drives automatically re-map a failed write to a spare sector. More precisely, a logical block address (LBA) is reassigned from the failed sector to a spare sector, and the content is written to the new physical location. Modern disk drives usually reserve a few thousand spare sectors, which are not initially mapped to particular LBAs. Re-mapping can occur only on detected write errors.
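The sector bookkeeping described above can be illustrated with a minimal sketch. It is a toy model rather than actual drive firmware: the names ToyDisk, background_scan, and write are hypothetical, the retry count of 3 is an arbitrary assumption, and the sets of unstable, recoverable, and failing sectors are supplied by the caller instead of being measured from a medium.

```python
from dataclasses import dataclass, field
from enum import Enum


class SectorState(Enum):
    HEALTHY = "healthy"
    PENDING = "pending"              # unstable, awaiting internal recovery
    UNCORRECTABLE = "uncorrectable"  # internal recovery did not succeed


@dataclass
class ToyDisk:
    """Hypothetical model of pending/uncorrectable marking and write re-mapping."""
    num_lbas: int
    num_spares: int
    max_write_retries: int = 3  # assumed value, for illustration only
    lba_map: dict = field(init=False, default_factory=dict)  # LBA -> physical sector
    state: dict = field(init=False, default_factory=dict)    # physical sector -> state
    next_spare: int = field(init=False, default=0)

    def __post_init__(self):
        # Spare sectors sit after the user-visible range and are not mapped initially.
        self.lba_map = {lba: lba for lba in range(self.num_lbas)}
        total = self.num_lbas + self.num_spares
        self.state = {p: SectorState.HEALTHY for p in range(total)}
        self.next_spare = self.num_lbas

    def background_scan(self, unstable, recoverable):
        """Mark unstable sectors as pending, then try to recover each one."""
        for phys in self.lba_map.values():
            if phys not in unstable:
                continue
            self.state[phys] = SectorState.PENDING
            if phys in recoverable:  # stands in for ECC or refreshment succeeding
                self.state[phys] = SectorState.HEALTHY
            else:
                self.state[phys] = SectorState.UNCORRECTABLE

    def write(self, lba, failing):
        """Write to an LBA; re-map it to a spare only after repeated write failures."""
        phys = self.lba_map[lba]
        for _ in range(self.max_write_retries):
            if phys not in failing:
                return phys  # write succeeded in place
        if self.next_spare >= self.num_lbas + self.num_spares:
            raise RuntimeError("no spare sectors left")
        spare = self.next_spare
        self.next_spare += 1
        self.lba_map[lba] = spare  # the LBA now points at the spare sector
        return spare


disk = ToyDisk(num_lbas=8, num_spares=2)
disk.background_scan(unstable={2, 5}, recoverable={2})
mapping_after_scan = dict(disk.lba_map)  # the scan alone changes no mapping
new_location = disk.write(5, failing={5})
```

In this sketch the background scan only changes sector states. The mapping of LBA 5 changes solely when a write to it keeps failing, which mirrors the statement that re-mapping occurs only on detected write errors.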
Conventionally, disk scrubbing aims to verify data accessibility and to proactively detect data lost on failed sectors so that it can be recovered through redundant array of independent disks (RAID) redundancy. Thus, it scans only live sectors (e.g., those storing data accessible through a file system), which may not be sufficient to detect a vulnerable disk. A vulnerable disk refers to a disk that is likely to fail in the near future or is likely to experience bursts of sector errors.
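The coverage gap of a live-sector-only scrub can be shown with a small sketch. The function name scrub, the disk size, and the sector numbers are made up for illustration and do not come from any measured drive.

```python
def scrub(bad_sectors, scanned_sectors):
    """Return the bad sectors that a scan over scanned_sectors would find."""
    return sorted(set(bad_sectors) & set(scanned_sectors))


all_sectors = range(1000)  # hypothetical disk with 1000 sectors
live_sectors = range(300)  # sectors currently holding file system data
bad_sectors = {10, 250, 400, 401, 402, 990}  # invented error locations

found_by_live_scrub = scrub(bad_sectors, live_sectors)
found_by_full_scan = scrub(bad_sectors, all_sectors)
missed_by_live_scrub = sorted(bad_sectors - set(found_by_live_scrub))
```

Under these assumed values the live-sector scrub reports two bad sectors and misses the burst at sectors 400 to 402, which is the kind of signal that could indicate a vulnerable disk.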