Current-generation magnetic disk storage devices are vulnerable to data integrity problems that appear gradually. At first, recovery from such problems may require only the conventional drive Error Correcting Code (ECC) processing or Input/Output (I/O) retry operations. However, these problems gradually worsen to the point that data may become unrecoverable. An example of a data integrity problem of this kind is known as “track squeeze”.
Track squeeze is seen especially in very high data density devices when they are used under high loads in server applications. It appears when a track on the disk drive is written only rarely, while one or both of the adjacent tracks are written much more frequently. Due to the finite positioning tolerance of the head actuator mechanism, the electromagnetic forces used to effect adjacent track writes intrude to some extent into the rarely written track, reducing the signal strength of the affected track. This in turn causes data errors during read operations. This problem can be reduced or avoided by reducing the track density on the disk surface or by increasing the sophistication and accuracy of the head actuator and the data read process, but all of these techniques have an associated cost.
When errors such as track squeeze initially begin to appear, the impact is modest enough that conventional disk drive error recovery mechanisms (such as read retry or drive ECC) can recover the data. In that case, the problem is not visible to higher layer I/O processing or application programs as an error, but it nevertheless causes performance loss due to the time required to carry out these corrective mechanisms.
As gradual onset errors such as track squeeze become more severe, they progress beyond the point where disk drive error recovery mechanisms can handle them. In that case, I/O operations begin to fail at the disk drive level. Mechanisms for fault tolerant data storage such as Redundant Arrays of Independent Disks (RAID) are effective for maintaining data availability even in the presence of unrecoverable errors in the underlying disk drives. However, this is only the case when the error rate is low enough that the probability of errors beyond the recovery capability of RAID is extremely low. Therefore, normal practice with RAID is to consider as “failed” any disk drive that produces more than a very low error rate.
In the presence of problems such as track squeeze, such a practice may cause disk drives to be considered as “failed” at a rate well in excess of what is acceptable to customers or economically tolerable to storage system suppliers.
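The conventional practice described above can be illustrated with a minimal sketch, written here in Python. It is only an illustration of a threshold-based failure policy, not an implementation taken from any particular RAID product. The class name `DriveErrorMonitor`, its fields, and the threshold value are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class DriveErrorMonitor:
    """Hypothetical per-drive counter for a threshold-based failure policy."""

    # Assumed threshold: highest tolerated ratio of unrecoverable errors to I/Os.
    max_error_rate: float = 1e-6
    io_count: int = 0
    unrecoverable_errors: int = 0

    def record_io(self, unrecoverable: bool) -> None:
        """Count one I/O and note whether the drive failed to recover it."""
        self.io_count += 1
        if unrecoverable:
            self.unrecoverable_errors += 1

    def error_rate(self) -> float:
        """Return unrecoverable errors per I/O observed so far."""
        if self.io_count == 0:
            return 0.0
        return self.unrecoverable_errors / self.io_count

    def is_failed(self) -> bool:
        """Apply the policy: any rate above the threshold marks the drive failed."""
        return self.error_rate() > self.max_error_rate


# Illustrative threshold only; real systems would choose their own value.
monitor = DriveErrorMonitor(max_error_rate=1e-3)

# A long run of clean I/Os leaves the drive in service.
for _ in range(10_000):
    monitor.record_io(unrecoverable=False)
healthy_verdict = monitor.is_failed()

# A small number of reads from squeezed tracks exceed drive-level recovery.
for _ in range(20):
    monitor.record_io(unrecoverable=True)
squeezed_verdict = monitor.is_failed()
```

In this sketch the drive is retired after only 20 unrecoverable reads out of 10,020 I/Os, even though the underlying cause is confined to a few rarely written tracks. That is the behavior that makes a simple threshold policy costly when gradual onset errors are present.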