Generally, data storage systems have one or more data storage devices that store data on storage media such as a magnetic or optical data storage disc. In magnetic storage, for example, one or more of the magnetic discs are grouped together in a disc drive.
Preferably, the disc drive has a disc drive controller that is responsive to program instructions to unobtrusively monitor status and various operational parameters in order to predict a potential failure before it occurs. A widely employed Predictive Failure Analysis (PFA) tool is Self-Monitoring, Analysis and Reporting Technology (SMART). PFA issues an indication when conditions exist, or appear to be trending, that are consistent with a failure mode. PFA can be implemented by performing self-diagnostic tests, such as by comparing current parametric values against those stored in memory during manufacturing. PFA can also predict a failure based on the observed time rate of change of parametric values.
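The two PFA tests described above, comparison against values stored during manufacturing and monitoring of the time rate of change, can be sketched as follows. This is an illustrative sketch only, not the SMART specification or any drive vendor's firmware; the names ParameterBaseline and predict_failure, the limit fields, and the use of hours as the time unit are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class ParameterBaseline:
    """Hypothetical record of one parameter as stored at manufacturing."""
    name: str
    baseline: float       # value recorded during manufacturing
    max_deviation: float  # assumed limit on absolute drift from baseline
    max_rate: float       # assumed limit on absolute change per hour


def predict_failure(spec, samples):
    """Return the reasons a failure is predicted (empty list if none).

    samples is a list of (hours, value) pairs, oldest first.
    """
    reasons = []
    if not samples:
        return reasons
    t_last, v_last = samples[-1]
    # Test 1: compare the current value with the manufacturing baseline.
    if abs(v_last - spec.baseline) > spec.max_deviation:
        reasons.append("deviation")
    # Test 2: compare the observed time rate of change with a limit.
    if len(samples) >= 2:
        t_first, v_first = samples[0]
        elapsed = t_last - t_first
        if elapsed > 0 and abs(v_last - v_first) / elapsed > spec.max_rate:
            reasons.append("trend")
    return reasons
```

With an assumed signal-to-noise baseline of 20.0, a drift limit of 3.0, and a rate limit of 0.05 per hour, a drop from 20.0 to 18.0 over ten hours is flagged on trend alone: the value is still within the drift limit, but it is changing faster than the rate limit allows.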
Disc drives with predictive failure capability, sometimes referred to as “SMART drives,” can further employ Data Recovery Procedures (DRP) to preemptively recover from a predicted failure. As an example, the SMART drive might indicate a predicted failure based on a degraded signal-to-noise ratio. As a result, DRP circuitry might initiate a repositioning of the magnetoresistive (MR) head.
As storage capacity and flexibility demands have increased in recent years, the use of storage area networks (SANs) has proliferated. In a SAN, disc drives are grouped into an array and either used collectively as bulk storage or partitioned into discrete storage entities. Within the SAN it is advantageous to store data in a fault tolerant arrangement, such as in a Redundant Array of Independent Discs (RAID). This permits recovery of corrupted data either by retrieving mirrored data or by reconstructing the data from stored parity information.
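Reconstruction from stored parity can be illustrated with a minimal sketch. The text above does not name a RAID level; the example assumes simple XOR parity across equal-length blocks, as commonly used in single-parity RAID arrangements. The function names xor_blocks and rebuild_missing and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data blocks, each imagined on a separate drive, plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Simulate losing the second drive and rebuilding its block.
recovered = rebuild_missing([data[0], data[2]], parity)
```

Because XOR is its own inverse, combining the parity block with every surviving data block yields exactly the block that was lost. The scheme tolerates the loss of one block per stripe; mirrored arrangements instead recover by reading the duplicate copy.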
DRP and RAID are both aimed at maintaining highly reliable stored data. They do so, however, in different and often conflicting ways. For example, DRP emphasizes in-situ repair of a failure condition, but at the relatively high cost of the processing overhead necessary to recover from the predicted failure. SMART drives, having originally been employed mainly in stand-alone systems, are often over-inclusive in predicting failures in that they tend to err on the side of ensuring data integrity. RAID systems, by contrast, are typically employed within a scalable storage capacity that can be grown if necessary to accommodate failures. Sparing, for example, is typical in RAID systems, whereby extra disc drives are available for use in the event of a storage failure. When a threshold amount of the sparing has been utilized, it is more efficient to add sparing capacity, or to copy data from the failed drives and replace them, than it is to perform in-situ recovery procedures.
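The sparing trade-off described above can be expressed as a simple decision rule. The sketch below is illustrative only: the function name choose_response, the default threshold of 0.5, and the assumption that in-situ recovery is the response while spare usage remains below the threshold are not taken from the text.

```python
def choose_response(spares_used, spares_total, threshold=0.5):
    """Pick a response to a predicted drive failure.

    threshold is a hypothetical fraction of spare capacity already consumed.
    """
    if spares_total <= 0:
        raise ValueError("spares_total must be positive")
    # Once enough sparing is consumed, growing or replacing capacity is
    # treated as more efficient than recovering in place.
    if spares_used / spares_total >= threshold:
        return "add_spares_or_replace_drive"
    # Assumed default below the threshold: recover in place.
    return "in_situ_recovery"
```

For example, with four spare drives of which one is in use, the rule returns "in_situ_recovery"; once two are in use it returns "add_spares_or_replace_drive".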
What is needed is a solution that leverages both the predictive failure and in-situ advantages of DRP and the flexibility and efficiency advantages of RAID to minimize the instances of unscheduled maintenance in a data storage subsystem. It is to these advantages that the embodiments of the present invention are directed.