1. Technical Field
The present invention relates to an improved digital storage system. In particular, the present invention relates to a method and system for preventing hard disk drive failures. More particularly, the present invention relates to a method and system for taking corrective action, in response to detecting an imminent hard disk drive failure, to prevent that failure. Still more particularly, the present invention relates to utilizing predictive failure analysis to determine that corrective action is necessary and undertaking such corrective action, thereby minimizing hard disk drive failures and the effects thereof.
2. Description of the Related Art
Generally, a digital data storage system consists of one or more storage devices that store data on storage media such as magnetic or optical data storage disks. In magnetic disk storage systems, a storage device is called a hard disk drive (HDD), which includes one or more storage disks and an HDD controller to manage local operations concerning the disks.
A number of known storage subsystems incorporate certain techniques and devices to predict storage device failures, along with other techniques and devices to protect data from being lost or corrupted by such failures. However, as discussed below, these systems do not adequately address methods or devices for utilizing drive error information within the storage device to prevent a read/write error from occurring.
There are several electromechanical performance parameters within an HDD that, when unobtrusively monitored, may provide warnings of impending drive failures. These parameters include but are not limited to: signal amplitude and resolution; head fly height; and channel noise signal coherence. A common method of utilizing such information is known to those skilled in the art as Predictive Failure Analysis (PFA).
Data storage systems, such as hard disk drives, commonly employ PFA as a self-diagnostic tool. PFA is usually implemented via micro-code instructions that control drive assemblies. The main purpose of PFA (sometimes referred to as Self-Monitoring, Analysis and Reporting Technology, or “SMART”) is to issue warnings to users that the hard disk drive is deteriorating and may “crash”. PFA is implemented by performing periodic self-diagnostic tests. For example, PFA may be utilized to measure and compare current parameter values against those stored at the time of manufacture. PFA may also be utilized to examine the time rate of change of HDD performance parameters. An example of such a parameter is resolution, which is correlated to the fly height of a magneto-resistive (MR) head. Consistent with current implementations of PFA, a detected increase in resolution beyond some pre-determined threshold may trigger a PFA warning.
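By way of illustration only, the two PFA checks described above (comparison against a value stored at the time of manufacture, and examination of the time rate of change) may be sketched as follows. This is a minimal sketch and not actual drive micro-code; the names `ParameterSample` and `pfa_warning`, and the limits `max_deviation` and `max_rate`, are hypothetical and are introduced here solely for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ParameterSample:
    """One self-diagnostic measurement (hypothetical structure)."""
    time_s: float  # time of the measurement, in seconds
    value: float   # measured parameter value, e.g. resolution


def pfa_warning(baseline: float,
                samples: Sequence[ParameterSample],
                max_deviation: float,
                max_rate: float) -> bool:
    """Return True if the most recent sample warrants a warning.

    baseline      -- value stored at the time of manufacture
    max_deviation -- assumed limit on |current - baseline|
    max_rate      -- assumed limit on |change| per second between
                     the two most recent samples
    """
    if not samples:
        return False

    latest = samples[-1]

    # Check 1: compare the current value against the stored baseline.
    if abs(latest.value - baseline) > max_deviation:
        return True

    # Check 2: examine the time rate of change of the parameter.
    if len(samples) >= 2:
        previous = samples[-2]
        dt = latest.time_s - previous.time_s
        if dt > 0 and abs(latest.value - previous.value) / dt > max_rate:
            return True

    return False
```

In such a sketch, either check alone is sufficient to raise the warning; an actual implementation may weigh or combine the parameters differently.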
Data Recovery Procedure (DRP) is a common disk failure remedial operation that is invoked whenever a user initiates a command that fails to be properly executed by the hard disk drive. For example, a typical cause of read/write errors is improper positioning of the MR head by HDD control circuitry. The result may be a degraded signal-to-noise ratio that prevents the disk drive from properly decoding the read-back signal. In response to such a failed read attempt, the control circuitry may alert and trigger initiation of DRP processes. Thus, DRP is a collection of operations intended to alleviate the errors caused by disk drive failures. In the example above, one possible DRP operation would be to command the drive control circuitry to reposition the MR head. Unlike PFA, which is conducted during “standby” periods (periods between read/write operations), DRP is itself a type of read/write operation, and therefore occurs while the hard disk drive is in its read/write or “active” mode. Therefore, the time allotted to DRP is typically minimized to reduce its impact on user operations (to avoid jittery video, for example). DRP is therefore geared to data recovery and is invoked in-stream with user operations such as “read” and “write” operations.
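The in-stream character of DRP described above may be illustrated by the following minimal sketch, in which a failed read triggers a head reposition followed by a retry, and the number of retries is bounded so that the delay imposed on the user operation stays small. The callables `read_sector` and `reposition_head` and the limit `max_retries` are hypothetical stand-ins for drive control operations and are not taken from any actual drive interface.

```python
from typing import Callable, Optional


def read_with_drp(read_sector: Callable[[], Optional[bytes]],
                  reposition_head: Callable[[], None],
                  max_retries: int = 3) -> Optional[bytes]:
    """Attempt a read, applying a simple recovery step on failure.

    read_sector     -- hypothetical read; returns None on failure
    reposition_head -- hypothetical recovery step (re-seek the head)
    max_retries     -- assumed bound on recovery attempts, kept small
                       because recovery runs in the active mode
    """
    for attempt in range(max_retries + 1):
        data = read_sector()
        if data is not None:
            return data
        # The read failed: reposition only if another attempt remains.
        if attempt < max_retries:
            reposition_head()
    return None  # recovery budget exhausted; the error is reported
```

Bounding the retries reflects the constraint noted above: because recovery occurs during the active mode, each additional attempt directly delays the user operation.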
Based on the foregoing, it can be appreciated that a need exists for an improved method and system that would allow the drive control circuitry to both predict and respond to deteriorating or otherwise non-optimal disk drive performance. Such a method and system, if implemented, would leverage existing drive error prediction and recovery tools such as PFA and DRP so that faulty drive performance may be diagnosed and corrected in order to prevent a drive failure and the resulting loss of data.
It is therefore an object of the invention to provide a method and system for improving a digital storage system.
It is another object of the invention to provide an improved method and system for preventing hard disk drive failures.
It is still another object of the invention to provide an improved method and system for predicting a hard disk drive failure and taking corrective action to prevent such a failure.
It is yet another object of the invention to provide an improved method and system that utilize predictive failure analysis to determine that corrective action is necessary and undertake such corrective action, thereby minimizing hard disk drive failures and the effects thereof.
The above and other objects are achieved as is now described. A method in a data processing system for minimizing read/write errors caused by impaired performance of a hard disk drive during runtime operation of said hard disk drive is disclosed. The runtime operation includes an active mode and a standby mode. First, at least one performance parameter of a hard disk drive is monitored during a standby mode of operation. Next, in response to monitoring at least one performance parameter, preventive recovery action is periodically performed only during said standby mode of operation, such that said performance parameter is maintained at an acceptable value without interfering with hard disk drive operation during an active mode.
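The method summarized above may be sketched, under stated assumptions, as a single maintenance step that is invoked periodically. The step does nothing in the active mode; in the standby mode it monitors a performance parameter and performs preventive recovery action only if the parameter is outside its acceptable range. The names `Mode`, `standby_maintenance_step`, `measure`, `is_acceptable`, and `recover` are hypothetical and serve only this illustration; the sketch is not the claimed implementation.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    ACTIVE = "active"    # servicing read/write operations
    STANDBY = "standby"  # between read/write operations


def standby_maintenance_step(mode: Mode,
                             measure: Callable[[], float],
                             is_acceptable: Callable[[float], bool],
                             recover: Callable[[], None]) -> bool:
    """Run one monitoring pass; return True if recovery was performed.

    measure       -- hypothetical reading of a performance parameter
    is_acceptable -- hypothetical test of that reading
    recover       -- hypothetical preventive recovery action
    """
    # Never monitor or recover while the drive is in the active mode.
    if mode is not Mode.STANDBY:
        return False

    value = measure()
    if is_acceptable(value):
        return False

    # The parameter is out of range: act now, while the drive is idle.
    recover()
    return True
```

Checking the mode before measuring keeps both the monitoring and the recovery action out of the active mode, so neither interferes with user read/write operations.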