1. Field of the Invention
The present invention relates generally to disk fault correction techniques for storage devices and, more particularly, to a method of correcting disk drive media faults while the hard drive is idle.
2. Description of Related Art
The vast majority of personal computer (PC) systems available today come equipped with a peripheral data storage device such as a hard disk (HD) drive. Hard disks are comprised of rigid platters, made of aluminum alloy or a mixture of glass and ceramic, covered with a magnetic coating. Platters vary in size and hard disk drives generally come in two form factors, 5.25 in or 3.5 in. Typically, two or more platters are stacked on top of each other with a common spindle that turns the whole assembly at several thousand revolutions per minute. There is a gap between the platters, making room for a magnetic read/write head, mounted on the end of an actuator arm. There is a read/write head for each side of each platter, mounted on arms which can move them radially. The arms are moved in unison by a head actuator, which contains a voice coil, an electromagnetic coil that can move a magnet very rapidly.
Each platter is double-sided and divided into tracks. Tracks are concentric circles around the central spindle. Tracks physically above each other on the platters are grouped together into a cylinder. Cylinders are further divided into sectors. Depending on the disk drive vendor, a sector is typically comprised of 512 bytes of user data, followed by a number of cross-check bytes, a number of error correction code (ECC) bytes and other vendor specific diagnostic information. Thus, these devices are complex electro-mechanical devices and, as such, can suffer performance degradation or failure due to a single event or a combination of events.
There are two general classes of failures that can occur in disk drives. The first class is the "catastrophic" type of failure, which causes the drive to fail quickly and unpredictably. These failures can be caused by static electricity, handling damage, or thermal-related solder problems. Probably the only way to prevent these failures, if at all, is through more controlled manufacturing and handling processes. Certainly, there is little hope of predicting these types of failures once the drive is put in service.
The second class of failures results from the gradual decay of other electrical and/or mechanical components within the drive after it is put in service. Before this larger class of failures is discussed, it is important to understand some of the correction schemes built into the disk drives to overcome the most common failure: media defects.
Most drives include an error detection mechanism to catch errors during read operations. While this type of defect correction is adequate to catch defects as the sectors are read, it does nothing to catch latent defects in sectors that have not been read. This is important because data is sometimes not read back from the disk for a very long time after it has been written. As time passes, defects sometimes grow past the point of correctability. Thus, this technique is only adequate for on-the-fly correction.
Historically, there are also several ways for users to manage this class of failure:
1. Do nothing but wait for the drive to fail and then replace the drive. This is the easiest but will cause much down time and lost data when the drive fails.
2. Practice periodic preventative maintenance and simply replace the drive before it fails. This is somewhat effective in reducing unscheduled down time but suffers from the high cost of replacing drives before their life has been exhausted.
3. Use redundancy or backups. This technique is also effective in reducing unscheduled down time. It does not require the drives to be replaced before they fail, but suffers from the cost of having duplicate or additional hardware.
4. Rely on the disk drive's built-in error correction schemes to make corrections as the data is read.
5. Use Predictive Failure Analysis (PFA). Because this second class of failure can occur over time, it is possible to predict these types of failures by monitoring conditions of the drive.
Disk Drive Error Correction and Detection
Because disk drives are inherently defect prone, error correction scans are performed on the disk drives at the factory to mark any defective sectors before the drives are put into service. Disk drives also have error checking built in for field use. Each sector includes a number of ECC bytes and cross-check bytes. The cross-check bytes are used to double check the main ECC correction and reduce the probability of miscorrection. The cross-check and ECC bytes are computed and appended to the user data when the sector is first written with data.
Each time the drive reads a sector of data, it generates a new set of ECC and cross-check bytes based on the 512 bytes of data contained within the sector. The new set of cross-check and ECC bytes is compared with the corresponding bytes originally written in that particular sector. This comparison process results in bytes that are known as syndromes. If all of the syndrome values are zero, the data has been read with no errors, and the sector of data is transferred to a host computer. If any of the syndromes are non-zero, an error has occurred. The type of correction applied by the drive then depends on the nature and extent of the error and the vendor's proprietary techniques.
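By way of illustration only, the syndrome comparison described above may be sketched as follows. The sketch assumes a CRC-32 as a stand-in for the vendor's proprietary ECC and cross-check codes, which are not specified here; the function names are hypothetical. A CRC can only detect errors, so the sketch shows the zero/non-zero syndrome test and not the correction step.

```python
import zlib

SECTOR_DATA_BYTES = 512  # user-data size per sector, as stated in the text


def compute_check_bytes(data: bytes) -> bytes:
    """Stand-in check-byte generator (CRC-32 is an assumption, not a drive's real ECC)."""
    return zlib.crc32(data).to_bytes(4, "big")


def write_sector(data: bytes) -> bytes:
    """Append check bytes to the user data, as done when the sector is first written."""
    if len(data) != SECTOR_DATA_BYTES:
        raise ValueError("sector payload must be 512 bytes")
    return data + compute_check_bytes(data)


def syndromes(raw_sector: bytes) -> bytes:
    """XOR freshly computed check bytes with the stored ones."""
    data = raw_sector[:SECTOR_DATA_BYTES]
    stored = raw_sector[SECTOR_DATA_BYTES:]
    fresh = compute_check_bytes(data)
    return bytes(a ^ b for a, b in zip(fresh, stored))


def read_ok(raw_sector: bytes) -> bool:
    """All-zero syndromes mean the sector was read with no errors."""
    return not any(syndromes(raw_sector))
```

In this sketch a sector whose stored bytes are unchanged yields all-zero syndromes, while a flipped bit in the user data yields at least one non-zero syndrome byte.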
When a data error occurs, the disk drive checks to see if the error is correctable on the fly. If correctable on the fly, the error is corrected and the data is transferred to the host system. Errors corrected in this manner are invisible to the host system.
If the data is not correctable on the fly, the sector is typically re-read a number of times in an attempt to read the data correctly before applying more sophisticated correction algorithms. This strategy avoids invoking correction on non-repeatable, or soft, errors. Each time a sector in error is re-read, a set of ECC syndromes is computed. If all of the syndrome values are zero, the data was read with no errors, and the sector is transferred to the host system. If any of the syndromes are not zero, an error has occurred, the syndromes are retained, and another re-read is invoked. Depending on the disk drive vendor, the drive typically attempts a number of re-reads with more sophisticated ECC algorithms. Most drives include an automatic read reallocation feature; when this feature is enabled, a drive that encounters a defective sector automatically reallocates it to a good sector.
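The re-read strategy described above may be sketched, purely as an example, in the following form. The callables `read_once` and `heavy_correct`, the retry count, and the returned status strings are hypothetical; actual retry counts and correction algorithms are vendor specific.

```python
from typing import Callable, Optional, Tuple


def read_with_retries(
    read_once: Callable[[], Tuple[bytes, bool]],
    max_rereads: int = 3,
    heavy_correct: Optional[Callable[[bytes], Optional[bytes]]] = None,
) -> Tuple[Optional[bytes], str]:
    """Read a sector, re-reading before falling back to heavier correction.

    read_once returns (data, syndromes_all_zero).
    heavy_correct returns corrected data, or None if correction fails.
    """
    data, clean = read_once()
    if clean:
        return data, "clean"
    # Re-read first so that non-repeatable (soft) errors never reach
    # the more sophisticated correction algorithm.
    for _ in range(max_rereads):
        data, clean = read_once()
        if clean:
            return data, "soft_error"
    if heavy_correct is not None:
        fixed = heavy_correct(data)
        if fixed is not None:
            return fixed, "corrected"
    return None, "unrecoverable"
```

A result of "corrected" in this sketch corresponds to the case in which the sector would be a candidate for reallocation.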
Most drives allocate a number of spare sector pools, each pool containing a small number of spare sectors. If a sector on a cylinder is found to be defective, the address of the sector is added to the drive's defect list. Sectors located physically subsequent to the defective sector are assigned logical block addresses such that a sequential ordering of logical blocks is maintained. This inline sparing technique is employed in an attempt to eliminate slow data transfer that would result from a single defective sector on a cylinder. If more than the number of spare sectors in a single pool are found defective, the above inline sparing technique is applied to the single pool only. The remaining defective sectors are replaced from the nearest available pool of spares.
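As a non-limiting illustration of inline sparing, the following sketch maps the logical blocks of one cylinder onto its physical sectors while skipping entries in the defect list. The layout (data sectors followed by one pool of spares) and the function name are assumptions made for the example.

```python
from typing import List, Set, Tuple


def inline_sparing(
    data_sectors: int, spare_sectors: int, defects: Set[int]
) -> Tuple[List[int], int]:
    """Map logical blocks to physical sectors on one cylinder.

    Returns (mapping, overflow): mapping[i] is the physical sector that
    holds logical block i, and overflow is the number of logical blocks
    that do not fit and must be placed in another pool of spares.
    """
    total = data_sectors + spare_sectors
    # Skipping defective sectors keeps logical blocks in sequential
    # physical order, so a single defect does not force a long seek.
    usable = [p for p in range(total) if p not in defects]
    mapping = usable[:data_sectors]
    overflow = data_sectors - len(mapping)
    return mapping, overflow
```

For example, with eight data sectors, two spares, and physical sector 3 defective, logical blocks 3 through 7 shift to physical sectors 4 through 8 and nothing overflows.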
Defects that occur in the field are known as grown defects. Sectors are considered to contain grown defects if the sophisticated ECC algorithm must be applied to recover the data. If this algorithm is successful, the corrected data is stored in the newly allocated sector. If the algorithm is not successful, a pending defect will be added to the defect list. Any subsequent read to the original logical block will return an error if the read is not successful. A host command to over-write the location will result in multiple write/read/verifies of the suspect location. If any of the multiple write/read/verifies fail, the new data will be written to a spare sector, and the original location will be added to the permanent defect list. If all multiple write/read/verifies pass, data will be written to the location, and the pending defect will be removed from the defect list.
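The handling of a host over-write to a location carrying a pending defect may be sketched as below. The class `DefectList`, the callable `verify_pass`, the spare address argument, and the pass count are hypothetical names and values chosen for the example only.

```python
from typing import Callable, Dict, Set


class DefectList:
    """Minimal in-memory stand-in for a drive's defect bookkeeping."""

    def __init__(self) -> None:
        self.pending: Set[int] = set()
        self.permanent: Set[int] = set()
        self.remapped: Dict[int, int] = {}


def overwrite_pending(
    lba: int,
    defects: DefectList,
    verify_pass: Callable[[int], bool],
    spare: int,
    passes: int = 3,
) -> int:
    """Return the location that ends up holding the newly written data.

    verify_pass(lba) performs one write/read/verify cycle on the suspect
    location and returns True if it succeeds.
    """
    if lba not in defects.pending:
        return lba
    if all(verify_pass(lba) for _ in range(passes)):
        # Every cycle passed: keep the location, clear the pending defect.
        defects.pending.discard(lba)
        return lba
    # At least one cycle failed: retire the location and use a spare.
    defects.pending.discard(lba)
    defects.permanent.add(lba)
    defects.remapped[lba] = spare
    return spare
```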
Predictive Failure Analysis
PFA monitors key drive performance indicators for change over time or exceeding specified limits. This technique has become known in the industry as Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T., hereinafter simply SMART).
SMART is an industry standard for both Small Computer System Interface (SCSI) and PC-AT Attachment (ATA) disk drive interfaces. The SMART standard for SCSI devices is defined in the American National Standards Institute (ANSI) SCSI Informational Exception Control (IEC) document X3T10/94-190, which is hereby incorporated by reference herein. The SMART standard for ATA devices is defined in the Small Form Factor (SFF) document SFF-8035, entitled "Self-Monitoring, Analysis and Reporting Technology," Revision 2.0, dated Apr. 1, 1996 (hereinafter referred to as the SMART specification), which is hereby incorporated by reference herein.
PFA and SMART techniques are disclosed in U.S. Pat. No. 5,828,583 to Bush et al., incorporated herein by reference. These techniques monitor device performance, analyze data from periodic internal measurements, and recommend replacement when specific thresholds are exceeded. The thresholds are determined by examining the history logs of disk drives that have failed in the field. In the first incarnation of SMART, the host computer polled the disk drive on a periodic basis to determine whether the disk drive was failing. In subsequent revisions, when commanded by the host computer, the disk drive makes the determination and simply reports the status. When a failure is deemed imminent, the host computer signals the end user or a system administrator. With sufficient warning, users have the opportunity to back up vital data and replace suspect drives prior to data loss or unscheduled down time.
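The threshold comparison underlying such predictive techniques may be sketched as follows. The indicator names and limit values are invented for the example and do not come from the SMART specification; the sketch simply flags any measurement that exceeds its limit.

```python
from typing import Dict, List


def exceeded(measurements: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Names of monitored indicators whose measurement exceeds its limit."""
    return sorted(
        name
        for name, value in measurements.items()
        if name in limits and value > limits[name]
    )


def drive_status(measurements: Dict[str, float], limits: Dict[str, float]) -> str:
    """Report the status a host would pass on to the user or administrator."""
    return "failure predicted" if exceeded(measurements, limits) else "ok"


# Hypothetical limits, standing in for values derived from field history logs.
EXAMPLE_LIMITS = {"reallocated_sectors": 50, "seek_errors": 200}
```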
Thus, as hard drive technology evolves to provide ever increasing amounts of data storage, a more proactive way of predicting and correcting the drive failures predicted by the PFA and SMART techniques is desired.
According to a preferred embodiment, the present invention includes a method, apparatus and computer system for detecting and correcting errors in a storage device. The storage device includes media that is addressable in small units, such as sectors, for storing data. Periodically, the storage device scans the media for errors and defects. If a data error is correctable, the data is rewritten to the media and tested again. If the error repeats, the media is deemed defective and the data is relocated to another sector.
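By way of example only, the rewrite, retest, and relocate sequence described above may be sketched for a single sector as follows. The callables `read`, `write`, and `relocate` and the status strings are hypothetical; treating an uncorrectable read as a separate outcome is an assumption of the sketch.

```python
from typing import Callable, Optional, Tuple

ReadFn = Callable[[int], Tuple[Optional[bytes], bool]]


def scrub_sector(
    sector: int,
    read: ReadFn,
    write: Callable[[int, bytes], None],
    relocate: Callable[[int, bytes], None],
) -> str:
    """Check one sector and repair it if a correctable error is found.

    read(sector) returns (data, had_error); data is the corrected data,
    or None when the error could not be corrected.
    """
    data, had_error = read(sector)
    if not had_error:
        return "ok"
    if data is None:
        return "uncorrectable"  # assumed outcome, outside the described flow
    write(sector, data)         # rewrite the corrected data in place
    _, repeats = read(sector)   # test the same location again
    if not repeats:
        return "rewritten"
    relocate(sector, data)      # the media is deemed defective
    return "relocated"
```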
Preferably, the scanning is performed during idle periods. The storage device waits for a certain usage period to expire before scanning the entire storage device. Once that period has expired, the storage device waits until it is idle before performing one or more scans. The media is preferably scanned in segments comprising a plurality of sectors so that the device scanning operation can be broken into smaller operations. After a segment is complete, the storage device calculates the elapsed time to scan the last segment and stores the value.
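One possible arrangement of the segmented idle-time scan is sketched below. The class name, the `on_idle` entry point, the injectable clock, and the choice to restart the usage period after a full pass are all assumptions of the sketch and are not taken from the description above.

```python
import time
from typing import Callable, Optional


class IdleScanner:
    """Scans the media one segment at a time whenever the device is idle."""

    def __init__(
        self,
        total_sectors: int,
        segment_size: int,
        usage_period_s: float,
        scan_sector: Callable[[int], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.total_sectors = total_sectors
        self.segment_size = segment_size
        self.usage_period_s = usage_period_s
        self.scan_sector = scan_sector
        self.clock = clock
        self.next_sector = 0
        self.last_segment_seconds: Optional[float] = None
        self.period_start = clock()

    def on_idle(self) -> bool:
        """Scan one segment if the usage period has expired.

        Returns True if a segment was scanned during this call.
        """
        if self.clock() - self.period_start < self.usage_period_s:
            return False
        begin = self.clock()
        end = min(self.next_sector + self.segment_size, self.total_sectors)
        for sector in range(self.next_sector, end):
            self.scan_sector(sector)
        # Store the elapsed time for the segment just completed.
        self.last_segment_seconds = self.clock() - begin
        if end >= self.total_sectors:
            # Full pass finished: start over and wait out a new usage
            # period (an assumption of this sketch).
            self.next_sector = 0
            self.period_start = self.clock()
        else:
            self.next_sector = end
        return True
```

Breaking the scan into segments in this way lets each idle period do a bounded amount of work, and the stored elapsed time could be used to judge whether a further segment fits in the available idle time.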
The storage device maintains a count of the number of defects, and defective sectors are identified in a defect list.