1. Field of the Invention
The present invention relates to hard disk drives. More particularly, the present invention relates to a method for recovering data from marginally defective data sites on a disk surface.
2. Description of the Related Art
Hard disk drives store large volumes of data on one or more disks mounted on a spindle assembly. Disk drives employ a disk control system for interfacing with a host (e.g., a computer) to control the reading and writing of data on a disk. Each disk includes at least one disk surface which is capable of storing data. On each disk surface, user data is stored in concentric circular tracks between an outside diameter and an inside diameter of the disk.
As a result of the manufacturing process, defective data sites may exist on the disk surfaces of the disk drive. These defective data sites are termed “primary defects.” A defect discovery procedure is performed to locate these defects and mark them out as defective locations on the disk surface which are not available for use. A typical defect discovery procedure includes writing a known data pattern to the disk surface and subsequently reading the data pattern from the disk surface. Defective data sites are identified by comparing the data pattern read from the disk surface with the known data pattern written to the disk surface.
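By way of illustration only, the write, read, and compare procedure described above may be sketched as follows. The names `discover_defects` and `SimulatedSurface`, the test pattern, and the way a bad site corrupts its data are assumptions made for this example and are not taken from any particular drive.

```python
from typing import Callable, List


def discover_defects(
    num_sites: int,
    write_site: Callable[[int, bytes], None],
    read_site: Callable[[int], bytes],
    pattern: bytes = b"\xA5" * 8,
) -> List[int]:
    """Write a known pattern to every site, read it back, and list mismatches."""
    for site in range(num_sites):
        write_site(site, pattern)
    defects = []
    for site in range(num_sites):
        if read_site(site) != pattern:
            defects.append(site)
    return defects


class SimulatedSurface:
    """Toy disk surface (assumption): sites in `bad` flip the first byte on read."""

    def __init__(self, bad):
        self.bad = set(bad)
        self.data = {}

    def write(self, site: int, payload: bytes) -> None:
        self.data[site] = payload

    def read(self, site: int) -> bytes:
        payload = self.data[site]
        if site in self.bad:
            return bytes([payload[0] ^ 0xFF]) + payload[1:]
        return payload


surface = SimulatedSurface(bad=[3, 7])
plist = discover_defects(10, surface.write, surface.read)
```

In this sketch, `plist` holds the site numbers that failed the comparison, which corresponds to the list of defective locations that is subsequently marked as unavailable for use.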
Following the defect discovery procedure, defective data sites are put in a primary defect list (PLIST). The primary defect list is used during formatting of the disk surface to generate a primary defect management table. The defective data sites contained within the primary defect management table are skipped during normal operation of the disk drive. Once identified in the primary defect management table, a defective data site may not be used for storing data.
Defective data sites encountered after formatting the disk surface are known as “grown defects” or “secondary defects.” Grown defects often occur in locations adjacent to defective data sites found during defect discovery. Grown defects are also written to a list known as the grown defect list (GLIST), similar to that utilized for the primary defects. Grown defects encountered during the operation of the disk drive are added to a secondary defect management table. The secondary defect management table is utilized along with the primary defect management table during the operation of the disk drive for the identification of defective data sites on the disk surface. The defective data sites residing within the secondary defect management table are reassigned or “vectored” (i.e., mapped via an index pointer) to spare data site locations via a cross-reference entry (cylinder number, head number, and data sector number).
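By way of illustration only, a cross-reference entry of the kind described above may be modeled as a mapping from the address of a grown defect to the address of its spare data site. The names `SiteAddress` and `resolve`, and the addresses used, are hypothetical and chosen for this example.

```python
from typing import Dict, NamedTuple


class SiteAddress(NamedTuple):
    """Physical location of a data site (fields follow the cross-reference entry)."""
    cylinder: int
    head: int
    sector: int


def resolve(site: SiteAddress, vector_table: Dict[SiteAddress, SiteAddress]) -> SiteAddress:
    """Follow the cross-reference entry for a grown defect, if one exists."""
    return vector_table.get(site, site)


# Hypothetical table holding one grown defect vectored to a spare data site.
vector_table = {SiteAddress(120, 1, 17): SiteAddress(4095, 0, 3)}

remapped = resolve(SiteAddress(120, 1, 17), vector_table)
untouched = resolve(SiteAddress(120, 1, 18), vector_table)
```

A site with no entry resolves to itself, so only vectored sites incur a redirection to the spare location.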
Defects such as “primary defects” and “grown defects” are known as hard sector errors. A hard sector error is essentially permanent in nature; thus, the sector cannot be recovered. A disk may also contain transient or “soft” errors. A transient error is defined as an error or defect which clears over a period of time. For example, a transient error may occur due to a thermal asperity on the disk surface. A retry mode may be entered, wherein the command (such as a read) is retried a number of times, allowing sufficient time to pass for the transient error to clear. Transient errors are also logged on the drive as they occur.
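By way of illustration only, a retry mode with error logging may be sketched as follows. The function names, the retry limit, and the simulated reader are assumptions for this example; an actual drive would also allow time to pass between attempts, which this sketch omits.

```python
from typing import Callable, List, Tuple


def read_with_retries(
    read_once: Callable[[], bytes],
    max_retries: int,
    error_log: List[str],
) -> Tuple[bytes, int]:
    """Retry a read until it succeeds; return the data and the retries used."""
    retries = 0
    while True:
        try:
            return read_once(), retries
        except IOError as err:
            error_log.append(str(err))  # transient errors are logged as they occur
            if retries == max_retries:
                raise  # retries exhausted; the error is no longer treated as transient
            retries += 1


def make_flaky_reader(failures: int, payload: bytes) -> Callable[[], bytes]:
    """Simulated read (assumption): fails `failures` times, then succeeds."""
    remaining = [failures]

    def read_once() -> bytes:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise IOError("transient read error")
        return payload

    return read_once


log: List[str] = []
data, retries_used = read_with_retries(make_flaky_reader(2, b"user data"), 5, log)
```

The retry count returned here is the kind of per-read figure that a drive can accumulate to judge the soft error rate of a sector.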
If the transient or “soft” error rate on a particular sector reaches an unacceptable level during normal read operations, the hard drive may attempt to rewrite the data to the data sector encountering the soft errors. The rewrite operation typically reads the data from the sector encountering the soft errors, copies the data to a data buffer, then attempts to rewrite the data to the data sector encountering the soft errors. Data is copied to the data buffer so that if the rewrite operation fails, the data is still available. After the data has been rewritten, the drive verifies the rewrite operation. If no problems are encountered after the rewrite, the drive resumes normal operation. If the rewrite operation fails, the data in the data buffer must be reassigned to alternative data sectors on the disk.
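By way of illustration only, the conventional rewrite operation described above may be sketched as follows. `ToyDisk`, its `worn` set, and the site numbers are assumptions for this example. Note that in this conventional flow the only copy of the data during the rewrite is held in the volatile buffer.

```python
class ToyDisk:
    """Toy disk (assumption): a write to a site in `worn` does not store usable data."""

    def __init__(self, contents, worn=()):
        self.cells = dict(contents)
        self.worn = set(worn)

    def read(self, site):
        return self.cells[site]

    def write(self, site, data):
        self.cells[site] = b"" if site in self.worn else data


def rewrite_sector(disk, site, spare):
    """Rewrite in place; reassign to `spare` only if verification fails."""
    buffer = disk.read(site)       # copy the data to a data buffer
    disk.write(site, buffer)       # rewrite the same data sector
    if disk.read(site) == buffer:  # verify the rewrite
        return site
    disk.write(spare, buffer)      # rewrite failed: reassign from the buffer
    return spare


disk = ToyDisk({1: b"abc", 2: b"xyz"}, worn=[2])
healthy_location = rewrite_sector(disk, 1, spare=100)
failed_location = rewrite_sector(disk, 2, spare=100)
```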
Several methods have been developed for reassigning defective data sites on a disk. Chan (U.S. Pat. No. 5,271,018) describes a data site slipping scheme to reassign defective data sites on a disk's surface. In data site slipping, a defined number of spare sites are located at the end of a data track, data partition, or data zone for handling defects. When a defect is discovered, the defective data site is marked as defective (i.e., to be skipped in the future), the data from the next non-defective data site is saved in a data buffer, and the data from the defective data site is “slipped” to the next non-defective data site. The saved data from the data buffer is then slipped in a similar manner to the next non-defective data site until the last slipped data site occupies the first available spare data site. While the data site slipping scheme of Chan allows data to remain contiguous, the scheme of Chan may result in time-consuming, inefficient reassignments of data blocks when a defective site is discovered, thereby impacting disk drive performance.
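By way of illustration only, the general idea of data site slipping may be sketched on a single track as follows. This is a simplified model written for this example, not the implementation of Chan; the function name `slip_defect` and the list-based track layout are assumptions.

```python
from typing import List, Optional, Set


def slip_defect(track: List[Optional[str]], defective: Set[int], bad_index: int) -> int:
    """Slip data past a newly found defect; return the number of blocks moved.

    `track` holds one payload per physical site, with None for unused spares
    at the end of the track.
    """
    defective.add(bad_index)          # mark the site so it is skipped in future
    carried = track[bad_index]        # data that must move to the next good site
    track[bad_index] = None
    moves = 0
    i = bad_index + 1
    while carried is not None:
        while i in defective:         # skip sites already marked defective
            i += 1
        if i >= len(track):
            raise RuntimeError("no spare data site left on this track")
        # Save the occupant in the buffer and store the carried data in its place.
        carried, track[i] = track[i], carried
        moves += 1
        i += 1
    return moves


track = ["A", "B", "C", "D", None, None]
defective: Set[int] = set()
moves = slip_defect(track, defective, bad_index=1)
```

In this sketch a single defect near the start of the track moves every following block by one site, which shows why slipping keeps data contiguous yet can be costly when many blocks follow the defect.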
Bish et al. (U.S. Pat. No. 5,235,585) provides another method for reassigning defective data sites on a disk via a vectoring operation. Bish et al. locates spare data sites for replacing grown secondary defects found during use. A secondary defect management table is maintained for tracking the spare data sites that have been previously used for replacements of other secondary defects. When a secondary defect address is found, a secondary defect list is updated both on the disk and in the drive's memory. A spare data site for replacing the secondary defect is located by first calculating a group number to determine which group the secondary defect is located in. Next, the boundaries for a plurality of spare data sites allocated for that group are determined. After searching the replacement data site list, the first available spare data site of the plurality of spare data sites is identified as the replacement data site. If all of the plurality of spare data sites have been previously assigned as replacement data sites, a spare data site is instead located from a plurality of spare data sites allocated to a neighboring group. After locating the spare data site, its physical address is returned and the spare data site is logged in the secondary defect management table as a “vectored” data site. The reassignment scheme of Bish eliminates the inefficiencies of the Chan slip data site reassignment scheme, since only a single block of data is moved during the reassignment. However, reassigning data to non-contiguous spare data sites eliminates contiguous ordering of the data, resulting in additional seeks to and from the spare data site location during a read operation.
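By way of illustration only, a group-based search for a spare data site may be sketched as follows. This is a simplified model written for this example, not the implementation of Bish et al. The group number calculation by integer division, the order in which neighboring groups are tried, and the name `locate_spare` are all assumptions.

```python
from typing import Optional, Set, Tuple


def locate_spare(
    defect_site: int,
    sites_per_group: int,
    spares_per_group: int,
    num_groups: int,
    used: Set[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Return (group, slot) of the first free spare, trying the home group first."""
    group = defect_site // sites_per_group  # assumed group number calculation
    # Assumed neighbor order: the following group, then the preceding group.
    candidates = [group] + [g for g in (group + 1, group - 1) if 0 <= g < num_groups]
    for g in candidates:
        for slot in range(spares_per_group):
            if (g, slot) not in used:
                used.add((g, slot))  # log the spare as assigned
                return (g, slot)
    return None


used: Set[Tuple[int, int]] = set()
first = locate_spare(250, 100, 2, 4, used)
second = locate_spare(251, 100, 2, 4, used)
third = locate_spare(252, 100, 2, 4, used)
```

Only one block moves per reassignment in such a scheme, but the third defect in this example lands in a different group from its neighbors, which illustrates the loss of contiguous ordering.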
Rewriting data to the same data sector is almost always preferable to reassigning data to an alternate data sector. By rewriting data to the same data sector, the inefficiencies of moving potentially thousands of blocks of data via the data slipping scheme described in Chan can be avoided. Also, rewriting preserves the contiguous ordering of the data. Thus, during read operations, no time-consuming data vector seek operations to alternative data sectors in a remote area of the disk are necessary, as are required by reassignment schemes such as that described in Bish et al.
Unfortunately, the rewrite operation is sometimes unsuccessful even after multiple retries, so a reassignment operation is required to preserve the data. Also, the rewrite operation is susceptible to data loss. Even though data is preserved in a data buffer while the rewrite operation is attempted, the data in the data buffer can be lost if a power cycle occurs while the rewrite operation is executing. Furthermore, the time required to reassign the sector is sequential and additive to the rewrite process.
There is therefore a continuing need for a defect management scheme which minimizes operational inefficiencies while dealing effectively with defective data sites which are discovered during disk operations.
The present invention provides a method for operating a disk drive, including recovering data from a marginally defective data site on a disk surface. The hard disk drive employs a vector reassignment method to reassign data from the defective data sites to spare data sites. The method for operating a disk drive begins by reading a block of data from a data site. The method then determines whether the data site is a marginally defective data site. Next, the method first writes the block of data to a spare data site within a pool of spare data sites. The method then determines whether the marginally defective data site is a defective data site. If the marginally defective data site is determined to be a defective data site, the method marks the marginally defective data site as a defective data site, adds the defective data site to a list of defective data sites, and updates a vector reassignment table to cross-reference the defective data site to the spare data site.
In one embodiment of the present invention, the step of determining whether the marginally defective data site is a defective data site further includes the steps of writing the block of data back to the marginally defective data site and verifying that the block of data written back to the marginally defective data site can be successfully read.
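By way of illustration only, the sequence of steps summarized above may be sketched as follows: the block is written to a spare data site first, then written back to the original site and verified, and the reassignment is recorded only if verification fails. `ToyDisk`, the function name, and the returned status strings are assumptions for this example.

```python
class ToyDisk:
    """Toy disk (assumption): a write to a site in `worn` does not store usable data."""

    def __init__(self, contents, worn=()):
        self.cells = dict(contents)
        self.worn = set(worn)

    def read(self, site):
        return self.cells[site]

    def write(self, site, data):
        self.cells[site] = b"" if site in self.worn else data


def rewrite_with_embedded_reassign(disk, site, spare, is_marginal, defect_list, vector_table):
    """Return 'ok', 'rewritten', or 'reassigned' for one data site."""
    block = disk.read(site)
    if not is_marginal(site):
        return "ok"
    disk.write(spare, block)        # first, place a copy on the spare data site
    disk.write(site, block)         # then write the block back to the original site
    if disk.read(site) == block:    # verify that the block can be read back
        return "rewritten"
    defect_list.append(site)        # mark the site as a defective data site
    vector_table[site] = spare      # cross-reference defect to spare
    return "reassigned"


disk = ToyDisk({10: b"good", 11: b"weak"}, worn=[11])
defect_list, vector_table = [], {}
marginal = lambda site: True  # assumption: every site here was scored as marginal
status_a = rewrite_with_embedded_reassign(disk, 10, 500, marginal, defect_list, vector_table)
status_b = rewrite_with_embedded_reassign(disk, 11, 501, marginal, defect_list, vector_table)
```

Because the spare already holds the block before the write-back begins, recording the reassignment in this sketch requires only table updates and no further data transfer.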
In one embodiment of the present invention, a scoring method is utilized to determine whether a data site is a marginally defective data site. In a preferred embodiment, the scoring method includes an analysis of the number of retries encountered while reading the data site during the normal operation of the disk.
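By way of illustration only, a scoring method based on retry counts may be sketched as follows. The class name `RetryScore`, the cumulative scoring rule, and the threshold value are assumptions for this example; the text above specifies only that the number of retries is analyzed.

```python
from typing import Dict


class RetryScore:
    """Accumulates read retries per data site and flags sites over a threshold."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold          # assumed cutoff for "marginal"
        self.retries: Dict[int, int] = {}

    def record_read(self, site: int, retries_used: int) -> None:
        """Add the retries encountered on one read of `site` to its score."""
        self.retries[site] = self.retries.get(site, 0) + retries_used

    def is_marginal(self, site: int) -> bool:
        return self.retries.get(site, 0) >= self.threshold


score = RetryScore(threshold=3)
score.record_read(42, 1)
score.record_read(42, 2)
score.record_read(43, 1)
```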
Checkpoint information is saved after selected steps of the method, such that if an interruption occurs during the data recovery, the data recovery process may be completed. More specifically, if an interruption occurs while writing the block of data back to the marginally defective data site, on resuming operation after the interruption the defective data site is entered in a defect table and the data site is reassigned to the spare data site.
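By way of illustration only, the use of checkpoint information may be sketched as follows. The dictionaries standing in for the disk and for non-volatile storage, the `PowerLoss` exception, and the function names are assumptions for this example. Verification of the write-back is omitted because a dictionary-backed toy disk cannot fail it.

```python
class PowerLoss(Exception):
    """Simulated interruption; stands in for a power cycle."""


def recover_site(disk, nv, site, spare, interrupt=False):
    """Copy to spare, save a checkpoint, then write back to the original site."""
    block = disk[site]
    disk[spare] = block                                # the copy lands on the spare first
    nv["checkpoint"] = {"site": site, "spare": spare}  # saved after the spare write
    if interrupt:
        raise PowerLoss()                              # interruption during the write-back
    disk[site] = block                                 # write back to the original site
    del nv["checkpoint"]                               # finished; nothing to resume
    return "rewritten"


def resume_after_interruption(nv, defect_list, vector_table):
    """Complete an interrupted recovery by reassigning to the spare data site."""
    checkpoint = nv.pop("checkpoint", None)
    if checkpoint is None:
        return False
    defect_list.append(checkpoint["site"])                  # enter the site in the defect table
    vector_table[checkpoint["site"]] = checkpoint["spare"]  # reassign to the spare
    return True


disk = {5: b"block"}
nv, defect_list, vector_table = {}, [], {}
try:
    recover_site(disk, nv, site=5, spare=900, interrupt=True)
except PowerLoss:
    pass
resumed = resume_after_interruption(nv, defect_list, vector_table)
```

Since the spare data site already holds a valid copy when the checkpoint is saved, the block survives the interruption even though the data buffer does not.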
In one embodiment of the present invention, the step of updating the vector reassignment table to cross-reference the defective data site to the spare data site includes defining a vector cross-reference entry for the vector reassignment table associated with the list of defective data sites. The vector cross-reference entry includes a reference to the defective data site which is reassigned and a corresponding reference to the spare data site. The step of updating the vector reassignment table to cross-reference the defective data site to the spare data site also includes making a vector cross-reference entry in the vector reassignment table to cross-reference the defective data site with the corresponding spare data site. The spare data site is chosen from one or more spare data sites located on the disk. The one or more spare data sites are contiguously grouped into a pool of spare data sites. In one embodiment of the present invention, a single pool of spare data sites exists on each disk.
If the marginally defective data site is determined to be a defective data site, the method further includes the step of incrementing a pointer to point to the next available spare data site within the pool of spare data sites. In one embodiment, the method of the present invention is implemented in firmware residing within a disk control system of the disk drive.
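By way of illustration only, a contiguous pool of spare data sites with a next-available pointer may be sketched as follows. The class name `SparePool` and the method names `peek` and `commit` are assumptions for this example. The pointer advances only when a reassignment is confirmed, so a spare that was written but not needed remains available.

```python
class SparePool:
    """Contiguous run of spare data sites with a pointer to the next free one."""

    def __init__(self, first_site: int, count: int):
        self.first_site = first_site
        self.count = count
        self.next_index = 0  # pointer to the next available spare data site

    def peek(self) -> int:
        """Return the next available spare without consuming it."""
        if self.next_index >= self.count:
            raise RuntimeError("spare pool exhausted")
        return self.first_site + self.next_index

    def commit(self) -> int:
        """Consume the next available spare once the site is confirmed defective."""
        site = self.peek()
        self.next_index += 1
        return site


pool = SparePool(first_site=1000, count=2)
tentative = pool.peek()       # spare used for the first, unconfirmed copy
still_free = pool.peek()      # the rewrite succeeded, so the same spare is offered again
confirmed = pool.commit()     # a later site is confirmed defective
next_spare = pool.peek()
```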
The present invention also provides a method for recovering data from a marginally defective data site on a disk surface. This method begins by providing a pool of spare data sites. The method also provides a read error recovery procedure, where the read error recovery procedure includes a plurality of recovery steps. The method next reads a data block from a user data site, and if an error occurs while reading the data block, the method executes the read error recovery procedure to recover the data block. The method then determines whether the user data site is marginally defective, and conditionally reassigns the data block to a conditionally reassigned spare data site within the pool of spare data sites. Next, the method writes a copy of the data block to the conditionally reassigned spare data site, and also writes the data block to the user data site. The method then performs a predetermined subset of the plurality of recovery steps to recover the data block. If the data block is not recovered within the predetermined subset, the method reassigns the data block to the conditionally reassigned spare data site.
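By way of illustration only, the second method may be sketched as follows. The toy disk model, in which each site needs a certain number of recovery steps before it can be read and a rewrite cures only sites that are not worn, is an assumption for this example, as are the step counts, the `marginal_at` cutoff, and the returned status strings.

```python
class ToyDisk:
    """Toy disk (assumption): `difficulty[site]` recovery steps are needed to read a site."""

    def __init__(self, cells, difficulty, worn=()):
        self.cells = dict(cells)
        self.difficulty = dict(difficulty)
        self.worn = set(worn)  # sites that a rewrite does not cure

    def try_read(self, site, effort):
        """Return the data if `effort` recovery steps suffice, else None."""
        return self.cells[site] if effort >= self.difficulty.get(site, 0) else None

    def write(self, site, data):
        self.cells[site] = data
        if site not in self.worn:
            self.difficulty[site] = 0


def recover(disk, site, max_steps):
    """Apply recovery steps 1..max_steps in turn; return (data, steps_used)."""
    for effort in range(1, max_steps + 1):
        data = disk.try_read(site, effort)
        if data is not None:
            return data, effort
    return None, max_steps


def read_with_conditional_reassign(disk, site, spare, vector_table,
                                   total_steps=8, subset=2, marginal_at=4):
    if disk.try_read(site, 0) is not None:      # plain read succeeded
        return "ok"
    data, used = recover(disk, site, total_steps)
    if data is None:
        return "unrecoverable"
    if used < marginal_at:                      # assumed test for "marginally defective"
        return "recovered"
    disk.write(spare, data)                     # copy to the conditionally reassigned spare
    disk.write(site, data)                      # write the block to the user data site
    reread = disk.try_read(site, 0)
    if reread is None:
        reread, _ = recover(disk, site, subset)  # only the predetermined subset of steps
    if reread is not None:
        return "rewritten"
    vector_table[site] = spare                  # make the conditional reassignment final
    return "reassigned"


disk = ToyDisk({1: b"a", 2: b"b", 3: b"c"}, difficulty={2: 5, 3: 5}, worn=[3])
vector_table = {}
results = [read_with_conditional_reassign(disk, s, 900 + s, vector_table) for s in (1, 2, 3)]
```

In this sketch, site 2 is cured by the rewrite and keeps its original location, while site 3 still needs more than the permitted subset of recovery steps and is therefore reassigned to the spare that already holds its data.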
In one embodiment of the present invention, the step of reassigning the data block to the conditionally reassigned spare data site further includes marking the marginally defective data site as a defective data site, adding the user data site to a list of defective data sites, and updating a vector reassignment table to cross-reference the defective data site to the spare data site.
In one embodiment of the present invention, the step of updating the vector reassignment table to cross-reference the defective data site to the spare data site further includes defining a vector cross-reference entry for the vector reassignment table associated with the list of defective data sites. The vector cross-reference entry includes a reference to the defective data site which is reassigned and a corresponding reference to the spare data site. The step of updating the vector reassignment table to cross-reference the defective data site to the spare data site also includes making a vector cross-reference entry in the vector reassignment table to cross-reference the defective data site with the corresponding spare data site.
In this second method, checkpoint information is saved after selected steps of the method, such that if an interruption occurs during the data recovery, the data recovery process may be completed. More specifically, if an interruption occurs while writing the block of data back to the marginally defective data site, on resuming operation after the interruption the defective data site is entered in a defect table and the data site is reassigned to the spare data site.