The present invention relates generally to storage and retrieval of data on magnetic media and more particularly to a method of detecting a particular fault as a cause of data transfer error.
Controller electronics for a tape drive may include error correction and detection circuitry (ECC) to detect and correct data transfer errors in data retrieved from and written to a tape on a tape drive. ECC is applied to data "on-the-fly" as data is transferred to or from the media. Severe faults may render some errors unrecoverable by the ECC engine, compromising the integrity of the data. Tape drive firmware may include a module that may be activated to recover data that the ECC has been unable to recover.
A fault causing a data transfer error may originate in the media, the read/write transducers, or the drive electronics. However, it may be difficult for the controller to determine where the fault resides and consequently to apply an effective recovery. Existing non-ECC error recovery methods typically consist of a sequence of predetermined error recovery procedures (ERP). An ERP may include: multiple attempts to read or write the data; a re-tensioning of the tape followed by an attempt to reread the data; changing the channel filter parameters and retrying; tape head cleaning operations; and other similar rehabilitative measures. These ERPs are applied in a predetermined sequence regardless of the nature of the fault that caused the data error.
In an attempt to reread the data, the tape is reversed and repositioned back to a ramp-up point before the target data block and accelerated to the target data block to be read (or written) again. Every time an attempt to reread the data fails, the tape is reversed, repositioned and then forwarded for the next attempt to reread the data. If the non-ECC recovery consists of a sequence of 20 retries, then the tape has to be repositioned 20 times, making the error recovery attempt very time consuming. The same sequence of ERP is applied regardless of the nature of the fault that caused the read failure. Some of the ERP may not remove the fault and, to that extent, they are applied unnecessarily, wasting time. In the event that the data transfer error is caused by debris at the head/media interface, multiple read or write cycles most likely will not result in recovery of data.
Other solutions for recovering lost data not recovered by ECC have relied on a brute force approach to recover data. These methods are extremely memory intensive and hence costly.
The present invention is directed to a method for identifying faults that contribute to a data transfer error. In one aspect of the invention, the method identifies a relatively high probability that a specific fault is causing a data transfer error. In one case, the method is applied to identify a relatively high probability that a data transfer error is caused by debris at a head/media interface. The method includes a data error comparison step followed by the application of an error recovery procedure, or a sequence of error recovery procedures, having a relatively higher probability of eliminating the fault, allowing quicker recovery of the data.
Normally, data is written on tape in blocks. According to the present invention, a short-term error sample is defined as the number of bytes of data transfer error in a predetermined number of data blocks, divided by the total number of bytes transferred in the predetermined number of data blocks. A window is defined by a predetermined number of short-term error samples. A short-term error sample process monitors the predetermined number of short-term error samples within the window. The long-term error rate is defined as the total number of bytes in error for all data blocks transferred divided by the total number of bytes transferred in all data blocks.
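The definitions above can be illustrated with a minimal sketch for a single channel. The class name `ErrorRateMonitor`, its method names, and the default values for blocks per sample and window size are assumptions made for illustration only; the text does not specify them.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical per-channel monitor for short-term samples and long-term rate."""

    def __init__(self, blocks_per_sample=4, window_size=10):
        # Both defaults are illustrative; the text only says "predetermined".
        self.blocks_per_sample = blocks_per_sample
        # The window keeps only the most recent short-term error samples.
        self.window = deque(maxlen=window_size)
        self._sample_errors = 0
        self._sample_bytes = 0
        self._sample_blocks = 0
        self._total_errors = 0
        self._total_bytes = 0

    def record_block(self, bytes_in_error, bytes_transferred):
        """Account for one transferred data block."""
        self._sample_errors += bytes_in_error
        self._sample_bytes += bytes_transferred
        self._sample_blocks += 1
        self._total_errors += bytes_in_error
        self._total_bytes += bytes_transferred
        if self._sample_blocks == self.blocks_per_sample:
            # Short-term sample: bytes in error / bytes transferred
            # over the predetermined number of blocks.
            if self._sample_bytes:
                self.window.append(self._sample_errors / self._sample_bytes)
            self._sample_errors = 0
            self._sample_bytes = 0
            self._sample_blocks = 0

    def long_term_rate(self):
        """Bytes in error over all blocks / bytes transferred in all blocks."""
        if self._total_bytes == 0:
            return 0.0
        return self._total_errors / self._total_bytes


# Usage: two blocks per sample, a window of three samples.
monitor = ErrorRateMonitor(blocks_per_sample=2, window_size=3)
for errors, size in [(1, 100), (3, 100), (0, 100), (0, 100)]:
    monitor.record_block(errors, size)
```

Keeping only running counters and a fixed-size window is one way to keep the memory footprint small, which is consistent with the low-memory goal stated in this document.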
During normal operating conditions, where the head/media interface is free of debris, short-term error samples may exhibit values slightly and randomly larger or smaller than the long-term error rate. As debris accumulates gradually at the head/media interface, the electrical signal picked up by the transducer weakens gradually, the number of bytes in error increases and, consequently, the short-term error samples degrade compared to the long-term error rate. This degradation may be indicated by a gradual yet consistent increase in short-term error sample values. Alternatively, in the event that debris attaches at the head/media interface abruptly, values for all short-term error samples will be greater than the long-term error rate.
In one embodiment of the invention, short-term data error samples and a long-term data error rate are calculated and monitored. In the event that the ECC engine fails to recover data, a non-ECC error recovery module may be invoked. The non-ECC error recovery module compares short-term data error samples and their deviation from the long-term error rate. If values for short-term error samples have deteriorated gradually compared to the long-term error rate, then it is likely that the data transfer error is caused by debris at the head/media interface. Since debris at the head/media interface can affect one or more channels, long-term error rates and short-term error samples may be monitored for all channels. For example, short-term error samples may be defined as S(j,k), where j is the channel number and k is the sample number. The long-term error rate may be defined as L(j), where j is the channel number. If all S(j,k) > L(j), then a head-clean cycle is invoked followed by an attempt to reread the data. If S(j,1) < S(j,2) < . . . < S(j,10), then a head-clean cycle is invoked followed by an attempt to reread the data. In either case, the head-clean cycle operates to remove debris accumulated at the head/media interface. If short-term error samples have deteriorated neither gradually nor abruptly, then the data transfer error is commonly caused by a transient condition, and a simple attempt to reread the data is oftentimes sufficient to recover the data.
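The comparison step described above can be sketched as follows. This is an illustrative reading, not a definitive implementation: the function names, the dictionary layout keyed by channel number, and the action labels `"head_clean"` and `"reread"` are hypothetical. The sketch also assumes that a match on any single channel is enough to invoke the head-clean cycle, which the text does not state explicitly.

```python
def debris_suspected(samples, long_term_rate):
    """Return True if one channel's samples match either debris pattern.

    samples is the window S(j,1)..S(j,n) for channel j, oldest first;
    long_term_rate is L(j) for the same channel.
    """
    if len(samples) < 2:
        # Too few samples to judge a trend (assumption, not from the text).
        return False
    # Abrupt attachment: every short-term sample exceeds the long-term rate.
    abrupt = all(s > long_term_rate for s in samples)
    # Gradual accumulation: samples increase strictly from one to the next.
    gradual = all(a < b for a, b in zip(samples, samples[1:]))
    return abrupt or gradual


def choose_recovery(samples_by_channel, long_term_by_channel):
    """Pick a recovery sequence after the ECC engine has failed."""
    for channel, samples in samples_by_channel.items():
        if debris_suspected(samples, long_term_by_channel[channel]):
            # Debris is likely: clean the head, then reread.
            return ["head_clean", "reread"]
    # No debris pattern: treat the error as transient and simply reread.
    return ["reread"]


# Usage: channel 0 shows a steady increase, channel 1 looks random.
plan = choose_recovery(
    {0: [0.01, 0.02, 0.03], 1: [0.03, 0.01, 0.02]},
    {0: 0.05, 1: 0.02},
)
```

Here channel 0 matches the gradual pattern even though its samples stay below L(0), so the sketch returns the head-clean sequence for the drive as a whole.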
This invention may reduce the time to perform non-ECC error recovery procedures as unnecessary error recovery procedures are not performed. This invention may be employed in a linear tape drive where multiple read/write elements are used to read/write data simultaneously on data tracks on the magnetic tape. The method of debris detection according to the present invention may reduce the time taken to recover data, increasing the data transfer rate performance. The invention is simple and consequently reduces the amount of system memory used, reducing the cost of implementation.