The invention relates generally to an apparatus and method for correcting data recovered from a peripheral data storage system in a data processing system. The invention more particularly relates to correction of data files recovered from disk drive arrays.
A peripheral data storage system is a type of secondary memory accessible by a data processing system, such as a computer. Memory, in a computer, is where programs and work files are stored as digital data. Computer memory can include either, and commonly includes both, moving-type memory and non-moving-type memory. Non-moving-type memory is typically directly addressed by the computer's central processing unit. Moving-type memory, such as magnetic disk drives and magnetic tape, is not directly addressed, and is commonly referred to as secondary memory or peripheral memory.
Moving memory typically has much greater data storage capacity than directly addressed memory and has much longer access times. Moving memory is also typically nonvolatile; that is, its contents survive turning the computer off. Non-moving-type memory is typically faster and more expensive per unit of storage than moving-type memory, and has less capacity. Moving-type memories are generally used for long-term storage of large programs and substantial bodies of information, such as database files, which are not in constant use by the computer, or which are too bulky to hold in short-term, directly addressed memory.
The storage media of the moving-type memory are physically alterable objects. That is to say, they can be magnetized, grooved, pitted or altered in some detectable fashion to record information. Preferably the storage media are at the same time physically resilient, portable, cheap, of large capacity, and resistant to accidental alteration. An example of an analogous medium is a phonograph record where a wavy spiral groove represents an analog information signal. The various species of storage media used in moving-type memory for computers include magnetic tape, floppy disks, compact disk-ROM, Write-Once, Read-Many optical disks and, most recently, erasable magneto-optic disks. Each of these storage media exhibits detectable physical changes representing binary data. To read, and where applicable to erase and write, data on the media, mechanical apparatuses are provided which can be directed to the proper location on the physical media and carry out the desired function.
A magnetic disk drive includes a transducer, a magnetic media disk and associated electronic circuitry to drive or monitor the transducer and transfer the data between the physical medium and the computer to which the drive is connected. The conversion of data from electronic signal to physical feature for storage, extended retention of the data as physical features, and the conversion of physical feature back to electronic signal offers numerous opportunities for error to be introduced to the data record. Transducer and disk present to one another surfaces that are in essentially constant motion. While the interaction between transducer and media is magnetic, the interaction between their surfaces is a mechanical one, affected over time by factors such as friction, wear, media flaking and collisions between the transducer and the media. All of these occurrences can be sources of error. In addition, both individual transducers and magnetic media surfaces are subject to failures, such as opening of a transducer electromagnet winding or imperfections in the magnetic media surface. Mechanical failure can affect the magnetic interaction of the transducer and the media, and consequently can affect the ability of data recovery circuitry to read data records from the media surface. The faults described above, and others too numerous to discuss in any detail here, can result in loss of data records ranging from one bit to all of the data of the record stemming from failure of an entire drive. The possibility of error in readback, and error or loss of data on the disk, has led to the incorporation of error detection and correction techniques in disk drives.
Data in digital processing systems is typically stored to a disk drive by sectors. Such sectors will include certain redundant information to be used for checking the accuracy of the record and possible correction of the sector upon readback. An example of such redundant information is "Error Checking and Correcting Code", sometimes referred to as "Error Correction Code" or "ECC".
An Error Correction Code for a sector or record will literally comprise data bits supplementing the regular data bits of the record. Where ECC is used, each record conforms to specific rules of construction which permit use of the supplemental bits to detect and, under certain circumstances, correct errors in the record. In actual use, error syndromes are generated from the data and the ECC upon readback of the sector. Non-zero error syndromes indicate that error is present. Where error is "random", which for purposes of this patent means error within the capacity of the ECC to correct, the error syndromes can be used to correct the sector. However, operating on the error syndromes to correct error is relatively time consuming in terms of performance of the disk drive in a computer. By contrast, analysis of the error syndromes to determine whether error is correctable, that is, whether error is random or massive, can be done in a relatively short period of time.
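The relationship between supplemental bits, syndromes, and correction can be illustrated with a small sketch. The example below uses a Hamming(7,4) code purely as a stand-in, since it is short enough to follow by hand; it is not the code used in any particular disk drive, and the function names are hypothetical.

```python
# Illustrative only: Hamming(7,4) stands in for a drive's actual ECC.
# Codeword positions are numbered 1..7; positions 1, 2 and 4 hold check bits.

def hamming74_encode(data_bits):
    """Append three check bits to four data bits, giving a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def syndrome(word):
    """Recompute the checks on readback; a result of 0 means no error detected."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 | (s2 << 1) | (s3 << 2)

def correct(word):
    """Use a non-zero syndrome as the 1-based position of a single-bit error."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1
    return fixed

codeword = hamming74_encode([1, 0, 1, 1])
damaged = list(codeword)
damaged[4] ^= 1                      # flip one bit, as a readback error might
recovered = correct(damaged)
```

In this toy code a single flipped bit is within the capacity of the code and is repaired. Two or more flipped bits exceed that capacity, which corresponds loosely to the "massive" error case described above.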
A second example of redundant data is parity data. Parity data is generated for a logically associated group of binary bits, typically by a modulo-2 addition operation on the group to generate a check digit. An example of a check digit in a binary system is where the digits are logically "exclusive-ORed", generating a "1" if the number of "1's" in the group is odd or a "0" otherwise. Parity, absent information indicating the location of an error, is usable only to identify the existence of error. However, given independent identification of the location of the error in a group, the use of parity permits rapid correction of the error. Only one bit of error per parity bit can be corrected.
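As a minimal sketch of the parity scheme just described, the following assumes a group of bits held in a Python list. The function names and the `bad_index` parameter are hypothetical and introduced only for illustration.

```python
from functools import reduce
from operator import xor

def parity_bit(bits):
    """Modulo-2 sum of the group: 1 if the count of 1's is odd, else 0."""
    return reduce(xor, bits, 0)

def has_error(bits, stored_parity):
    """Without a location, parity can only reveal that an error exists."""
    return parity_bit(bits) != stored_parity

def correct_at(bits, stored_parity, bad_index):
    """With the error location known, rebuild that bit from the others."""
    others = [b for i, b in enumerate(bits) if i != bad_index]
    fixed = list(bits)
    fixed[bad_index] = parity_bit(others) ^ stored_parity
    return fixed

group = [1, 0, 1, 1]
check = parity_bit(group)            # stored alongside the group
readback = [1, 1, 1, 1]              # bit 1 was altered on the medium
detected = has_error(readback, check)
repaired = correct_at(readback, check, bad_index=1)
```

The correction step does not inspect the suspect bit at all; it simply recomputes that bit from the remaining bits and the stored parity, which is why the location must be supplied independently.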
A computer can use more than one disk drive for the storage of data. Alternatively, a plurality of disk drives can be organized to operate together and thereby appear to the computer as one peripheral storage unit. The term disk drive array is used in this patent to indicate a group of disk drives operating in a parallel, synchronous fashion, allowing transfer of the data bits of a record in parallel to the individual drives of the array and appearing to the computer as a single data storage peripheral device. An interface operating between the disk drive array and the data processing system transfers data to and from storage in parallel to increase data transmission bandwidth.
Such parallel, synchronized disk drives have characteristics offering opportunity for improvement in redundancy for stored data. Simplistically, data on one disk drive can be mirrored on another. In U.S. Pat. No. 4,775,978, assigned to the present assignee, the present inventor taught a "Data Error Correction System" for a mass data storage system comprising an array of synchronized disk data storage units. U.S. Pat. No. 4,775,978 is expressly incorporated herein by reference. The data storage system receives data blocks from a host data processing system for storage. A data block divider stripes the data among a plurality of the disk data storage units as columns. A data word is divided among the plurality of disk drives, one bit to each drive. Data stored to the same address on each disk are supplemented by a column of parity data stored to yet another data storage unit. Each data column, assigned to a particular disk drive, is supplemented with an error correction code. Upon readback, two levels of data error correction are provided. Data columns are read and stored to buffers. Error syndromes are generated upon readback and used to correct random error. When a data storage unit fails, or where the error correction code for a column is inadequate to correct errors in that column, the column (i.e. data storage unit) where the error occurs is identified, and the parity data in combination with the data of the other columns is used to reconstruct the data in the missing column. The system provides a highly fault tolerant data storage system.
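The second level of correction described above, rebuilding an identified column from the parity column and the surviving data columns, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of U.S. Pat. No. 4,775,978: each drive is modeled as a Python list of bits, per-column ECC is omitted, and the names `stripe` and `reconstruct` are hypothetical.

```python
def stripe(words, n_drives):
    """Divide each n_drives-bit word one bit per drive and add a parity column.

    Returns n_drives + 1 columns; the last column holds the parity bits.
    """
    columns = [[] for _ in range(n_drives + 1)]
    for word in words:
        # Most significant bit goes to drive 0.
        bits = [(word >> (n_drives - 1 - i)) & 1 for i in range(n_drives)]
        parity = 0
        for i, bit in enumerate(bits):
            columns[i].append(bit)
            parity ^= bit
        columns[n_drives].append(parity)
    return columns

def reconstruct(columns, failed):
    """Rebuild the column at index `failed` by XOR across all other columns."""
    survivors = [col for i, col in enumerate(columns) if i != failed]
    rebuilt = []
    for row in zip(*survivors):
        bit = 0
        for b in row:
            bit ^= b
        rebuilt.append(bit)
    return rebuilt

columns = stripe([0b1011, 0b0110], n_drives=4)
lost = columns[2]                    # suppose drive 2 is identified as failed
rebuilt = reconstruct(columns, failed=2)
```

Because every row XORs to zero across the data and parity columns, the same routine rebuilds the parity column itself if that is the unit that fails. The sketch relies on the failed column being identified by other means, as the text above requires.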