Error correction techniques are known in digital data processing systems, including memory stores such as disk drive data storage subsystems and tape drive data storage subsystems, and they are known in digital data communications systems. A methodology is selected and implemented which generates syndrome information from substantially contiguous blocks, sets or subsets of a digital data stream. By "contiguous" is meant that the blocks occur one immediately after another, with minimum interval or time lapse between or separating the blocks. The syndrome information is appended to the block, set or subset at an encoder end, typically within a data writing or transmission process. At a data reading or reception process, the received error correction syndrome information, called herein the "remainder", is obtained from the received data block and compared with zero. If there are any non-zero remainder values or bytes, an error is determined to be present. Depending upon the nature and amount of appended syndrome information, it is possible to detect and correct one or more error bursts in the data block.
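By way of illustration only, the append-and-check cycle described above may be sketched as polynomial division over a one bit Galois field. The generator polynomial x^4 + x + 1, the sample data bits, and the function names `remainder` and `encode` are assumptions chosen for the example; they are not taken from any referenced patent.

```python
def remainder(bits, generator):
    """Remainder of `bits` (a polynomial, highest order first) divided by
    `generator`, with all arithmetic over GF(2)."""
    work = list(bits)
    degree = len(generator) - 1
    for i in range(len(work) - degree):
        if work[i]:
            # Subtracting over GF(2) is a bitwise Exclusive OR.
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return work[-degree:]


def encode(data, generator):
    """Append check bits so that the whole block divides evenly."""
    degree = len(generator) - 1
    check = remainder(list(data) + [0] * degree, generator)
    return list(data) + check


GENERATOR = [1, 0, 0, 1, 1]      # x^4 + x + 1 (illustrative assumption)
data = [1, 0, 1, 1, 0, 0, 1]     # arbitrary sample data bits

block = encode(data, GENERATOR)  # writing or transmission end
clean = remainder(block, GENERATOR)

corrupted = list(block)
corrupted[3] ^= 1                # one bit in error at the reading end
dirty = remainder(corrupted, GENERATOR)

print("clean remainder:", clean)      # all zero: no error detected
print("corrupted remainder:", dirty)  # non-zero: error detected
```

A non-zero remainder only signals that an error is present; locating and correcting the error requires further processing of the remainder values.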
An error burst is a set of adjacent bit positions which are in error. A single error burst is one occurrence of such erroneous bit positions within a data block, and can extend over the interleaves thereof. A double error burst is two separate occurrences of error burst within the data block, for example. Error correction systems and strategies are typically rated upon their ability to locate and correct single burst and double burst errors, as well as the speed of execution and a low probability of miscorrection of error bursts by the error correction strategy.
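The definition of an error burst may be illustrated by comparing a recorded block with the block as read back. The following sketch assumes that both blocks are available as bit lists of equal length, which is the case only in a test setting; the function name `find_bursts` is hypothetical.

```python
def find_bursts(sent, received):
    """Return (start, length) for each run of adjacent bit positions in error."""
    if len(sent) != len(received):
        raise ValueError("blocks must have the same length")
    bursts = []
    start = None
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            if start is None:
                start = i                      # a new burst begins here
        elif start is not None:
            bursts.append((start, i - start))  # the burst ended at i - 1
            start = None
    if start is not None:                      # burst runs to the end of the block
        bursts.append((start, len(sent) - start))
    return bursts


sent = [0] * 16
received = list(sent)
for position in (3, 4, 5, 10, 11):
    received[position] ^= 1

# Two separate occurrences within one block: a double error burst.
print(find_bursts(sent, received))
```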
In the prior art approach, a syndrome converter is typically implemented as a shift register with taps and with external multiplier elements. In an external form, a summer combined multiplier terms from the shift register with the incoming data stream in order to generate the syndrome. In commonly assigned U.S. Pat. No. 4,730,321, to Machado, the disclosure of which is hereby incorporated by reference, an internal form was used in which the shift register includes the summing nodes, in order to permit logical elements to be shared and thereby minimize the number of structural elements required to implement the error correction hardware.
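A bit-serial software model of an internal form shift register, in which the summing nodes lie between register stages, may be sketched as follows. The model is a simplification over a one bit Galois field and does not reproduce the circuit of any referenced patent; the polynomial x^4 + x + 1, the register width, and the name `lfsr_remainder` are assumptions made for the example.

```python
def lfsr_remainder(bits, taps, width):
    """Clock `bits` (highest order first) through an internal form shift
    register. `taps` holds the generator coefficients below x^width."""
    register = 0
    mask = (1 << width) - 1
    for bit in bits:
        # The incoming bit is summed with the bit leaving the register.
        feedback = ((register >> (width - 1)) & 1) ^ bit
        register = (register << 1) & mask
        if feedback:
            register ^= taps   # summing nodes inside the register
    return register


TAPS = 0b0011                  # x^4 + x + 1 without its leading term (assumed)
WIDTH = 4

data = [1, 0, 1, 1, 0, 0, 1]
check = lfsr_remainder(data, TAPS, WIDTH)
check_bits = [(check >> k) & 1 for k in reversed(range(WIDTH))]

# Clocking the data followed by its check bits leaves the register at zero.
print(lfsr_remainder(data + check_bits, TAPS, WIDTH))
```

The same register serves for encoding and for remainder recovery, which is the property that permits the sharing of logical elements.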
Many error correction code schemes employed within disk drives have employed a Galois field of one bit [GF(2^1)]. In a one bit implementation, the multiplier terms are therefore either present or not present, and the summing node is implemented as an Exclusive OR gate. Several prior implementations have gone to a Galois field byte basis (i.e., eight bits per symbol), as was done in the referenced U.S. Pat. No. 4,730,321 to Machado. In those prior approaches, the syndrome generator received eight bits in and put eight bits out. The referenced Machado '321 patent employed a Galois field in which the primitive element alpha^1 term is 2B (Hex) in order to reduce the number of gates needed to implement the syndrome generator circuitry.
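The difference between one bit and byte-based arithmetic may be sketched as follows. The field polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is a commonly used choice and is assumed here purely for illustration; it is not stated to be the field of the Machado '321 patent, and the function names are hypothetical.

```python
FIELD_POLY = 0x11D   # assumed field polynomial, not the Machado '321 field


def gf_add(a, b):
    """Addition in GF(2) and in GF(2^8) alike is a bitwise Exclusive OR."""
    return a ^ b


def gf2_mul(a, b):
    """In GF(2) a multiplier term is either present or not present."""
    return a & b


def gf256_mul(a, b):
    """Shift-and-add multiplication of two eight bit symbols in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:        # reduce whenever the product overflows eight bits
            a ^= FIELD_POLY
        b >>= 1
    return result


print(hex(gf256_mul(0x02, 0x80)))   # x * x^7 reduced by the field polynomial
```

A byte-based syndrome generator applies such a multiplication to each eight bit symbol, where a one bit design needs only the presence or absence of a tap.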
The general problem with a classic error correction approach, whether bit-input/bit-output or byte-input/byte-output, is that when one or more errors are detected within a particular block, the system data stream has to be shut down during the interval required to determine the location and correct the burst error or errors within the block determined to include the error.
Particularly in disk drive data storage subsystems, it is now commonplace to encounter one or more programmed monolithic digital microcontrollers. These microcontrol elements are very powerful in carrying out operations and calculations appropriate to supervise data transducer head positioning and track following operations, and to supervise transfers of data, commands and status values with the host computing system, on an as-needed basis.
Since errors and consequent error correction processes occur relatively infrequently in well designed and manufactured disk drives, the calculational power of the microcontroller is best adapted to the task of performing error location and correction operations on an as-needed, interrupt basis. For example, if data blocks are being read from the disk surface, reading of subsequent blocks ceases during the error location and correction process. As was the case in the referenced Machado '321 patent, after the data stream is shut down, the disk drive microcontroller obtains the ECC remainder for the data block having the error. Based upon the remainder, the microcontroller then calculates values for locating and correcting the error. Erroneous data values are then replaced by corrected data values inserted in the data block in a buffer memory. After correction, the data block is then available to be sent out to the host computer, and at that point the data stream may be permitted to resume its flow of data blocks.
In summary, the prior approaches, whether bit-based or byte-based, perform three common functions. First, they read the incoming data and compare the recovered remainder values with zero. Second, if there is an error, they shut down the data stream and may perform a retry, e.g. by rereading the data and resending it through the syndrome generator in an attempt to eliminate non-repeating or "soft" errors. Third, if the retry does not succeed, i.e., a hard data error is present, they perform an error correction operation. If the operation is determined to be successful, the data stream is thereafter reinitiated. If the error is determined to be uncorrectable, an error message is sent to the host computing system or communications system, and the data transfer process is stopped indefinitely.
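The sequence summarized above may be sketched as a control flow. The callables `read_block`, `compute_remainder` and `try_correct`, the status strings, and the `max_retries` parameter are all hypothetical stand-ins for hardware and firmware functions and are not taken from any referenced patent.

```python
def transfer_block(read_block, compute_remainder, try_correct, max_retries=1):
    """Prior-art style handling of one data block.

    read_block()            -> block as read from the medium
    compute_remainder(blk)  -> sequence of remainder values
    try_correct(blk, rem)   -> corrected block, or None if uncorrectable
    """
    block = read_block()
    remainder = compute_remainder(block)
    if not any(remainder):
        return "ok", block                      # data stream keeps flowing

    # From this point on the data stream is shut down.
    for _ in range(max_retries):
        block = read_block()                    # reread to clear a soft error
        remainder = compute_remainder(block)
        if not any(remainder):
            return "recovered_by_retry", block

    corrected = try_correct(block, remainder)   # hard error: attempt correction
    if corrected is not None:
        return "corrected", corrected           # stream may now be reinitiated
    return "uncorrectable", None                # report the error to the host
```

In every branch other than the first, the flow of subsequent blocks waits until the function returns.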
There have been several prior efforts directed to performing very rapid error correction. One architecture, believed to be embodied in a special hardware chip having many thousands of gates, is described in U.S. Pat. No. 4,782,490 to Tenengolts, the disclosure of which is incorporated by reference. While Tenengolts speaks of embodiments employing both hardware and firmware to implement high speed error correction techniques, no practical guidance is provided for implementation of on-the-fly error correction. For example, in the Tenengolts patent FIG. 8 "error trapping" embodiment, the same shift register used to detect errors is also used, after an error is detected, for the error trapping process, so that the shift register is not available for the next block of incoming data in real time.
As used herein, the expression "on-the-fly" means an error correction process which is carried out with minimized data flow interruption, and which does not require one or more disk rotation latencies (revolutions) for carrying out the correction process. In order to perform ECC "on-the-fly", it is necessary to detect and correct the data errors, and to do so in a manner which does not stop the flow of data blocks during a typical transfer of multiple blocks. Heretofore, these two requirements have been conceptually lumped together. Actually, the present inventors have discovered that these two requirements are separate and separable aspects of the overall problem of error correction.
A syndrome generator is typically provided within a hardware implementation for error correction. This syndrome generator operates on-the-fly upon the data stream in order to generate an ECC syndrome for each predetermined data block. Once the syndrome is generated, it is typically appended at the end of the data block during the recording/ECC encoding process.
As taught by the referenced Machado '321 patent, the syndrome generator circuitry may also be used as a remainder recovery circuit during data block readback from the storage surface. The syndrome remainder bytes regenerated during readback from the data block or subsets coming e.g. from the disk surface are compared to zero by a hardware comparator. If the remainder bytes are not zero, an error is present in the readback data block or interleave thereof. In the Machado '321 patent approach, the data stream is then stopped, and the non-zero remainder bytes held in the syndrome generator shift registers are then passed via a bus structure to the microcontroller. The microcontroller then executes firmware routines to locate and correct the error or errors. The prior Machado '321 patent approach, while employing the on-board digital microcontroller to carry out error correction processes, did not do so on-the-fly. Detection of any error caused data flow to stop.
Another typical approach to performing on-the-fly error correction has been to provide one or more dedicated hardware processors whose task it is to calculate the correction of errors and to insert corrected data for erroneous data into one or more appropriate positions of the data block undergoing correction. Typically associated with dedicated hardware processors is a dedicated FIFO or buffer memory, typically four kilobits in storage capacity (512 bytes by 8 bits per byte).
An example of a hardware-based ECC architecture is given in the referenced Tenengolts U.S. Pat. No. 4,782,490. As best understood, Tenengolts provides a very general approach to error correction and cross checking in conjunction with FIG. 2, and no mention or discussion is provided concerning error correction on-the-fly. However, in conjunction with his FIG. 8 embodiment, the syndrome that detected the burst error is shifted through the original ECC recovery circuit's shift registers until the error is located. The number of serial shifts required until zero value residue bytes are encountered is counted. This count then provides the error burst location, while residue bytes immediately following the zero value residue bytes provide a basis for determining the corrected data values to be inserted in the data stream. While Tenengolts proposes clocking the ECC shift registers at a four-times clock rate to permit "a single burst correction to be performed in the record transfer time" (Tenengolts Column 15, lines 47-48), it is not clear how such a procedure would enable the next contiguous block of data to be passed through the shift registers in real time (i.e. on-the-fly), unless considerable gaps existed between the data blocks, or a block transfer delay was actually programmed into the data stream with a FIFO block buffer, etc. While the Tenengolts FIG. 2 approach is stated to be very fast, it is also admittedly complex. The FIG. 8 approach is said to be slower, but sufficiently fast, even though the decoding process is said to be implemented in firmware. While a firmware implementation is mentioned, no details, such as a firmware code listing, are given in the Tenengolts patent. The Tenengolts patent also describes cross checking and ID field (header) error detecting techniques within the constraints of the hardware implementation given in FIG. 2, for example.
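The general principle of error trapping by counted shifts may be sketched with a small binary cyclic code. The sketch is not the Tenengolts circuit, which operates on residue bytes; it assumes a (7,3) cyclic code with generator x^4 + x^3 + x^2 + 1 that corrects a single burst of up to two bits, and all names and constants are chosen for the example. Polynomials are held as integers, bit k being the coefficient of x^k.

```python
N = 7            # block length in bits (assumed)
GEN = 0b11101    # x^4 + x^3 + x^2 + 1 (assumed generator)
DEG = 4          # degree of the generator
BURST = 2        # longest burst this code can trap


def poly_mod(value, gen=GEN):
    """Remainder of `value` divided by `gen` over GF(2)."""
    deg = gen.bit_length() - 1
    while value.bit_length() > deg:
        value ^= gen << (value.bit_length() - 1 - deg)
    return value


def encode(message):
    """Multiply a three bit message by the generator to form a code word."""
    word = 0
    for k in range(N - DEG):
        if message >> k & 1:
            word ^= GEN << k
    return word


def trap_burst(received):
    """Return (corrected word, (burst start, burst pattern)).

    The burst field is None when no error is present; both fields are
    None when the error cannot be trapped."""
    syndrome = poly_mod(received)
    if syndrome == 0:
        return received, None
    for shifts in range(N):
        if syndrome < (1 << BURST):        # high order bits zero: trapped
            start = (N - shifts) % N       # the shift count gives the location
            error = 0
            for k in range(BURST):
                if syndrome >> k & 1:      # remaining bits give the pattern
                    error |= 1 << ((start + k) % N)
            return received ^ error, (start, syndrome)
        syndrome = poly_mod(syndrome << 1) # one serial shift of the register
    return None, None


word = encode(0b101)
damaged = word ^ (0b11 << 3)               # two bit burst starting at bit 3
print(trap_burst(damaged) == (word, (3, 0b11)))
```

While the shift loop runs, the register holding the syndrome is occupied, so that a following block cannot be clocked through the same register at the same time.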
Another high speed ECC architecture, involving dedicated parallel processors operating in pipeline manner to avoid inter-step transfers, is described in U.S. Pat. No. 4,567,594 to Deodhar, the disclosure of which is also incorporated by reference. The technique described in the Deodhar '594 patent divides the Reed-Solomon decoding process into a sequence of steps which are carried out by the pipeline dedicated processors with a minimum of inter-step parameter transfers.
A more enlightened prior art approach to error correction makes use of the data block buffer associated with the data controller function. This approach uses a DMA channel in order to access and control the data block buffer. The ECC hardware generates a DMA request to the block buffer thereby to gain access to the erroneous data cells which are to be corrected. This approach recognizes the advantage of using the already existent data block buffer memory for the error correction process. This is the approach followed in the referenced, commonly assigned U.S. Pat. No. 4,730,321 to Machado. However, this prior approach did not achieve "on-the-fly" error correction.
In addition to the inability to perform single burst error correction on-the-fly, another drawback of the approach described in the referenced Machado '321 patent arises from the probability of misdetection and consequent miscorrection of data in the double burst correction mode, essentially limiting its utility to single burst correction. While the occurrence of random errors within the correction capability of a particular Reed-Solomon error correction algorithm will result in a zero miscorrection probability, if the number of random errors exceeds that correction capability, the miscorrection probability becomes quite high, approximately 0.233. This means that almost one quarter of the errors statistically occurring above the correction capability will be misdetected and miscorrected. Known prior techniques which have been employed to reduce the misdetection probability have included cross checks, such as including check sums within the data block. However, such cross checks have not been particularly powerful, nor have they been able to detect shuffling or rearrangement of data values within the data block.