This invention relates to a device for detecting and correcting hardware errors in the storage and recovery of data. More specifically, it relates to the design of an error correction system for real-time, fault-tolerant error detection and correction in a disc memory system.
Generally, error correction systems add redundant information in accordance with an error correction code, or ECC, to each data block which is written on a disc. This redundant information is then used to determine the integrity of the data block when it is read from the disc at a later time. The use of an ECC makes it possible to detect, locate, and correct errors in a data block, provided that the extent of the error is within the capabilities of the particular ECC. Many such systems have been developed in the prior art. See, for example, R. T. Chien, "Burst-Correcting Codes with High-Speed Decoding", IEEE Trans. on Information Theory, Vol. IT-15, No. 1, pp. 109-112, January 1969, and W. W. Peterson and E. J. Weldon, Error-Correcting Codes, 2nd Ed., MIT Press, 1972.
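The detect, locate, and correct sequence described above can be illustrated with a small sketch. The text does not name a particular code, so the example assumes a Hamming(7,4) code, which corrects any single-bit error in a 7-bit codeword; the function names are hypothetical and chosen only for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4.

    Assumed example code (Hamming(7,4)); the source names no specific ECC.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_syndrome(codeword):
    """XOR together the 1-based positions of all set bits.

    The result is 0 for a valid codeword; for a single-bit error it
    equals the position of the flipped bit.
    """
    syndrome = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= position
    return syndrome


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return (data bits, syndrome)."""
    word = list(codeword)
    syndrome = hamming74_syndrome(word)
    if syndrome:                      # detect: a nonzero syndrome signals an error
        word[syndrome - 1] ^= 1       # locate and correct: flip the indicated bit
    return [word[2], word[4], word[5], word[6]], syndrome


if __name__ == "__main__":
    block = [1, 0, 1, 1]
    stored = hamming74_encode(block)
    stored[4] ^= 1                    # simulate a single-bit fault at position 5
    recovered, syndrome = hamming74_decode(stored)
    print(recovered == block, syndrome)
```

A two-bit error exceeds the capability of this particular code: the syndrome is still nonzero, but it points at a third bit, so the decoder returns incorrect data. This is the sense in which correction is possible only when the extent of the error is within the capabilities of the ECC.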
In some error correction schemes, the computer's central processing unit (CPU) or a separate microprocessor is used to correct errors. In either case, detection of an error interrupts real-time data flow while the processor corrects the error. Furthermore, although the correction itself may take only a short time, the processor falls out of synchronism with the data, halting data flow until the disc makes a complete revolution and arrives again at the next data block. This results in a considerable delay before data flow can be re-established.
Another solution to the problem of interrupting the flow of data to the CPU involves implementing a microprocessor outside of the CPU and providing sufficient memory to retain several blocks of data. Typically, a linear feedback shift register (LFSR) is used to calculate an "error syndrome", i.e., the numerical result of the ECC, as a data block is read from the disc into a first buffer memory. Once an error syndrome is calculated, it is then passed to the microprocessor which determines the error pattern and error location and does the correction in local memory. While the microprocessor is correcting errors in this particular data block, a second data block is entering a second buffer memory and the LFSR is calculating its syndrome. Finally, the original data block in the first buffer is corrected by the microprocessor, and the corrected data is sent to an output buffer and then to the CPU. During this latter stage, the syndrome for a third data block is being calculated by the LFSR and corrections based on the syndrome for the second data block are being implemented in the microprocessor. In this way, the CPU receives an uninterrupted stream of data.
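The syndrome calculation performed by the LFSR can be sketched in software as a bit-serial polynomial division. The generator polynomial below (x^8 + x^2 + x + 1) and the function names are assumptions made for illustration only; the text does not specify the code used, and the sketch covers only syndrome generation, not the determination of error pattern and location carried out by the microprocessor.

```python
GEN_POLY = 0x07   # assumed generator x^8 + x^2 + x + 1, leading term implied
DEGREE = 8        # number of check bits, i.e. the length of the shift register


def lfsr_remainder(bits, poly=GEN_POLY, degree=DEGREE):
    """Shift bits serially through an LFSR and return the final register value.

    The register ends up holding the remainder of bits(x) * x^degree
    divided by the generator polynomial.
    """
    register = 0
    mask = (1 << degree) - 1
    for bit in bits:
        feedback = ((register >> (degree - 1)) & 1) ^ bit
        register = (register << 1) & mask
        if feedback:
            register ^= poly          # apply the feedback taps
    return register


def encode_block(data_bits):
    """Append the check bits (most significant first) to the data bits."""
    check = lfsr_remainder(data_bits)
    check_bits = [(check >> i) & 1 for i in range(DEGREE - 1, -1, -1)]
    return list(data_bits) + check_bits


def syndrome(block_bits):
    """Syndrome of a block as read back: zero means no error was detected."""
    return lfsr_remainder(block_bits)


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
    written = encode_block(data)
    read_back = list(written)
    read_back[5] ^= 1                 # simulate a read error in one bit
    print(syndrome(written) == 0, syndrome(read_back) != 0)
```

Because the register is updated one bit at a time as the block streams past, the syndrome is available as soon as the last bit of the block has been read, which is what allows the syndrome for one block to be computed while an earlier block is being corrected.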
A significant problem with this solution is the hardware complexity and the associated costs of implementing such a combination of buffer memory, microprocessor, concomitant timing and control circuits, and interface elements. In addition, there is still a significant "pipeline" delay as the initial block of data is shifted serially through as many as three data blocks of memory.
Another major problem with existing error correction systems is assuring the fault tolerance of the error correction system as it encodes data written on the disc. A failed ECC encoding circuit can potentially corrupt the entire storage area. Moreover, the time required to discover the failure is unbounded, since an encoding fault may not be discovered until the data is decoded. In a typical system, backup data remains available only if the disc can notify the system of an encoding failure immediately, before an alternate copy of the data is destroyed. This type of fault tolerance problem is typically overcome by double or triple redundancy in the encoding hardware. The obvious disadvantage of this approach is added complexity and cost.
Heretofore, no simple, inexpensive solution to the problem of real-time fault-tolerant error correction has been developed. Furthermore, prior implementations of circuits used for both decoding and redundant encoding have not made optimum use of all available hardware.