1. Field of the Invention
The present invention relates generally to data processing system storage techniques and more particularly relates to memory storage systems having provisions for error detection and correction.
2. Description of the Prior Art
Errors within data processing equipment tend to occur from both transient causes and permanent failures. Because of the predominantly digital nature of the data processing system, such errors must be monitored to provide accurate and verifiable results. Some systems, such as described in U.S. Pat. No. 4,410,942, issued to Milligan et al., deal with this concern by detecting errors so that the affected process may be repeated, hopefully without error. U.S. Pat. Nos. 4,139,148 and 4,163,147, both issued to Scheuneman et al. and incorporated herein by reference, teach memory systems wherein errors may both be detected and corrected to prevent the need to repeat the process. In these systems, single bit errors may be corrected and double bit errors detected.
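By way of illustration only, single bit error correction with double bit error detection can be sketched with a textbook extended Hamming code. The sketch below is not taken from any of the cited patents; the 4-bit data width, the bit layout, and the names `encode` and `decode` are assumptions chosen for brevity.

```python
# Illustrative sketch only: an extended Hamming (8,4) code, a textbook
# example of single-error-correcting, double-error-detecting coding.
# The layout and function names are assumptions, not the cited patents' design.

def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values).

    Index 0 holds an overall parity bit; indices 1..7 are the usual
    Hamming positions, with check bits at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each check bit p makes the parity over all positions containing p even.
    for p in (1, 2, 4):
        for i in range(1, 8):
            if (i & p) and i != p:
                code[p] ^= code[i]
    # The overall parity bit makes the parity of the whole codeword even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid codeword and equals the error position for a single flip.
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall = sum(code) % 2  # 1 means an odd number of bits were flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single bit error: flip the indicated bit (index 0 when syndrome is 0).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits are in error.
        return "double", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch a single flipped bit is located by the syndrome and repaired, whereas two flipped bits leave the overall parity even while the syndrome is non-zero, so the word can only be flagged as bad.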
As non-volatile core memory gave way to the newer volatile semiconductor technologies, error detection and correction for memory storage systems became essential. U.S. Pat. Nos. 4,058,851 and 4,112,502, both issued to Scheuneman, describe ways of minimizing the memory access time penalties associated with such error detection and correction.
In addition to error control for memory systems, U.S. Pat. Nos. 4,652,993 and 4,962,501, issued to Scheuneman et al. and Byers et al. respectively, teach techniques for control of errors occurring in transfers within a bussed architecture. U.S. Pat. No. 4,757,440, issued to Scheuneman, and U.S. Pat. Nos. 4,697,233 and 4,600,986, issued to Scheuneman et al., are directed to error control for both data and addressing of small temporary memory stacks.
The physical characteristics of the storage or transfer device undergoing error control most often determine the extent and the nature of the error detection and/or error correction method. U.S. Pat. No. 4,644,545, issued to Gershenson, proposes a special purpose error coding scheme especially adapted to disk systems. A tape system employing complete redundancy is suggested in U.S. Pat. No. 4,772,963, issued to Van Lahr et al. U.S. Pat. No. 4,745,605, issued to Goldman et al., shows error detection and classification of microcode control words. Memory module backup is provided in U.S. Pat. No. 4,849,978, issued to Dishon et al.
An early form of error control is through the use of complete redundancy. U.S. Pat. No. 4,228,496, issued to Katzman et al.; U.S. Pat. No. 5,099,485, issued to Bruckert et al.; and U.S. Pat. No. 4,942,575, issued to Earnshaw et al., show examples of memory systems employing complete redundancy. Except for certain specialized applications in the military and aerospace fields, such complete redundancy is seldom cost effective. In fact, for most systems, complete redundancy is less effective than techniques which are far less expensive.
One method of enhancing overall system reliability which employs less than complete redundancy is through the use of a number of smaller modules combined to perform a larger function. In this manner, failure of a given module causes a reconfiguration resulting in diminished capacity but not loss of the entire resource. U.S. Pat. No. 4,772,085, issued to Flora et al., shows a storage subsystem utilizing a number of small disk drives to produce an effectively large storage capacity. An archival storage unit with fault tolerance is shown in U.S. Pat. No. 3,876,978, issued to Bossen et al.
As memory element technology has developed, the modularized approach has become the architectural standard. U.S. Pat. No. 5,117,428, issued to Jeppesen, III et al., teaches a semiconductor memory subsystem which utilizes the modularity to provide expansion in both horizontal and vertical dimensions. Implementing modularized semiconductor memories offers the opportunity to provide on-chip error detection and correction as taught by Leslie in U.S. Pat. Nos. 4,739,504 and 4,739,585. This is also used in U.S. Pat. No. 4,993,028, issued to Hillis. System level implementation of large scale semiconductor memories is taught in U.S. Pat. No. 4,633,434, issued to Scheuneman, and U.S. Pat. No. 5,060,145, issued to Scheuneman et al., both incorporated herein by reference.
An addressing scheme employing error checking for such a large scale memory is taught in U.S. Pat. No. 4,727,510, issued to Scheuneman et al. Error correction of the address word is also provided in U.S. Pat. No. 4,092,713, issued to Scheuneman. Correction of the address word is shown in U.S. Pat. No. 4,918,695 and U.S. Pat. No. 4,926,426, both issued to Scheuneman et al. U.S. Pat. No. 4,649,475, issued to Scheuneman and U.S. Pat. No. 4,918,696, issued to Purdham et al. show protection from control information failures.
Arrangement of the modules within the memory may have an impact upon the failure tolerance of the system. U.S. Pat. No. 5,128,941, issued to Russell, shows a memory system in which the module addressing is irregular. Effectiveness may also be enhanced through the use of multiple error control schemes. A technique employing both vertical and horizontal parity checking is suggested by U.S. Pat. No. 5,103,424, issued to Wade. U.S. Pat. No. 4,531,213, issued to Scheuneman, teaches embedding a first level error check within a second level check.
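By way of illustration only, combined horizontal and vertical parity checking can be sketched in its generic textbook form as follows. The sketch is not taken from the cited patent; the block layout (a list of equal-length rows of bits) and the names `parity`, `compute_parity` and `locate_single_error` are assumptions.

```python
# Illustrative sketch only: generic two-dimensional (row and column) parity.
# The data layout and function names are assumptions, not the cited design.

def parity(bits):
    """Return the even-parity bit for a sequence of 0/1 values."""
    return sum(bits) % 2


def compute_parity(block):
    """Return (row_parity, column_parity) for a block of equal-length rows."""
    rows = [parity(row) for row in block]          # horizontal parity
    cols = [parity(col) for col in zip(*block)]    # vertical parity
    return rows, cols


def locate_single_error(block, rows, cols):
    """Return (row, column) of a single flipped data bit, or None if clean.

    Raises ValueError when the mismatch pattern is not that of a single
    data bit error, in which case this simple scheme cannot correct it.
    """
    bad_rows = [i for i, row in enumerate(block) if parity(row) != rows[i]]
    bad_cols = [j for j, col in enumerate(zip(*block)) if parity(col) != cols[j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data bit error")
```

A single flipped data bit disturbs exactly one row parity and one column parity, so their intersection identifies the bit to invert. Two flipped bits disturb two rows, two columns, or both, which this sketch reports as uncorrectable.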
Though the prior art has many examples of efficient single bit error correction/double bit error detection, correction of multiple bit errors continues to be particularly troublesome. The most common methods of dealing with multiple bit errors in the prior art involve a loss of data. Those prior art systems which attempt to correct multiple bit errors without data loss tend to require substantial amounts of additional hardware.