The present invention relates to a method and apparatus for error correction in the read/write cycles of plural memories in a parallel processor data processing system.
With the increasing size of random access memory utilized in association with a digital processor as part of an overall computation system, it has become necessary to provide for the correction of errors occurring in the read/write operations to such large memories. A simple form of error correction involves the use of a parity bit, an auxiliary digital bit in a multibit word which is either a binary one or a zero based upon a function of the bit values in each bit position of the data word. The parity bit is recorded with the word as written to memory, and the retrieved parity bit is compared with a parity bit reconstructed from the data word as read. A difference between the two indicates that the digital word as read differs from the digital word as written in a single bit or, more generally, in an odd number of bits. Typically the probabilities of error were sufficiently low that the likelihood of an error in more than a single bit was insignificant. The parity bit could not determine where the error existed, but it would alert the computer system to the presence of an error so that auxiliary corrective steps, such as a second attempt to read the data correctly, could be taken.
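The parity scheme described above can be illustrated with a minimal sketch. It assumes even parity and a sixteen bit word; the function names `parity_bit`, `write_word`, and `read_has_error` are hypothetical and chosen only for illustration.

```python
def parity_bit(word: int, width: int = 16) -> int:
    """Even-parity bit: 1 if the word holds an odd number of 1 bits, else 0."""
    mask = (1 << width) - 1
    return bin(word & mask).count("1") % 2


def write_word(word: int) -> tuple[int, int]:
    """Model a memory write: store the word together with its parity bit."""
    return word, parity_bit(word)


def read_has_error(word_read: int, stored_parity: int) -> bool:
    """Model a memory read: recompute parity and compare with the stored bit."""
    return parity_bit(word_read) != stored_parity


stored, parity = write_word(0b1011_0010_1110_0001)
single_error = stored ^ (1 << 5)               # one bit flipped
double_error = stored ^ (1 << 5) ^ (1 << 9)    # two bits flipped

print(read_has_error(stored, parity))        # clean read
print(read_has_error(single_error, parity))  # odd number of flipped bits
print(read_has_error(double_error, parity))  # even number of flipped bits
```

The single bit flip is flagged, while the double bit flip passes the check unnoticed, which is the limitation noted above: parity detects only an odd number of bit errors and gives no indication of which bit is wrong.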
More recently, more sophisticated error correcting codes were developed according to one or another algorithm, each code being a function of all bits of each digital word. In writing and reading operations these error correcting codes are recorded in memory at the same address as the data word itself. Error detecting circuitry operates on the digital word and the error correcting code read from memory not only to identify the existence of an error but also to determine which bit is in error and to correct it. Depending upon the sophistication of the error correcting code, errors in one or more bit positions can be detected. One common methodology utilized with a sixteen bit processor and memory uses a six bit error correcting code generated, as a function of each bit in a data word, by a specific algorithm adapted to provide recognition of the most common error types.
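The text does not name the specific algorithm behind the six bit code. As one plausible illustration only, the sketch below uses an extended Hamming code, which also attaches six check bits to a sixteen bit word and can correct any single bit error and detect any double bit error. The bit layout and the names `encode` and `decode` are assumptions made for this example.

```python
# Codeword positions 0..21: position 0 is an overall parity bit, positions
# 1, 2, 4, 8, 16 are Hamming check bits, and the rest carry the 16 data bits.
CHECK_POSITIONS = [1, 2, 4, 8, 16]
DATA_POSITIONS = [p for p in range(1, 22) if p & (p - 1)]  # not powers of two


def encode(data: int) -> int:
    """Return a 22-bit codeword; bit i of the result is codeword position i."""
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code |= 1 << pos
    for c in CHECK_POSITIONS:
        parity = 0
        for pos in range(1, 22):
            if pos & c and (code >> pos) & 1:
                parity ^= 1
        if parity:
            code |= 1 << c          # make each check group even
    if bin(code).count("1") % 2:
        code |= 1                   # make the whole codeword even
    return code


def decode(code: int):
    """Return (data, status); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 22):
        if (code >> pos) & 1:
            syndrome ^= pos         # XOR of set positions is 0 when valid
    overall_odd = bin(code).count("1") % 2
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code ^= 1 << syndrome       # single error; syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity, nonzero syndrome: 2 errors
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (code >> pos) & 1:
            data |= 1 << i
    return data, status


word = 0xBEEF
codeword = encode(word)
print(decode(codeword))
print(decode(codeword ^ (1 << 7)))
print(decode(codeword ^ (1 << 3) ^ (1 << 5)))
```

In this sketch the syndrome directly names the position of a single flipped bit, so the decoder can invert it, while the overall parity bit distinguishes a correctable single error from a detectable but uncorrectable double error.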
Of course, except for complete redundancy, it is impossible to detect all errors that might occur in the read/write cycles of digital memories. It has, however, been found sufficient to utilize less than complete redundancy, such as a six bit correcting code on a sixteen bit data word, to greatly increase the probability of being able to store and retrieve data correctly from a memory despite the existence of a certain, predetermined set of errors to which such memories are typically prone.
With the advent of parallel processing digital computers, of which the above referenced U.S. Patent Applications are representative, the inherent speed limitations of serial processing of data, even by extremely large and fast processors and associated memories, are avoided by distributing the processing function among a large number of parallel processors and associated memories. Each of these may be relatively small compared to the processor and memory sizes of large computers, but when associated with each other through a hierarchical arrangement of communication networks they can effectively process vast amounts of data very rapidly.
The memory associated with a single processor in such a parallel processor arrangement may be relatively small, for example on the order of 4K bits. While it is technically feasible to design a memory of that size which, by itself, would exhibit sufficiently low fault or error rates that no error correction would be required, tens of thousands of such memories are typically employed in a parallel processor arrangement, and the likelihood of an error increases dramatically as a statistical function of the entire assemblage of memories. As a result, it becomes necessary to apply error correction to each of the thousands of such memories in a parallel processor arrangement. The cost of adding an error correcting system to each such memory greatly increases the cost of such a parallel processing system.
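The statistical effect described above can be sketched numerically. The example assumes that memories fail independently and with equal probability, and the per-memory error probability of one in a million is a hypothetical figure chosen only for illustration; neither assumption comes from the text.

```python
import math


def p_any_error(p_single: float, n_memories: int) -> float:
    """Probability that at least one of n independent memories is in error.

    Computes 1 - (1 - p)^n using log1p/expm1 for accuracy at small p.
    """
    return -math.expm1(n_memories * math.log1p(-p_single))


P_SINGLE = 1e-6  # hypothetical per-memory error probability per interval

for n in (1, 1_000, 65_536):
    print(f"{n:>6} memories: P(at least one error) = {p_any_error(P_SINGLE, n):.6f}")
```

Under these assumptions, an error probability that is negligible for one memory grows roughly in proportion to the number of memories while the product of the two remains small, which is why a per-memory error rate acceptable in isolation may not be acceptable across the whole assemblage.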