I. Field of the Invention
This invention relates generally to fail-safe computer memories. More particularly, it relates to a method and apparatus for use in a fail-safe modular memory wherein an addressing mechanism distributes checksum blocks across memory modules to provide a load balanced fail-safe memory, and wherein after failure of a memory module the addressing mechanism redistributes checksum blocks across the remaining memory modules providing a fail-safe memory that is close to being load balanced.
II. Description of the Prior Art
As the requirements for computing power and availability grow, the interconnection of a plurality of processors to support the computing and availability requirements becomes increasingly important. Typically, multiple processors are connected to a number of shared memory modules through an interconnection network. To meet the availability requirement in such a system, it is imperative that the failure of a memory module should not cause the loss of data contained in that module, and thus cause the entire system to fail.
Tolerance to module failures in a modular memory may be achieved by declaring one of the modules as a checksum module containing checksums of data in the other modules. On a module failure the lost data may be reconstructed from the remaining modules and the checksum module, assuming that not more than one memory module fails before the lost data is reconstructed. If the checksum module fails then the checksums may be reconstructed from the data in the other modules. On every update to the memory, the corresponding checksum in the checksum memory module has to be recomputed.
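The reconstruction described above can be illustrated with a minimal sketch. It assumes the checksum is a bytewise exclusive-OR (XOR) of equal-length blocks, which the text does not specify; the function names `checksum` and `reconstruct` are hypothetical and chosen only for illustration.

```python
from functools import reduce

def checksum(blocks):
    """Bytewise XOR of equal-length blocks (XOR is an assumed checksum function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, checksum_block):
    """Recover the single lost block from the survivors and the checksum block."""
    return checksum(list(surviving_blocks) + [checksum_block])

# Three data modules, each holding one two-byte block, plus one checksum block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
chk = checksum(data)

# Module 1 fails; its block is rebuilt from modules 0 and 2 and the checksum.
recovered = reconstruct([data[0], data[2]], chk)

# If the checksum module fails instead, it is rebuilt from the data alone.
rebuilt_chk = checksum(data)
```

Under this assumption the same XOR operation serves both cases: recovering a lost data block and rebuilding a lost checksum block. Recovery works only while no more than one module has failed.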
While this may be a workable scheme for a uni-processor that generates a single sequential memory request stream, the single checksum module proves to be a bottleneck for multiple processor systems that generate multiple concurrent memory write request streams at a high rate, because the checksum module is updated for every write. Thus, there is a need for a fail-safe memory that can simultaneously be accessed by a number of processors, and whose data can be reconstructed in case of failure of one module. Further, it should be easy to integrate or remove a memory module from the system.
The most relevant art relating to the solution of this problem is reviewed below.
In U.S. Pat. No. 3,876,978, Bossen et al., one disk memory stores checksums of data located in the other disk memories. Every time data is updated, its corresponding checksum in the checksum module must be updated. Therefore, the write traffic to the checksum disk is the sum of the write traffic to all the other disks, leading to a performance bottleneck at the checksum disk. The present invention eliminates precisely this bottleneck.
In U.S. Pat. No. 3,742,459, entitled `Data Processing Method and Apparatus Adapted to Sequentially Pack Error Correcting Characteristics into Memory Locations`, Looschem describes a scheme whereby a fixed size codeword is generated for every memory word. Multiple codewords are packed into a location in memory. The maximum number of codewords that can be packed into a memory location determines the amount of extra memory required to store the codewords. In contrast, the present invention allows for any number of words to share the same extra word to store redundant information for fault tolerance. Another fundamental difference between U.S. Pat. No. 3,742,459 and the present invention is the following. In U.S. Pat. No. 3,742,459, the error correcting code for a word of data is such that it can correct for a few errors in the word caused by a transient fault. If the memory module in which the data word is located fails entirely, or if more than a certain number of bits of the word are corrupted, the scheme cannot recover the data. By contrast, the present invention is devised precisely to handle the failure of an entire module.
In U.S. Pat. No. 4,459,658, entitled `Technique for Enabling Operation of a Computer System With a Consistent State of a Linked List Data Structure After a Main Memory Failure`, Gabbe and Hecht describe a scheme to recover free lists in data bases where data base recovery is implemented by shadow paging--a scheme outlined in the publication "Physical Integrity in a Large Segmented Database," by R. A. Lorie in the ACM Transactions on Database Systems, Vol. 2, No. 1, March, 1977, pp. 91-104. In this scheme every data item is duplicated in two different memory locations on two different media (main memory and disk) doubling the requirement for memory. In contrast, the present invention proposes the use of only a fraction of the original memory size for storing redundant information.
Carter et al., in U.S. Pat. No. 3,737,870 entitled `Status Switching Arrangement`, use m+s n-bit memory modules to store an encoded m×n-bit word. Bit parity is used to protect against module failures. If a module failure is detected, parity checking is turned off, since it could be violated; the bits from the remaining unfailed modules are read, a spare module is integrated into memory, parity is recomputed, and then the entire contents of the memory are re-written into the new configuration of modules (unfailed modules and spare). They describe a scheme to correct the contents of the memory upon a module failure and automatically bring in one of the s spare modules to replace the one that was just lost by failure. This scheme will exhibit a performance bottleneck at the memory for a plurality of processors, since all the modules need to be accessed at every memory request from any processor.
In U.S. Pat. No. 3,436,737, entitled `Error Correcting and Repairable Data Processing`, Pomerene et al. propose the use of s spare modules along with n memory modules to store n bits of information per word. Each bit of the word is stored in a separate memory module. The s+n bits represent the n-bit word encoded with s bits for error correction or detection. On a module failure the bits available from the remaining modules are used to reconstruct the word. Again, all the s+n memory modules need to be accessed for every memory request, preventing the servicing of a plurality of concurrent memory requests from a plurality of processors, as addressed by the present invention.
In copending application Ser. No. 068,862, filed July 2, 1987, entitled `Memory Unit Backup Using Checksum`, by Y. Dishon and C. J. Georgiou, a modular organization of memory is described, with a checksum scheme. Here all the checksums are stored in one separate checksum module. Every time any memory location is updated, the checksum corresponding to this location also has to be updated in the checksum module. The write traffic to the checksum module is the sum of the write traffic to all the data modules. Hence, if identical technology is used to design all modules, the checksum module is a bottleneck at any reasonable level of utilization. The alternative is to make the checksum module of higher grade technology and to provide additional bandwidth to it. This increases the cost of the memory.
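The traffic argument for a single dedicated checksum module can be made concrete with a short sketch. The function name `dedicated_checksum_load` and the write counts are hypothetical; the only rule taken from the text is that every write to a data module also causes one write to the checksum module.

```python
def dedicated_checksum_load(writes_per_data_module):
    """Writes seen by each module when one extra module holds all checksums.

    The last entry of the returned list is the checksum module, which
    receives one write for every write to any data module.
    """
    checksum_writes = sum(writes_per_data_module)
    return list(writes_per_data_module) + [checksum_writes]

# Hypothetical workload: four data modules, 100 writes each.
load = dedicated_checksum_load([100, 100, 100, 100])
bottleneck_ratio = load[-1] / max(load[:-1])
```

With four equally loaded data modules, the checksum module in this example absorbs four times the write traffic of any single data module, and the ratio grows with the number of data modules.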
In U.S. Pat. No. 4,092,732, entitled `System for Recovering Data Stored in Failed memory unit`, by Ouchi, the unit of access is a record. A single record is divided into a number of segments and a checksum segment is added. The record segments and the checksum are on different disk drives. For any read or write access, all of the segments are read or written. This implies that one access incurs a seek+latency+transfer time on each of the record segment drives and on the checksum drive. This would lead to a large load on the disks and poor throughput. By contrast, the unit of access in the present invention is a block of data that is located in a single memory module. A set of blocks has a common checksum located on some different memory module. Each access therefore involves only the memory module being accessed and the checksum module. This fundamental difference between the schemes leads to tremendously different performance characteristics.
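The following sketch illustrates why a block write need involve only two modules. It assumes an XOR checksum, which the text does not specify, and the names `xor_bytes` and `write_block` are hypothetical. Under that assumption the new checksum follows from the old checksum, the old data, and the new data, so the other data modules are never read.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def write_block(modules, checksum_block, target, new_data):
    """Write one block and return the updated checksum block.

    Only modules[target] and the checksum block are read or written
    (XOR checksum is an assumption made for this sketch).
    """
    delta = xor_bytes(modules[target], new_data)  # old data XOR new data
    modules[target] = new_data
    return xor_bytes(checksum_block, delta)

# Three data blocks, one per module; the checksum block is their XOR.
modules = [b"\x0f\x0f", b"\xf0\xf0", b"\x55\xaa"]
checksum_block = b"\xaa\x55"

new_checksum = write_block(modules, checksum_block, 1, b"\x00\x00")

# Cross-check against a full recomputation over all data blocks.
full = xor_bytes(xor_bytes(modules[0], modules[1]), modules[2])
```

In the record-striped scheme, by contrast, the same logical write would touch every segment drive plus the checksum drive.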