Dynamic Random Access Memory (DRAM) is used extensively in a variety of applications, especially in conjunction with digital microprocessors. In a typical configuration, several Central Processing Units (CPUs) are interfaced with a Processor and Memory Address device (PMA), as shown in FIG. 1. The PMA is interfaced with one or more Processor and Memory Data devices (PMDs). Each PMD is interfaced with a plurality of Memory Modules (MMs). The PMA arbitrates the addresses received from each CPU and directs each address to the correct PMD. The PMD receives the address and determines where within the MMs to read or write data. Each MM corresponds to a slice of data and is composed of DRAM. The PMD also performs error correction operations.
The number of DRAM chips required to provide the needed memory capacity in a multi-processor system is large, and a DRAM is more likely to fail than the other components in the system. DRAMs can have single-bit or multi-bit errors for a variety of reasons. Random single-bit errors can often be caused by radiation bombardment. Cross talk on lines connected to the DRAM may also cause errors. Further, an entire DRAM device may fail. It is therefore desirable to provide some redundant memory, coupled with error detection and correction logic, to minimize the adverse effect of errors. Preferably, an error detection and correction scheme minimizes the amount of redundant memory required while also minimizing the computational overhead required for detection and correction. Typically, an error correction scheme is employed which reduces the probability of uncorrected errors to some acceptable level.
The classical approach to detection and correction of errors is by use of an error correction code (ECC). An error correction code associated with a slice of data is stored and utilized to determine if an error has occurred in the slice and to then correct the erroneous bit. Typical ECCs provide guaranteed single bit error correction and double-bit error detection. Additionally, many multi-bit errors can be detected. The weakness of these codes is that some multi-bit errors will appear to be single-bit errors and some multi-bit errors will not be detected at all (a no-error syndrome). More elaborate codes have been created which provide better detection and correction capability. These codes further reduce the possibility of data corruption at the expense of greater computational overhead.
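The single-bit correction, double-bit detection, and multi-bit miscorrection behavior described above can be illustrated with a minimal Python sketch. It assumes an extended Hamming(8,4) code, chosen only for brevity; the text does not name a particular ECC, and a practical memory ECC would cover a much wider word. The function names `encode` and `decode`, the bit layout, and the status strings are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    bits = [0] * 8
    # Data bits occupy positions 3, 5, 6 and 7; positions 1, 2 and 4 hold parity.
    bits[3] = (nibble >> 0) & 1
    bits[5] = (nibble >> 1) & 1
    bits[6] = (nibble >> 2) & 1
    bits[7] = (nibble >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Position 0 is an overall parity bit over positions 1..7.
    for pos in range(1, 8):
        bits[0] ^= bits[pos]
    return bits


def decode(bits):
    """Return (status, data) where status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)  # work on a copy so the caller's codeword is untouched
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = 0
    for b in bits:
        overall ^= b
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: assumed to be a single-bit error.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flipped bits with a nonzero syndrome: detected only.
        status = "uncorrectable"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return status, data
```

In this sketch, flipping any one bit of a codeword is corrected and flipping any two bits is reported as uncorrectable. Flipping three bits, however, is reported as a corrected single-bit error while returning wrong data, which is the weakness noted above.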
Another solution targeted at an entire DRAM chip failure (either as a transient failure, or a permanent failure) is achieved by distributing the ECC across numerous DRAM chips so that no two bits covered by a single ECC domain are from a single DRAM chip. Thus, if the ECC code covers 64 bits of data, then all 64 bits of data are from different DRAMs. In this approach, a block of data is written to a DRAM in the memory system. Each bit of the block belongs to a different ECC domain and only one bit of each ECC may be stored on the DRAM. This approach works well in solving the problem of a single DRAM failure, but has some weaknesses. First, once a DRAM fails, any future problem (single bit or multi-bit errors) will cause the data to be non-correctable. This implies that field service personnel must quickly replace the failing DRAM component to ensure guaranteed levels of system availability. The second weakness of this approach is that since each bit of a DRAM memory line must belong to a different ECC domain, a large number of DRAMs must be addressed for error detection when a line of data is read. This results in significantly increased power consumption.
An alternative approach to error correction has been adapted from techniques used to solve disk errors. This approach is referred to as the RAID technique (Redundant Array of Independent Disks) when applied to disks and as a checksum technique when applied to memory. Checksum mechanisms employ a redundant DRAM and a checksum for data reconstruction when an error is detected. The checksum is obtained by applying the exclusive-or (XOR) operation to the data stored in a set of N DRAM blocks or MMs. The resultant checksum is then stored in a redundant MM or DRAM block, which has a capacity at least equal to the capacity of each of the other N DRAM blocks or MMs. More specifically, the data at each address, x, of each of the N MMs (or DRAM blocks) are XOR-ed to form a checksum that is stored at the corresponding address, x, of the redundant MM (or DRAM block). If an MM or DRAM block that contains data fails, then the data that was stored therein may be reconstructed by XORing the data in the remaining DRAM blocks together with the checksum stored in the redundant DRAM block. This backup operation is typically performed by the PMD.
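The per-address checksum and reconstruction described above can be sketched in Python as follows. This is an illustration only: each memory module is modeled as a list of integer words indexed by address x, and the function names `build_checksum_module` and `reconstruct_module` are hypothetical.

```python
from functools import reduce
from operator import xor


def build_checksum_module(modules):
    """XOR the words at each address x across all N data modules."""
    return [reduce(xor, words) for words in zip(*modules)]


def reconstruct_module(modules, checksum, failed):
    """Rebuild the contents of modules[failed] from the survivors and the checksum."""
    rebuilt = list(checksum)
    for index, module in enumerate(modules):
        if index == failed:
            continue  # the failed module's contents are not trusted
        for x, word in enumerate(module):
            rebuilt[x] ^= word
    return rebuilt
```

For example, with three data modules, `reconstruct_module(modules, build_checksum_module(modules), 1)` yields the original contents of the second module without reading it.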
One prior art approach stores an entire memory line into each memory module. A disadvantage of this approach is that if a DRAM block fails, the entire process must be halted until the data of the failed DRAM block is reconstructed. Another disadvantage of this method is that in order to provide uniform access across all memory modules in the system, the DRAM used to store the checksum must be rotated among all of the DRAM blocks. This results in considerable additional complexity and computational overhead. It is also noted that the full bandwidth required for cache access is demanded of each DRAM block in this prior art approach.
Therefore, it is desirable to devise apparatus and methods for reconstructing lost data in real time without having to stop a process for reconstruction of lost data, and without having to rotate the checksum storage among different modules to achieve uniform bandwidth access.
An object of the present invention is therefore to provide methods and apparatus for reliable memory which do not require halting an application in operation to reconstruct lost data. Another object of the present invention is to provide uniform access of all memory modules in the memory system without increased complexity and computational overhead.
There are multiple approaches through which a line of memory can be stored into memory modules upon which checksum operations can be performed. A prior art approach is described in U.S. Pat. No. 4,849,978, which is incorporated herein by reference, which approach stores an entire memory line into each memory module. A disadvantage of this approach is that if a DRAM block fails, the entire process must be halted until the data of the failed DRAM block is reconstructed. Another disadvantage of this method is that in order to provide uniform access across all memory modules in the system, the DRAM used to store the checksum must be rotated among all of the DRAM blocks. This results in considerable additional complexity and computational overhead. It is also noted that the full bandwidth required for cache access is demanded of each DRAM block in this prior art approach.
The following inventive approach describes a system which need not be halted for reconstruction of data in a DRAM and in which the DRAM used to store the checksum need not be rotated among all of the DRAM blocks.
The inventive approach is to store a slice of a memory line in each of N memory modules. According to one aspect of the present invention, a redundant memory slice is provided in addition to N data slices, where N is an integer. Each slice of memory may be implemented by separate DRAM chips. The redundant slice stores a checksum which may be used to reconstruct the data of any one of the N data slices. The checksum is formed by XORing the N data slices together in a bitwise fashion. Thus, bit zero of each of the N data slices is XOR-ed together to produce bit zero of the redundant slice. Similarly, bit n of the redundant slice is created by XORing bit n of each of the N data slices. The XOR logical operator has the property that XORing the checksum stored in the redundant slice with the data in N-1 of the data slices yields the data that was stored in the remaining Nth data slice.
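The XOR property stated above follows in a few steps. In the derivation below, the symbols D_1 through D_N (the data slices), C (the checksum in the redundant slice), and k (the index of the slice being reconstructed) are introduced here for illustration only. The derivation relies on XOR being commutative and associative, on a XOR a being 0, and on a XOR 0 being a.

```latex
C = D_1 \oplus D_2 \oplus \cdots \oplus D_N
\qquad\Longrightarrow\qquad
C \oplus \bigoplus_{i \ne k} D_i
  = D_k \oplus \bigoplus_{i \ne k} \left( D_i \oplus D_i \right)
  = D_k \oplus 0
  = D_k
```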
According to another aspect of the present invention, an error correction code (ECC) is provided for each slice, including the redundant slice, for single bit error detection and correction on a slice by slice basis. The ECC is also used to detect multi-bit errors occurring in a slice. If the ECC indicates a single bit error, the error is corrected. If the ECC indicates a multi-bit error, then the data for that slice is reconstructed using the checksum stored in the redundant slice.
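The per-slice decision described above can be sketched in Python as follows. The names `EccStatus`, `recover_line`, and the `ecc_check` callback are hypothetical, and the ECC itself is abstracted away as a callback that reports a status and a (possibly corrected) value for each slice. Raising an error when more than one slice reports a multi-bit error is an assumption; the text only addresses reconstruction of a single slice.

```python
from enum import Enum
from functools import reduce
from operator import xor


class EccStatus(Enum):
    NO_ERROR = 0
    SINGLE_BIT = 1   # ECC has already corrected the value it returns
    MULTI_BIT = 2    # detected but not correctable by the ECC alone


def recover_line(raw_slices, ecc_check):
    """Return the N data slices of one memory line.

    raw_slices holds the N data slices followed by the redundant slice.
    ecc_check(index, raw) is assumed to return (EccStatus, value).
    """
    values = []
    bad = []
    for index, raw in enumerate(raw_slices):
        status, value = ecc_check(index, raw)
        if status is EccStatus.MULTI_BIT:
            bad.append(index)
            values.append(None)   # placeholder until reconstruction
        else:
            values.append(value)  # clean, or single-bit error already corrected
    if len(bad) > 1:
        raise ValueError("more than one slice has a multi-bit error")
    if bad:
        # Any one slice equals the XOR of all the other slices, checksum included.
        values[bad[0]] = reduce(xor, (v for v in values if v is not None), 0)
    return values[:-1]  # drop the redundant slice
```

Because reconstruction is applied to one memory line at a time, the same function serves whether the multi-bit error is transient or the result of a failed device.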
According to another aspect of the present invention, data stored in the memory system is distributed across all of the N data slices. For example, if a memory line of data to be stored is 80 bits in length and there are 8 data slices, then ten bits of the block will be written to each data slice. This data is XOR-ed bit by bit to generate 10 bits of checksum to be stored in the redundant slice. Since all slices are accessed each cycle, uniform access of all memory modules is achieved. Since correction of multi-bit errors can be done on a memory line by memory line basis as each block is read from memory, the system need not be halted for reconstruction of the data in an entire DRAM.
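The 80-bit example above can be worked through in a short Python sketch. The assignment of contiguous 10-bit fields to slices is an assumption, since the text does not say which bits of the line go to which slice, and the names `split_line`, `join_line`, and `slice_checksum` are hypothetical.

```python
LINE_BITS = 80
NUM_SLICES = 8
SLICE_BITS = LINE_BITS // NUM_SLICES  # ten bits per data slice


def split_line(line):
    """Split an 80-bit line into 8 ten-bit slices (slice 0 holds the low bits)."""
    mask = (1 << SLICE_BITS) - 1
    return [(line >> (i * SLICE_BITS)) & mask for i in range(NUM_SLICES)]


def join_line(slices):
    """Reassemble the 80-bit line from its 8 ten-bit slices."""
    line = 0
    for i, value in enumerate(slices):
        line |= value << (i * SLICE_BITS)
    return line


def slice_checksum(slices):
    """Bitwise XOR of all data slices: the 10 bits stored in the redundant slice."""
    checksum = 0
    for value in slices:
        checksum ^= value
    return checksum
```

Every read or write of a line touches all eight data slices and the redundant slice, which is why access is uniform across the memory modules in this arrangement.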
The above-described approach protects against failures in memory module data paths to XOR gates within a PMD chip.
According to another aspect, the present invention can be adapted to small memory systems in which the cost of implementing redundant memory and checksum operations would be impractical. In this case, the present invention will still provide ECC-type error detection and correction, but the redundant memory and checksum functions may be omitted.
Note that any one of the memory modules provided in the inventive memory system may be utilized as the redundant slice. Since all slices are accessed uniformly, there is no need for rotation of the redundant slice among the slices.
Therefore, it is an advantage of the present invention that an application need not be halted for reconstruction of data in an entire DRAM.
It is a further advantage of the present invention that there is no need for rotation of the redundant slice among the slices of a memory line.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.