Dynamic random access memory (DRAM) devices constitute the main memories of most modern computers because their performance characteristics, e.g. the ability to store and access information quickly, are crucial to the efficient operation of the computer. A DRAM device is organized as one or more rectangular matrices of storage elements, each matrix being addressed in terms of rows and columns. Each storage element holds one (1) bit of data, which can be loaded into, or retrieved from, that element as required. The access time is generally the same for any bit in the DRAM regardless of its location.
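The row-and-column organization can be illustrated with a toy model. The sketch below is hypothetical: the class name `DramMatrix`, its methods, and the small geometry are assumptions for illustration, not a description of any real device interface. It shows only how a linear bit address maps to the (row, column) of a single one-bit storage element.

```python
class DramMatrix:
    """Hypothetical toy model of one DRAM matrix; each storage element holds one bit."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        # cells[row][col] is a single one-bit storage element
        self.cells = [[0] * cols for _ in range(rows)]

    def split(self, bit_addr: int) -> tuple[int, int]:
        """Map a linear bit address to the (row, column) of its storage element."""
        if not 0 <= bit_addr < self.rows * self.cols:
            raise ValueError("bit address out of range")
        return divmod(bit_addr, self.cols)

    def write(self, bit_addr: int, value: int) -> None:
        row, col = self.split(bit_addr)
        self.cells[row][col] = value & 1  # keep only one bit

    def read(self, bit_addr: int) -> int:
        row, col = self.split(bit_addr)
        return self.cells[row][col]


m = DramMatrix(rows=4, cols=8)
m.write(13, 1)
print(m.split(13), m.read(13))  # (1, 5) 1
```

In this model every location is reached by the same row/column split, which mirrors the point that access time does not depend on where a bit is stored.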
Magnetic disks have traditionally dominated as secondary mass storage devices. Data is stored on a magnetic disk in blocks, or sectors, which are the smallest units accessed in a read or write operation. The access time depends upon the location of a sector on the disk and is at least an order of magnitude longer than that of a DRAM. However, DRAM storage is more expensive than magnetic storage.
An alternative to magnetic disks for secondary storage is the solid-state disk. A solid-state disk composed of an array of DRAMs provides fast access times and high transfer rates for block transfers to and from memory. Because the solid-state disk is accessed in blocks, error correction can be spread over more than one word, which translates into greater error recovery. This greater error recovery, provided by powerful error correction codes (ECC) such as Reed-Solomon codes, allows the "disk" to be constructed from less reliable, and thus less expensive, DRAMs without sacrificing product availability.
Reed-Solomon codes provide effective correction for the types of errors experienced on secondary storage media. Prior to storing data on a disk, a block of data is encoded into error correction code symbols consisting of data symbols and check symbols. More specifically, the check symbols are appended to the data symbols, and the resulting "code block" is stored. When the code block is retrieved, the check symbols are used to detect and correct errors in the data symbols. Reed-Solomon codes correct errors on a symbol basis.
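The append-then-check scheme can be sketched with a systematic Reed-Solomon encoder. This is a minimal sketch of the encoding and error-detection side only (no decoder), and it assumes 8-bit symbols over GF(2^8) with the primitive polynomial 0x11D for brevity; the arrangement discussed later uses 10-bit symbols, which would need GF(2^10). The function names `rs_encode` and `syndromes` are hypothetical.

```python
PRIM = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a primitive polynomial for GF(2^8)

# Exponent/logarithm tables for GF(2^8); EXP is doubled to avoid a modulo in gf_mul.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= PRIM
for i in range(255, 512):
    EXP[i] = EXP[i - 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    # Polynomials are coefficient lists, highest degree first; addition is XOR.
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] ^= gf_mul(a, b)
    return r


def generator(nsym):
    # g(x) = (x + a^0)(x + a^1)...(x + a^(nsym-1)); minus equals plus in GF(2^m).
    g = [1]
    for i in range(nsym):
        g = poly_mul(g, [1, EXP[i]])
    return g


def rs_encode(data, nsym):
    """Return data symbols followed by nsym appended check symbols."""
    g = generator(nsym)
    buf = list(data) + [0] * nsym
    # Long division of data(x) * x^nsym by the monic g(x); the remainder is the check part.
    for i in range(len(data)):
        coef = buf[i]
        if coef != 0:
            for j in range(1, len(g)):
                buf[i + j] ^= gf_mul(g[j], coef)
    return list(data) + buf[len(data):]


def syndromes(code, nsym):
    """Evaluate the received code block at a^0..a^(nsym-1) using Horner's rule."""
    out = []
    for j in range(nsym):
        s = 0
        for c in code:
            s = gf_mul(s, EXP[j]) ^ c
        out.append(s)
    return out


code = rs_encode([1, 2, 3, 4, 5], 4)
print(syndromes(code, 4))  # all zero: no error detected
bad = code.copy()
bad[2] ^= 0x5A             # corrupt one data symbol
print(any(syndromes(bad, 4)))  # True: error detected
```

Because the encoder is systematic, the data symbols pass through unchanged and only the check symbols are new; a valid code block evaluates to zero at every root of g(x), so any nonzero syndrome signals an error.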
Less reliable DRAMs may be purchased at a considerable price advantage because they have internal defects that are poorly defined; that is, certain groups of DRAMs that have failed the manufacturer's tests are classified as "partially bad" devices. The defects in these devices may be random, or they may be correlated, i.e. located in the same positions in the respective devices of an entire batch. The defects manifest themselves as erroneous data values recorded in the storage elements. Unfortunately, when the defects are correlated, the number of erroneous symbols in a block of data can exceed the correction capability of a reasonable number of check symbols.
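The difference between random and correlated defects can be made concrete with a little arithmetic. The sketch below rests on stated assumptions: a hypothetical layout in which a 512-symbol code block is spread over 32 devices (16 symbols each) in the same row index of every device, one defective row per device, and a hypothetical budget of 32 check symbols. It uses the standard fact that a Reed-Solomon code with r check symbols corrects up to r // 2 symbol errors at unknown locations.

```python
# Hypothetical layout: a 512-symbol code block spread over 32 devices,
# 16 symbols per device, all stored in the same row index of each device.
DEVICES = 32
SYMBOLS_PER_DEVICE = 16
CHECK_SYMBOLS = 32  # hypothetical ECC budget per code block


def symbol_errors(block_row, defective_rows):
    """Erroneous symbols in a block stored in `block_row` of every device.

    defective_rows[d] is the single defective row of device d (an assumption).
    """
    return sum(SYMBOLS_PER_DEVICE for row in defective_rows if row == block_row)


def correctable(errors, check_symbols=CHECK_SYMBOLS):
    # A Reed-Solomon code with r check symbols corrects up to r // 2 symbol errors.
    return errors <= check_symbols // 2


# Random defects: bad rows differ between devices; only device 0 has row 100 bad.
random_defects = [100 + 3 * d for d in range(DEVICES)]
# Correlated defects: row 100 is bad in every device of the batch.
correlated_defects = [100] * DEVICES

for name, defects in (("random", random_defects), ("correlated", correlated_defects)):
    errs = symbol_errors(100, defects)
    print(name, errs, correctable(errs))
# random 16 True
# correlated 512 False
```

With scattered defects only one device contributes errors, which stays within the budget; when the same row fails everywhere, every symbol in the block is hit.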
One approach to organizing data in a memory to optimize the memory's tolerance for system faults is described in patent application Ser. No. 376,357, filed Jul. 6, 1989, titled FAULT TOLERANT MEMORY, by Francis Reiff and assigned to the assignee of this invention. As described therein, the 10-bit symbols in each block of data are organized vertically in the RAM devices, so that the same bit position in ten different bytes of a given RAM device holds the ten bits of a symbol. With this technique, failure of one of the data lines connected to the memory affects only 32 of the 512 symbols in a code block, and the resulting errors are therefore correctable by the ECC. Moreover, each block is distributed among 32 RAM devices, and failure of two RAM devices causes errors in only 32 symbols, again correctable by the ECC. The disclosed method does not, however, address data errors resulting from a correlated defect, such as an entire defective row, in each device of an array of memory devices. In that case, a very large number of symbols in a code block would be in error, resulting in an uncorrectable set of errors.
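The effect of the vertical organization can be sketched by counting which symbols touch a failed data line or a defective row. This is one possible reading of the layout, not the referenced application's actual mapping: the 8-bit device width, the grouping of eight symbols per ten bytes, and all function names are assumptions, so the counts below are not the figures quoted above. The packed ("horizontal") layout is included only for comparison.

```python
SYMBOL_BITS = 10
BYTE_BITS = 8  # hypothetical 8-bit-wide RAM device


def vertical_bits(symbol):
    """(byte, bit) cells of a symbol in the vertical layout (assumed mapping):
    symbols come in groups of 8 per 10 bytes, and symbol s uses bit s % 8
    of each of the 10 bytes in group s // 8."""
    group, bit = divmod(symbol, BYTE_BITS)
    return [(group * SYMBOL_BITS + i, bit) for i in range(SYMBOL_BITS)]


def horizontal_bits(symbol):
    """Packed layout for comparison: symbol s occupies bits 10*s .. 10*s+9 of a bit stream."""
    return [divmod(symbol * SYMBOL_BITS + i, BYTE_BITS) for i in range(SYMBOL_BITS)]


def hit_by_data_line(layout, num_symbols, failed_bit):
    """Count symbols storing at least one bit on the failed data line."""
    return sum(1 for s in range(num_symbols)
               if any(bit == failed_bit for _, bit in layout(s)))


def hit_by_bad_bytes(layout, num_symbols, bad_bytes):
    """Count symbols storing at least one bit in a defective byte (e.g. a bad row)."""
    return sum(1 for s in range(num_symbols)
               if any(byte in bad_bytes for byte, _ in layout(s)))


N = 80  # 10 groups of 8 symbols, occupying bytes 0..99
print(hit_by_data_line(vertical_bits, N, 3))    # 10: only symbols using bit 3
print(hit_by_data_line(horizontal_bits, N, 3))  # 80: every packed symbol spans bit 3
# If all 100 bytes of the block sit in one defective row, every symbol is hit.
print(hit_by_bad_bytes(vertical_bits, N, set(range(100))))  # 80
```

The vertical layout confines a data-line failure to one bit position, but it does nothing against a row defect that covers every byte of the block, which is the gap noted above.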