1. Technical Field
The present invention relates in general to memory configurations for computing systems, and in particular to fault detection and correction. More specifically, the present invention relates to a method and system for providing a high performance fault tolerant memory system utilizing greater than four-bit data word memory arrays.
2. Description of the Related Art
Memory systems employed in conventional data processing systems, such as computer systems, typically include large arrays of physical memory cells that are utilized to store information in a binary manner. Generally in a conventional memory system, all of the memory cells on a memory chip are disposed in one or more memory arrays having a set number of rows and columns. Operatively, the rows are selected by row decoders that are typically located adjacent to the ends of the row lines. Each of the row lines is electrically connected to the row decoders so that the appropriate signals can be received and transmitted.
The columns of the memory array are connected to input/output (I/O) through column decode devices. In the case of dynamic random access memories (DRAMs), the memory array columns are also connected to line precharging circuits and sense amplifiers at the end of each column line to periodically sense amplify and restore data in the memory cells.
There are two kinds of errors that can typically occur in a memory system: soft errors and hard errors. A soft error is a seemingly random inversion of stored data. This inversion may be caused by occasional electrical noise, environmental conditions and, in some cases, by bombardment of radioactive particles, the so-called alpha particle event. The soft error problem has increased as the individual cell sizes of the memory arrays have been reduced, increasing their susceptibility to relatively low amounts of noise. Although soft error failure rates are generally two to three orders of magnitude higher than hard error failure rates in DRAM arrays, soft error failures typically cause only single-bit errors in memory system words. A hard error, in contrast, represents a permanent electrical failure of the memory array. It is often restricted to particular memory locations, but may also be associated with peripheral circuitry of the memory array so that the entire array is affected. Naturally, designers of memory arrays have strived to reduce the occurrence of both hard and soft errors in their memory arrays. However, neither type of error has been completely eliminated and, indeed, it is not believed that they can be eliminated. Designing to achieve high reliability beyond a certain point can be done only at the expense of reduced performance and increased cost.
One solution for detecting and correcting both hard and soft errors has been the implementation of error correction codes (ECC) in large computer memories. The fundamentals of error detecting and correcting are described by R. W. Hamming in a technical article entitled “Error Detecting and Error Correcting Codes” appearing in the Bell System Technical Journal, Volume 29, No. 2, 1950 at pages 147-160. Utilizing one of the most popular Hamming codes, an 8-bit data word is encoded to a 13-bit word. A decoder can process the 13-bit word, correct any 1-bit error in the 13 bits, and detect any 2-bit error. The described code, thus, is classified as SEC/DED (single error correct/double error detect). The use of such codes has been particularly efficient for memory arrays having single-bit outputs. For instance, if a relatively simple computer were to have 16K (16,384) bytes of data where each byte contains 8 data bits, an efficient error-protected design would utilize thirteen 16K×1 memory arrays, with the extra five 16K×1 memory arrays providing the Hamming SEC/DED protection. The Hamming code not only can correct a single-bit hard or soft random error occurring in any byte, but can also correct for any one failed 16K×1 memory array, since any one memory array contributes only 1 bit to each error-protected word.
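The 8-bit to 13-bit SEC/DED encoding described above can be sketched as follows. This is a minimal illustration using one common construction (a Hamming code over positions 1 to 12 plus an overall parity bit at position 0). The bit layout and the names `encode` and `decode` are assumptions made for this example, not the specific code used by any particular memory system.

```python
# Hypothetical bit layout: position 0 holds overall parity, positions
# 1, 2, 4, 8 hold Hamming check bits, the remaining eight hold data.
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]
PARITY_POSITIONS = [1, 2, 4, 8]


def encode(byte):
    """Encode an 8-bit value into a list of 13 bits (SEC/DED)."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p makes parity even over every position whose index has bit p set.
        code[p] = sum(code[j] for j in range(1, 13) if j & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..12
    return code


def decode(code):
    """Return (byte, status); status is 'ok', 'corrected', 'double' or 'uncorrectable'."""
    syndrome = 0
    for p in PARITY_POSITIONS:
        if sum(code[j] for j in range(1, 13) if j & p) % 2:
            syndrome |= p
    overall = sum(code) % 2  # 0 when the parity of all 13 bits is intact
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: assume one, located by the syndrome
        # (a syndrome of 0 means the overall parity bit itself flipped).
        if syndrome > 12:
            return None, "uncorrectable"
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "double"
    byte = sum(fixed[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return byte, status


word = encode(0xA5)
word[6] ^= 1                 # single-bit error
print(decode(word))          # (165, 'corrected')
word[10] ^= 1                # second error in the same word
print(decode(word))          # (None, 'double')
```

If each of the 13 bit positions is stored in a separate 16K×1 array, a completely failed array corrupts at most one bit per word, which is the single-error case the decoder corrects.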
The above-described 13-bit Hamming code can only correct one error, whether it be a hard error or a soft error. Consequently, if one memory array has suffered a hard failure in all its locations, then the remaining memory arrays are not protected against an occasional soft error: such an error could be detected, but not corrected. To be able to detect and correct more than one error, more elaborate error correcting codes have been developed and implemented. However, as a general rule, the more errors that can be corrected in a word, the more extra check bits are required by the check code.
Presently, memory arrays typically contain 256 Mbit devices and the trend is towards production of memory arrays that will contain 1 Gbit within two to four years. With the anticipated increase in memory array sizes, the present approach of utilizing 1 or 4 bit wide memory chip organization must be reconsidered. For example, employing the present 1 or 4 bit memory chip organization with a 32 bit wide data word will require 32 memory arrays (1 bit organization) or 8 memory arrays (4 bit organization). This will, in turn, result in a minimum granularity in, e.g., a personal computer (PC), of 8 GB or 2 GB, respectively. This large amount of memory in a desktop or laptop computer is excessive and also has the added disadvantage of increasing the overall cost of the computer system. In response to the minimum granularity problem, memory array manufacturers are moving to 8, 16 and even 32 bit wide memory organization schemes with the corresponding increase in the number of check bits required for error detection and correction.
Unfortunately, Hamming codes require several check bits to accomplish the error detection and correction. As discussed above, an eight-bit data word requires five check bits to detect two-bit errors and correct one-bit errors. As the bus grows wider and the number of bits of transmitted data increases, the number of check bits required also increases. Because modern memory buses are often 64 or 128 bits wide, the associated Hamming code would require substantially more check bits and increasing levels of logic circuits to implement the error correction code. Consequently, using powerful Hamming codes in large memory systems is expensive and consumes substantial memory resources. Additionally, these designs will result in slower memory access time due to the increasing number of logic circuits and logic levels necessary to implement the support and error correction code.
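The growth in check bits can be made concrete with the standard Hamming condition: a single-error-correcting code for m data bits needs the smallest r such that 2^r >= m + r + 1, and one further overall parity bit adds double-error detection. The following sketch applies that rule; the function name `secded_check_bits` is chosen for this example only.

```python
def secded_check_bits(data_bits):
    """Number of check bits for a Hamming SEC/DED code over data_bits data bits."""
    r = 1
    # Smallest r whose syndrome can name every bit position plus "no error".
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for double-error detection


for width in (8, 16, 32, 64, 128):
    print(width, secded_check_bits(width))
```

For an 8-bit word this gives the five check bits mentioned above; for 64-bit and 128-bit words it gives eight and nine, respectively. Each check bit also spans a larger share of the data bits as the word widens, which is one source of the additional logic levels.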
Accordingly, what is needed in the art is an improved error detection and correction scheme that mitigates the above-described limitations in the prior art. More particularly, what is needed in the art is a high performance fault tolerant memory system utilizing memory arrays organized to provide data words of greater than four bits.
It is therefore an object of the invention to provide an improved memory system.
It is another object of the invention to provide a fault tolerant memory system utilizing greater than four bit data word memory arrays.
To achieve the foregoing objects, and in accordance with the invention as embodied and broadly described herein, a method for providing a fault tolerant memory system having a number of memory arrays that includes at least one spare memory array and utilizing a data word organization of greater than 4 bits is disclosed. The method includes detecting a multi-bit word error in a memory array. In one advantageous embodiment, single package detect (SPD) logic, for detecting a package error of 1-4 bits, is utilized to identify the failed memory array. Next, the content of a first row of cells in the failed memory array is read and a first complement of the content is generated. Subsequently, the first complement is written back to the first row of cells in the failed memory array. A second read operation is then initiated to retrieve the first complement from the failed memory array, following which a second complement of the first complement is generated. The second complement is then written to a corresponding first row of cells in the spare memory array, and the method is repeated for all rows of cells in the failed memory array. The method further includes replacing the failed memory array with the spare memory array for future READ/WRITE operations by comparing the address of the failed memory array with the memory array address of a READ/WRITE operation and directing the READ/WRITE operation to the spare memory array in the event that the addresses are equal.
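The read, complement, write, read, complement sequence and the address-compare redirection described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the disclosed hardware: the classes `MemoryArray` and `SparingController`, the function `copy_to_spare`, the 8-bit row width, and the stuck-at fault model are all hypothetical and introduced only for this example. The interaction with the SEC/DED/SPD logic is not modeled.

```python
WIDTH = 8                      # assumed row width for the example
MASK = (1 << WIDTH) - 1


class MemoryArray:
    """Toy array; bits selected by stuck_mask always read back as stuck_value."""

    def __init__(self, rows, stuck_mask=0, stuck_value=0):
        self.cells = [0] * rows
        self.stuck_mask = stuck_mask
        self.stuck_value = stuck_value & stuck_mask

    def write(self, row, word):
        self.cells[row] = word & MASK

    def read(self, row):
        return (self.cells[row] & ~self.stuck_mask & MASK) | self.stuck_value


def copy_to_spare(failed, spare):
    """Apply READ/COMPLEMENT/WRITE/COMPLEMENT to every row of the failed array."""
    suspects = []
    for row in range(len(failed.cells)):
        first_read = failed.read(row)
        first_complement = ~first_read & MASK
        failed.write(row, first_complement)        # write complement back
        second_read = failed.read(row)
        second_complement = ~second_read & MASK
        spare.write(row, second_complement)        # store result in the spare
        # Bits that did not invert between the two reads behave as stuck.
        suspects.append(first_read ^ second_complement)
    return suspects


class SparingController:
    """Redirects accesses for the failed array address to the spare array."""

    def __init__(self, arrays, spare):
        self.arrays = arrays
        self.spare = spare
        self.failed_address = None

    def replace(self, array_address):
        suspects = copy_to_spare(self.arrays[array_address], self.spare)
        self.failed_address = array_address
        return suspects

    def _select(self, array_address):
        # Address compare: equal addresses are steered to the spare.
        if array_address == self.failed_address:
            return self.spare
        return self.arrays[array_address]

    def read(self, array_address, row):
        return self._select(array_address).read(row)

    def write(self, array_address, row, word):
        self._select(array_address).write(row, word)
```

In this simplified model, healthy bit positions survive the double complement unchanged, while a stuck position arrives in the spare as the inverse of its stuck value. A bit that was read incorrectly because of a stuck cell therefore comes out correct, whereas a bit whose stored value happened to match the stuck value does not; resolving the latter case is left to mechanisms this sketch does not cover.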
The present invention discloses a novel fault tolerant memory system utilizing SEC/DED/SPD codes, along with a READ/COMPLEMENT/WRITE/COMPLEMENT data restoration process to reconstruct and provide corrected data to data processing systems utilizing the fault tolerant memory system. The present invention further provides means for automatically replacing failed memory arrays with in-place spare memory arrays. The SEC code is utilized in the present invention primarily to correct the more frequently occurring single-bit soft errors, which, unlike hard errors, are non-repeatable. The hard errors are resolved within the context of the present invention utilizing automatic memory array sparing and data reconstruction techniques such as the READ/COMPLEMENT/WRITE/COMPLEMENT process. The present invention utilizes memory arrays organized with greater than four bits to take advantage of the memory arrays that are anticipated to be commonly available in future memory array technologies. The memory system of the present invention may be advantageously utilized in providing the cost-effective large memory subsystems required for high performance and highly reliable systems such as servers.
The foregoing description has outlined, rather broadly, preferred and alternative features of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject matter of the claims of the invention. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.