1. Technical Field
The present invention relates in general to memory configurations for computing systems, and in particular to fault detection. More specifically, the present invention relates to a method and system for detecting a hard memory array failure in a memory system.
2. Description of the Related Art
Memory systems employed in conventional data processing systems, such as computer systems, typically include large arrays of physical memory cells that are utilized to store information in a binary manner. Generally in a conventional memory system, the memory cells are arranged in an array having a set number of rows and columns. Operatively, the rows are selected by row decoders that are typically located adjacent to the ends of the row lines. Each of the row lines is electrically connected to the row decoders so that the appropriate signals can be received and transmitted. The columns of the memory array are connected to input/output (I/O) devices, such as a read/write multiplexer. In the case of dynamic random access memories (DRAMs), the memory array columns are also connected to line precharging and sense amplifier circuits at the end of each column line to periodically refresh the memory cells.
There are two kinds of errors that can typically occur in a memory system. The first is called a repeatable or "hard" error. For example, a piece of hardware that is part of the memory system, e.g., an interconnect, is broken and will consistently return incorrect results. A cell within a memory array may be stuck so that it always returns "0", for example, no matter what is written to it. Hard errors typically result from loose memory module contacts, electrically blown chips, motherboard defects or other physical problems. These types of errors are relatively easy to diagnose and correct because they are consistent and repeatable. The second kind of error is called a transient or "soft" error. This occurs when a memory cell reads back the wrong value once, but subsequently functions correctly. These error conditions are, understandably, much more difficult to diagnose and are, unfortunately, more common. A soft error may eventually repeat itself, but may surface again only after a lengthy period of time. These types of errors result, for example, from memory devices in which some cells are marginally defective, from alpha particles striking the cell area, or from other problems that are not directly related to the memory. The good news, however, is that the majority of soft errors due to alpha particles result in single-bit errors.
To keep corrupted data from being used, encoding techniques have been developed and employed in data processing systems to provide for detection and correction of errors. The simplest and most well-known error detection technique is the single-bit parity code. To implement a parity code, a single bit is appended to the end of the data word stored in memory. For even parity systems, the parity bit is assigned a value such that the total number of ones in the stored word, including the parity bit, is even. Conversely, for odd parity, the parity bit is assigned such that the total number of ones is odd. When the stored word is read, if one of the bits is erroneous, the total number of ones in the word changes, so that the parity value for the retrieved data does not match the stored parity bit. Thus, an error is detected by comparing the stored parity bit to a regenerated check bit calculated for the data word as it is retrieved from memory.
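The parity scheme described above can be illustrated with a minimal Python sketch. The function names `parity_bit` and `read_is_clean` are hypothetical and chosen only for illustration; data words are assumed to be non-negative integers.

```python
def parity_bit(data: int, even: bool = True) -> int:
    """Return the parity bit to append to a non-negative data word."""
    ones = bin(data).count("1")
    bit = ones % 2  # 1 when the data word holds an odd number of ones
    # Even parity: data plus parity bit holds an even number of ones.
    # Odd parity: the opposite bit is stored.
    return bit if even else bit ^ 1


def read_is_clean(data: int, stored_parity: int, even: bool = True) -> bool:
    """Compare the stored parity bit with a check bit regenerated on read."""
    return parity_bit(data, even) == stored_parity


word = 0b1011                      # three ones
stored = parity_bit(word)          # even parity bit for the word
corrupted = word ^ 0b0100          # one bit flipped while stored
print(stored, read_is_clean(word, stored), read_is_clean(corrupted, stored))
```

The stored parity bit travels with the word, and the check on read is a single comparison, which is why the technique is inexpensive.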
Although a single-bit parity code effectively detects single-bit read errors, the system has limits. For example, if there are two errors, the parity value for the data remains the same as the stored parity bit because the total number of ones in the word changes by an even amount and therefore stays odd or even, as before. In addition, even though an error may be detected, the single-bit parity code cannot determine which bit is erroneous, and therefore the failed memory cell cannot be identified or corrected.
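Both limits can be checked exhaustively for a small word. The following Python sketch assumes an arbitrary 8-bit example word and even parity; the names `ones_parity` and `mismatch` are hypothetical.

```python
def ones_parity(word: int) -> int:
    """Parity of the number of ones in a non-negative word."""
    return bin(word).count("1") % 2


WIDTH = 8                      # assumed word width for the example
stored = 0b10110010            # arbitrary example data word
stored_parity = ones_parity(stored)


def mismatch(retrieved: int) -> bool:
    """True when the regenerated parity differs from the stored parity."""
    return ones_parity(retrieved) != stored_parity


# Every single-bit error raises the same one-bit indication, so the
# position of the failing bit cannot be recovered from it.
single_flags = [mismatch(stored ^ (1 << i)) for i in range(WIDTH)]

# Every pair of bit errors leaves the parity unchanged and goes unnoticed.
double_flags = [
    mismatch(stored ^ (1 << i) ^ (1 << j))
    for i in range(WIDTH)
    for j in range(i + 1, WIDTH)
]

print(all(single_flags), any(double_flags))
```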
To provide error correction and more effective error detection, various error correction codes were developed which not only determine that an error has occurred, but also indicate which bit is erroneous. The most well-known error correction code is the Hamming code, which appends a series of check bits to the data word as it is stored. When the data word is read, the retrieved check bits are compared to regenerated check bits calculated for the retrieved data word. The results of the comparison indicate whether an error has occurred, and if so, which bit is erroneous. By inverting the erroneous bit, the error is corrected. In addition, a Hamming code detects two-bit errors which would escape detection under a single-bit parity system. Hamming codes can also be designed to provide for three-bit error detection and two-bit error correction, or any other number of bit errors, by appending more check bits. Thus, Hamming codes commonly provide greater error protection than simple single-bit parity checks.
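As an illustration of how check bits locate an erroneous bit, the following Python sketch implements the small Hamming(7,4) code, which protects four data bits with three check bits. It is a sketch only: the function names are hypothetical, the bit ordering (check bits at positions 1, 2 and 4) is one common convention, and the extra overall parity bit needed for two-bit error detection is omitted. The syndrome computed here is equivalent to comparing the retrieved check bits with regenerated ones.

```python
def hamming74_encode(d: list) -> list:
    """Encode data bits [d1, d2, d3, d4] as positions 1..7: p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_syndrome(c: list) -> int:
    """Return 0 for a clean word, else the 1-based position of the bad bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    return s1 + 2 * s2 + 4 * s4


def hamming74_correct(c: list) -> tuple:
    """Invert the bit named by the syndrome; return (corrected word, position)."""
    fixed = list(c)
    pos = hamming74_syndrome(fixed)
    if pos:
        fixed[pos - 1] ^= 1
    return fixed, pos


code = hamming74_encode([1, 0, 1, 1])
damaged = list(code)
damaged[4] ^= 1                      # corrupt position 5
print(hamming74_correct(damaged))    # corrected word and reported position
```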
Unfortunately, Hamming codes require several check bits to accomplish the error detection and correction. For example, an eight-bit data word requires five check bits to detect two-bit errors and correct one-bit errors. As the bus grows wider and the number of bits of transmitted data increases, the number of check bits required also increases. Because modern memory buses are often 64 or 128 bits wide, the associated Hamming code is correspondingly longer and requires additional memory space for the check bits. However, the number of check bits grows more slowly than the width of the data word, so the ratio of check bits to data bits decreases for wider data words. The check bit overhead is therefore proportionally smaller for wide data words than for narrow ones.
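The relationship between word width and check bit count can be worked out with a short Python sketch. It assumes the standard extended Hamming construction for one-bit correction and two-bit detection: the smallest r with 2^r >= m + r + 1 for m data bits, plus one overall parity bit. The function name `secded_check_bits` is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for single-error correction, double-error detection."""
    r = 1
    # r check bits must name every bit position plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for two-bit detection


for m in (8, 16, 32, 64, 128):
    k = secded_check_bits(m)
    print(f"{m:3d} data bits: {k} check bits, overhead {k / m:.1%}")
```

Under this assumption an eight-bit word needs five check bits, matching the example above, while the overhead relative to the data falls as the word widens.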
A further problem is caused by modern random access memory (RAM) chips. In early memory systems, RAM chips were organized so that each chip provided one bit of data for each address. Current RAM chips, however, are frequently organized into sets of four bits of data for each address. If one of these RAM chips fails, the result is one to four potentially erroneous data bits. Unless the error correction code is designed for four-bit error detection, a four-bit error may go completely undetected. Incorporating a four-bit error detection (four-bit package error detection) and one-bit correction code in a 64-bit or 128-bit memory system, however, would require eight or nine check bits. Currently, a large percentage of memory array production is utilized in personal computers (PCs) and it is anticipated that in the following years through 2004, PCs will typically employ 32 MB to 256 MB memory systems.
Presently, memory arrays typically contain 256 Mbit devices and the trend is towards production of memory arrays that will contain 1 Gbit within two to four years. With the anticipated increase in memory array sizes, the present approach of utilizing 1 or 4 bit wide memory chip organization must be reconsidered. For example, employing the present 1 or 4 bit memory chip organization with a 32 bit wide data word will require 32 memory arrays (1 bit organization) or 8 memory arrays (4 bit organization). This will, in turn, result in a minimum granularity in a PC of 8 GB or 4 GB, respectively. This large amount of memory in a desktop or laptop computer is excessive and unnecessary and also has the added disadvantage of increasing the overall cost of the computer system. In response to the minimum granularity problem, memory array manufacturers are moving to 8, 16 and even 32 bit wide memory organization schemes with the corresponding increase in the number of check bits required for error detection and correction.
Accordingly, what is needed in the art is an improved error detection technique that mitigates the above-described limitations in the prior art. More particularly, what is needed in the art is an improved method for detecting hard failures in a memory array that utilizes memory organization schemes wider than 4 bits.
It is therefore an object of the invention to provide an improved memory system.
It is another object of the invention to provide a method and system for detecting failures in a memory array.
To achieve the foregoing objects, and in accordance with the invention as embodied and broadly described herein, a method is disclosed for detecting a hard failure in a dynamic random access memory (DRAM) array, wherein the memory array has a plurality of cells organized in a matrix fashion of rows and columns. The method includes reading the content of a first row of cells of the memory array during a first refresh cycle. After obtaining the content from the first row of cells, a first complement of the content is generated. The generated first complement is then written back to the first row of cells during the writeback operation of the first refresh cycle. During the subsequent refresh cycle, the first complement in the first row of cells is read and a second complement of the first complement is generated. Next, the original content in the first row of cells is compared with the second complement. In response to the original content not being equal to the second complement, a control signal is generated to indicate a failure in the memory array. In a related embodiment, the second complement is written back to the first row of cells during the subsequent refresh cycle writeback operation to restore the content in the first row of cells to its original value.
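The sequence of steps in the method can be simulated in software. The following Python sketch is a behavioral model only, not the disclosed circuit: the `DramRow` class, its stuck-at masks and the `refresh_pair` function are hypothetical names introduced for illustration, and a hard failure is modeled as a cell that ignores writes.

```python
class DramRow:
    """Behavioral model of one row; stuck cells ignore what is written."""

    def __init__(self, width: int, content: int,
                 stuck_at0: int = 0, stuck_at1: int = 0):
        self.mask = (1 << width) - 1
        self.stuck_at0 = stuck_at0   # bit mask of cells stuck at 0
        self.stuck_at1 = stuck_at1   # bit mask of cells stuck at 1
        self.cells = 0
        self.write(content)

    def write(self, value: int) -> None:
        self.cells = ((value & self.mask) & ~self.stuck_at0) | self.stuck_at1

    def read(self) -> int:
        return self.cells


def refresh_pair(row: DramRow) -> bool:
    """Run two consecutive refresh cycles; return True on a detected failure."""
    original = row.read()                       # first refresh: read the row
    row.write(~original & row.mask)             # write back first complement
    first_complement = row.read()               # subsequent refresh: read
    second_complement = ~first_complement & row.mask
    failure = second_complement != original     # compare with original
    row.write(second_complement)                # restore original content
    return failure


healthy = DramRow(8, 0b10110010)
faulty = DramRow(8, 0b10110010, stuck_at0=0b00000010)
print(refresh_pair(healthy), refresh_pair(faulty))
```

In this model a stuck cell cannot hold the complemented value, so the second complement differs from the original content in that bit position regardless of what data the row held.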
In another aspect of the present invention, a failure detection circuit for use with a memory system having at least one memory array and a data IN/OUT buffer coupled to the memory array is disclosed. The failure detection circuit includes an inverter and a register coupled to the data IN/OUT buffer. The failure detection circuit also includes a comparator, coupled to the inverter and register, for comparing the contents of the inverter and register. When the contents are not equal, an error signal is generated to indicate a failed memory array. The failure detection circuit is utilized in conjunction with normally occurring refresh operations of the memory array to detect failures in the memory array. In a related embodiment, the memory array is a dynamic random access memory (DRAM).
The foregoing description has outlined, rather broadly, preferred and alternative features of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject matter of the claims of the invention. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.