1. Field of the Invention
This invention relates to error correction in memory and more particularly relates to distinguishing between temporary and permanent bit errors in memory modules.
2. Description of the Related Art
Computer memory is an essential element of any computing system and data integrity is vitally important to prevent computing errors. Computer memory may be static, where data is retained when the computer is not operating, or dynamic, where data is lost when the computer is not operating. Random access memory (“RAM”) is a typical form of dynamic memory that loses its contents when a computer is shut down. Hard disk drives, compact disc (“CD”) drives, optical drives, and the like are forms of static memory that retain data when no power is applied.
FIG. 1A is a simplified diagram illustrating how memory 100 may be organized. The memory 100 typically includes many cells 102 where each cell 102 represents a single bit. A common memory structure includes cells 102 organized into some type of matrix with rows 104 and columns 106. Typically, either the rows 104 or the columns 106 represent bits stored together at a particular memory address, and the group is often sized to match a number of channels or lines of a data bus. In FIG. 1A, for simplicity, eight columns 106 of a particular row 104 represent eight bits of data stored in one memory location. A memory location often includes 16 bits, 32 bits, 64 bits, etc., stored together and accessible by a memory address. For example, a row 104 of memory 100 may include 64 bits of data for a 64-bit processor using a 64-bit data bus. Often the bits held by one memory module form only part of a memory location, and several memory modules operate together to provide a complete memory location of a suitable number of bits. Memory modules will be explained further in relation to FIG. 1B.
For the simplified memory 100 shown in FIG. 1A, each row 104 represents a separate memory address with 8 bits of data, each stored in a separate cell 102. One method of reading and writing data to a particular memory location is to select a particular row 104 and then to store either a “1” or a “0” in each cell 102 of the row 104. Selecting a row 104 may require some type of wire 108 or data transmission pathway to activate the row 104 and another wire 110 or data transmission pathway to each cell 102 in the row 104. If a cell 102 is made up of transistors, a row wire 108 may enable transistors within the row 104 and column wires 110 may be used to write data to each cell 102 of the row 104. For example, data may be written into row R1 such that the cells 102 of row R1 represent bits with values of 0010 1101.
It is not uncommon for data stored in memory to occasionally have an error. For example, a particular cell 102 may have an error. FIG. 1A depicts a cell 112 with an error in row R2 at the second bit of the row, corresponding to column C2 114. The error may be temporary or permanent. For example, an error may be due to some random voltage fluctuation, static discharge, alpha particle, etc., that causes a cell 112 to register a different value than intended. A permanent error may be caused by failure of a transistor, gate, discontinuity or failure in the memory material, etc., that may cause a cell 112 to remain in one state regardless of what is written to the cell 112.
Another type of error may cause a data transmission pathway from a particular cell 112 to be unresponsive to the contents of the cell 112. The data transmission pathway may access a number of cells 102. For example, the wire or data transmission pathway 114 corresponding to column C2, as depicted in FIG. 1A, may be in error. Reading any address of the memory 100 may result in a “1” on column C2 114 regardless of the stored contents in the memory 100. This data line error may again be temporary or permanent. A data line error 114 may be considered worse than a cell error 112 because every memory location read using the data line 114 has roughly a 50/50 chance of being in error, assuming the stored bit is equally likely to be a “1” or a “0”.
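By way of illustration only, the 50/50 figure can be checked with a short Python sketch that models the eight-bit memory 100 of FIG. 1A with the data line for column C2 stuck at “1”. The function names, the left-to-right column numbering, and the stuck-at model are assumptions made for this sketch and are not taken from any particular memory device.

```python
def read_with_stuck_line(stored: int, column: int, stuck_value: int, width: int = 8) -> int:
    """Return the value seen on the data bus when one data line is stuck.

    Assumption: column 1 is the leftmost (most significant) bit of the row.
    """
    mask = 1 << (width - column)
    if stuck_value:
        return stored | mask      # line always reads "1"
    return stored & ~mask         # line always reads "0"


def count_corrupted_reads(column: int, stuck_value: int, width: int = 8) -> int:
    """Count how many of the possible stored values read back incorrectly."""
    return sum(
        1
        for value in range(1 << width)
        if read_with_stuck_line(value, column, stuck_value, width) != value
    )


# Row R1 from FIG. 1A (0010 1101) read through a C2 line stuck at "1".
seen = read_with_stuck_line(0b00101101, column=2, stuck_value=1)
corrupted = count_corrupted_reads(column=2, stuck_value=1)
```

Under these assumptions, exactly half of the 256 possible stored values read back incorrectly, which matches the 50/50 estimate for uniformly distributed data.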
Another type of memory error is a memory module error. FIG. 1B is a depiction of a system 101 of memory modules operating to provide 64-bit memory locations. The system 101 includes memory modules 1-9 116 that, in operation, are connected to a memory controller 118. A spare memory module 120 is also connected to the memory controller 118. Memory modules 1-8 116A-H each contribute 8 bits of memory, as depicted in FIG. 1A, to an addressable 64-bit memory location. Memory module 9 116I may be used to store error-correcting code (described below). The spare memory module 120 may be activated in case of failure of another memory module 116. For example, if memory module 1 116A fails, the spare memory module 120 may be quickly brought online to take over for memory module 1 116A. Detecting and correcting cell errors 112, data line errors 114, and memory module 116 errors is crucial to data integrity.
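As a rough illustration of how eight memory modules 116A-H may each contribute 8 bits to a 64-bit memory location, the following Python sketch splits a 64-bit word into eight one-byte slices and reassembles it. The assignment of the most significant byte to memory module 1 and the function names are assumptions made for this sketch; an actual memory controller 118 may distribute bits differently.

```python
from typing import List


def split_word(word: int) -> List[int]:
    """Split a 64-bit word into eight 8-bit slices, one per memory module.

    Assumption: slice 0 (memory module 1) holds the most significant byte.
    """
    return list(word.to_bytes(8, "big"))


def join_word(slices: List[int]) -> int:
    """Reassemble the 64-bit word from the eight per-module slices."""
    return int.from_bytes(bytes(slices), "big")


slices = split_word(0x0123456789ABCDEF)
restored = join_word(slices)
```

In this model, a failure of one memory module corrupts one 8-bit slice of every memory location, which is why a memory module error affects far more data than a single cell error.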
Computer memory often includes some type of error detection and correction to maintain integrity of data stored in the memory. Numerous error detection methods are and have been used to detect errors in data stored in memory. Some of the error detection methods allow correction of errors without requiring the source of the data to resend the data in error. Many commonly used error detection and correction methods can detect errors in two bits of a particular memory location and can correct single-bit errors.
Error-correcting code (“ECC”) may be the product of an error detection and correction scheme. Typically, ECC for a particular set of data is stored with the data. For example, for a 64-bit system with 64-bit memory, a particular error detection and correction scheme may generate ECC based on 64 bits residing on or to be stored in the memory. The ECC may include a few extra bits that may be stored with the corresponding 64 bits of data. Examples of error detection and correction schemes that generate ECC include Hamming code, BCH code, Reed-Solomon code, Reed-Muller code, binary Golay code, convolutional code, and turbo code.
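For illustration only, the following Python sketch shows a textbook extended Hamming code that corrects single-bit errors and detects double-bit errors. It is scaled down to 4 data bits and 4 check bits for readability and is not the scheme of any particular memory system; a 64-bit memory location would use a wider code. The function names and bit layout are assumptions made for this sketch.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Layout (assumed): index 0 is the overall parity bit, indices 1, 2 and 4
    are Hamming parity bits, and indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2   # makes the total number of 1s even
    return code


def decode(code: List[int]) -> Tuple[str, List[int]]:
    """Return a status string and the (possibly corrected) data bits."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # nonzero syndrome names the flipped position
    overall = sum(code) % 2       # 1 means an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        fixed[syndrome] ^= 1      # single-bit error; syndrome 0 is the parity bit
        status = "corrected"
    else:
        status = "uncorrectable"  # two bits flipped: detected but not correctable
    return status, [fixed[3], fixed[5], fixed[6], fixed[7]]


data = [1, 0, 1, 1]
word = encode(data)
one_error = list(word)
one_error[5] ^= 1
status, recovered = decode(one_error)
```

The decoder returns the original data for any single flipped bit regardless of whether the flip is temporary or permanent, because the check bits carry no information about the cause of the error.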
Computers, peripherals, application-specific integrated circuits (“ASICs”), etc. with memory that includes ECC stored with data often count recoverable errors. Recoverable errors are errors that can be corrected using the ECC associated with the data. For an error detection and correction scheme that can correct single-bit errors, any data with a single-bit error can be corrected using the ECC regardless of whether the cause of the error is temporary or permanent. A bit error count for recoverable errors may be used to signal a deterioration of the memory or an associated memory controller. The bit error count may be used to generate an error message of some type and may be used to preemptively signal a need to take corrective action, such as maintenance, memory replacement, etc. Non-recoverable errors typically cause more disruption and are usually dealt with on a more immediate basis.
A bit error count typically increments slowly for random events caused by temporary errors during normal operation, but then may increment more quickly as memory or a memory controller starts to degrade. By contrast, a single permanent bit error may increment the bit error count quickly if the memory address containing the permanent bit error is accessed frequently, because each access may generate another recoverable error. The bit error count may then increase quickly, signaling a problem with the memory. However, the memory may otherwise be functioning correctly even though the single memory cell is incorrect.
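The effect of access frequency on a bit error count can be sketched in Python as follows. The addresses, access pattern, and function name are hypothetical, and the sketch assumes for simplicity that every read of an address containing a permanent bit error produces one recoverable error.

```python
from typing import Iterable, Set


def recoverable_error_count(accesses: Iterable[int], faulty_addresses: Set[int]) -> int:
    """Count recoverable errors the way a single global counter would.

    Assumption: each access to a faulty address yields exactly one
    corrected (recoverable) error.
    """
    return sum(1 for address in accesses if address in faulty_addresses)


# Hypothetical workload: one heavily used address and 100 rarely used ones.
hot_address = 0x10
cold_addresses = list(range(0x100, 0x164))
accesses = [hot_address] * 1000 + cold_addresses

# The same single permanent cell error, placed at two different addresses.
count_if_hot_cell_fails = recoverable_error_count(accesses, {hot_address})
count_if_cold_cell_fails = recoverable_error_count(accesses, {0x100})
```

Under these assumptions, the same single-cell fault produces a count of 1000 at the frequently accessed address and a count of 1 at the rarely accessed address, so the raw count alone says little about how much of the memory is actually faulty.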
One theory of memory management is that the memory should continue to operate because any errors in data caused by the permanent memory cell error can be corrected by ECC. In this case, the reliability of the memory has been reduced because a second error at the memory address containing the permanent error is not correctable using typical ECC methods. However, the current state of the art is unable to distinguish between single permanent errors and single random, temporary errors. In addition, the current state of the art is unable to distinguish bit line errors and memory module errors from other errors.