The present invention relates to a method and apparatus for detecting and correcting errors in computer memory and, in particular, to on-chip error detection/correction in dynamic random access memory (DRAM) on a multi-word or page basis.
Bits stored in a computer memory are susceptible to the occurrence of errors, i.e., changes in the value of bits stored in memory. Such errors can be "hard" (i.e., permanent) hardware errors or "soft" (i.e., non-permanent) errors. Soft errors are often the result of radiation, such as cosmic radiation or radiation from decay of radioactive material. Soft errors occur in memory at random locations and times, although at an average rate which can be empirically determined.
A common scheme for detecting and correcting errors in memory involves storing, in addition to the desired data bits, one or more error correction ("EC") bits in association with the stored data bits. A number of algorithms are available for generating EC bits which permit the detection or correction of errors in the stored data bits. For example, Frederick F. Sellers, et al., Error Detecting Logic For Digital Computers, McGraw-Hill Book Co., describes common circuitry used in error detection. Error correction or detection can be performed at a number of levels. For example, an EC scheme can be devised to detect or correct, at most, a one-bit error in each EC cycle using a given number of stored error correction bits. In order to detect or correct two or more errors per EC cycle, a larger number of EC bits needs to be stored.
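By way of illustration only, the storage of EC bits alongside data bits, and the correction of a one-bit error, can be sketched with a textbook Hamming single-error-correcting code. This is not asserted to be the algorithm used by any device discussed herein; the function names `hamming_encode` and `hamming_correct` are hypothetical and chosen for this sketch.

```python
def hamming_encode(data_bits):
    """Build a Hamming codeword from a list of 0/1 data bits.

    Positions are 1-indexed. Power-of-two positions hold even-parity
    check (EC) bits; all other positions hold data bits.
    """
    k = len(data_bits)
    # Smallest number of check bits r satisfying 2**r >= k + r + 1.
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    n = k + r
    code = [0] * (n + 1)  # index 0 is unused so indices match positions
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(data)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= code[pos]
        code[p] = parity  # makes the parity over this group even
    return code[1:]


def hamming_correct(codeword):
    """Return (corrected_codeword, syndrome).

    The syndrome is the XOR of the positions of all set bits. It is 0
    for a valid codeword and equals the position of the flipped bit
    when exactly one bit is in error.
    """
    syndrome = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(codeword)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome


if __name__ == "__main__":
    stored = hamming_encode([1, 0, 1, 1])
    damaged = list(stored)
    damaged[4] ^= 1  # simulate a soft error at position 5
    repaired, where = hamming_correct(damaged)
    print(stored, damaged, repaired, where)
```

A code of this kind corrects at most one erroneous bit per codeword; two or more errors in the same codeword would be miscorrected, which is why stronger codes with more EC bits are needed for higher levels of correction.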
A typical error correction scheme is described in U.S. Pat. No. 4,719,627, issued Jan. 12, 1988, to Peterson, et al. In this scheme, following a row address strobe (RAS) and column address strobe (CAS), a 118-bit word and a corresponding 17-bit error correction code are transmitted over a bus to an off-chip error detection/correction device. This process is repeated for each read/write cycle.
Any error correction scheme requires a finite time for completion, which thus represents a cost of error correction. The time required for error correction is related to the number of error correction bits and the number of data bits being processed during any one error correction cycle. In general, greater efficiency is achieved by processing a larger number of bits in each EC cycle. In previous devices, the overhead cost of error correction was relatively high because the number of data bits and error correction bits processed in a single cycle was limited by at least two factors. First, the memory access mode used determines the number of data bits and error correction bits which can be accessed during a single cycle, i.e., a single assertion of required memory access signals, such as a row address strobe (RAS) and a column address strobe (CAS). Second, because prior devices required transmission of the data and error bits over a bus, the bandwidth of the bus limited the number of bits which could be transmitted for processing during any one EC cycle.
In addition to the time cost of EC processes, there is a storage cost, since memory which is used to store EC bits is unavailable for data storage. For any level of correction (i.e., one-bit error correction, two-bit error correction, etc.), the number of stored EC bits required grows only approximately logarithmically with the number of data bits being error-checked. For this reason, as the number of data bits checked in each EC cycle increases, the ratio of EC bits to data bits (and thus the ratio of unusable to usable memory locations) decreases. However, since the number of bits processed in a cycle was limited in previous devices (as discussed above), the storage cost, i.e., the ratio of unusable EC bits to usable data bits, was relatively high.
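The declining ratio of EC bits to data bits can be illustrated, for the one-bit correction case only, using the standard Hamming condition that r check bits protect k data bits when 2^r >= k + r + 1. The following is a minimal sketch under that assumption; the helper name `sec_check_bits` is hypothetical, and other correction levels or codes would give different counts.

```python
def sec_check_bits(k):
    """Smallest r with 2**r >= k + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


if __name__ == "__main__":
    # Tabulate EC bits and storage overhead for several block sizes.
    for k in (8, 16, 32, 64, 1024):
        r = sec_check_bits(k)
        print(f"data bits={k:5d}  EC bits={r:2d}  ratio={r / k:.4f}")
```

Under this condition, an 8-bit block needs 4 EC bits (a ratio of 0.5), whereas a 1024-bit block needs only 11, so checking more data bits per EC cycle sharply reduces the fraction of memory given over to EC storage.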
Error correction schemes are able to correct no more than a predetermined number of bits per EC cycle. Thus, it is desirable to perform error correction with sufficient frequency that accumulation of more than the predetermined number of errors between correction cycles will be unlikely. To accomplish this goal, many memory systems include a periodic "scrubbing" cycle in which all memory locations are error-corrected. Such "scrubbing" is typically conducted in addition to any error correction which may be performed during normal read/write access. During the scrub cycle, the memory is unavailable for other use, and thus scrubbing represents an overhead cost of error correction. One approach to reducing such overhead is described in U.S. Pat. No. 4,682,328, issued Jul. 21, 1987, to Ramsay, et al. In this approach, parity checking and data recovery are performed during DRAM refresh. However, the efficiency of such parity checking is limited by the number of data bits which can be parity checked in each parity-checking cycle.
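The relationship between scrubbing frequency and the likelihood of accumulating more errors than can be corrected can be sketched numerically. The sketch below assumes, purely for illustration, that soft errors in a given codeword arrive as a Poisson process at a constant average rate; that model, the function name `prob_uncorrectable`, and the example rate are assumptions made here and are not taken from the references discussed.

```python
import math


def prob_uncorrectable(rate_per_hour, scrub_interval_hours, correctable=1):
    """Probability that one codeword collects more than `correctable`
    errors between two scrubs, under an assumed Poisson arrival model."""
    mean = rate_per_hour * scrub_interval_hours
    # Probability of 0..correctable errors in the interval.
    p_ok = sum(
        math.exp(-mean) * mean ** n / math.factorial(n)
        for n in range(correctable + 1)
    )
    return 1.0 - p_ok


if __name__ == "__main__":
    rate = 1e-6  # hypothetical errors per codeword per hour
    for hours in (1, 24, 240):
        print(hours, prob_uncorrectable(rate, hours))
```

For small error rates the result scales roughly with the square of the scrub interval when one error per codeword is correctable, which indicates why more frequent scrubbing, despite its overhead, markedly reduces the chance of an uncorrectable accumulation.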
U.S. Pat. No. 4,335,459, issued Jun. 15, 1982, to Miller, discloses providing error correction circuitry on the memory chip to increase yield and reliability and decrease cost and power consumption. This patent, like most previously described devices, relates to memories in which each memory access provides data having a bit width equal to the word size. Several modern memory devices, however, permit access in one or more of several multi-word modes, i.e., modes permitting access to more than one addressable unit without separately asserting and deasserting RAS and CAS for each addressable unit. In typical memories, each word of memory is an addressable unit. In such systems, multi-word modes permit access to more than one word without asserting (and deasserting) RAS and CAS for each word which is accessed. Examples of multi-word mode memories include fast page mode memories and static column mode memories. Multi-word mode memories permit accessing, during any one memory cycle, a number of bits greater than the bit width of the data bus. Typically, such modes permit accessing all data in an entire row of the memory.