1. Field of the Invention
The present invention relates to an apparatus and method for correcting errors in data accessed from a memory device.
2. Description of the Prior Art
It is known to use error correction codes (ECC) in order to protect a data packet from various forms of data corruption. Typically, this is achieved by treating the data packet as a series of data symbols of fixed length, and then adding a number of ECC symbols so that the data symbols and ECC symbols collectively form a code word. Using such a technique, if m ECC symbols are added when forming the code word, then up to m/2 randomly located symbol errors can be located and corrected within the code word. There are various known ECC coding techniques for generating the symbols of the code word. For example, one technique uses Reed Solomon codes, these codes being based on Galois field mathematics and having properties which make them suitable for hardware implementation.
One practical application for such an ECC coding technique is in memory devices, for example memory devices using DRAM (Dynamic Random Access Memory). One known arrangement of such a memory device involves providing a number of Dual Inline Memory Modules (DIMMs), where each DIMM consists of a number of DRAM chips on a circuit board, including at least one chip reserved for storing ECC information. Often, such a memory device is accessed via burst access operations, each burst comprising a plurality of beats, and the DRAM chips of the DIMM being accessed during each beat. In such an arrangement, it is known to treat the entirety of the data to be written to the memory device via a burst write access as forming the data packet, with a plurality of ECC codes then being generated to add to that data packet in order to form the code word. As mentioned earlier, if the code word includes m ECC symbols, then up to m/2 randomly located symbol errors can be corrected when the data is subsequently read from the memory via a burst read access.
Considering as a specific example a DIMM consisting of nine 8-bit DRAM chips (meaning that eight bits of data can be accessed from each DRAM chip per beat of the burst access operation), then a 72-bit DRAM interface can be provided to enable each of the nine DRAM chips to be accessed every beat. If one of the chips is reserved for storing ECC symbols, then eight bits of ECC data can be accessed per beat. If it is also assumed in this example that each symbol comprises eight bits, then it will be seen that during an example eight beat read operation, 64 data symbols and 8 ECC symbols will be accessed. Using the above mentioned property, this will mean that up to four randomly located symbols can be in error and still be corrected.
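The arithmetic of this example can be sketched as follows. This is an illustrative calculation only, assuming one symbol per chip per beat (i.e. the symbol width equals the chip width); the function name `burst_code_word` and its parameters are hypothetical and are not part of any described apparatus.

```python
def burst_code_word(chips, ecc_chips, beats):
    """Return (data_symbols, ecc_symbols, correctable) for one burst.

    Assumption: each chip contributes exactly one symbol per beat.
    """
    data_symbols = (chips - ecc_chips) * beats
    ecc_symbols = ecc_chips * beats
    # With m ECC symbols, up to m/2 randomly located symbol errors
    # can be located and corrected.
    correctable = ecc_symbols // 2
    return data_symbols, ecc_symbols, correctable


# Nine 8-bit chips, one reserved for ECC, eight beat burst.
result = burst_code_word(chips=9, ecc_chips=1, beats=8)
print(result)  # (64, 8, 4)
```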
However, it is often the case that the number of beats in the burst exceeds m/2. For example, in the above particular example of an eight beat read operation, there were eight ECC symbols (i.e. m=8) read, but there were also eight beats. Further, it has been found that errors tend to accumulate in a single chip on a DIMM, and if a particular chip within the DIMM fails, then it is likely that all of the symbols accessed from that chip during a particular burst read operation will be in error. In such a situation, the error correcting capabilities of the ECC scheme will have been exceeded.
One known way to seek to provide resilience against an entire chip failure is to provide a finer granularity of chips within the DIMM. For example, considering the earlier mentioned example of a DIMM comprising nine 8-bit DRAM chips, an equivalent DIMM could also be provided by eighteen 4-bit DRAM chips (where each DRAM chip hence provides four bits (also referred to as a nibble) of data per beat). If each symbol now comprises 4 bits, it will be seen that two of the 4-bit DRAM chips can be used to store ECC symbols, hence giving rise to sixteen 4-bit ECC symbols within the code word formed by the entire eight beat burst transfer. This then enables eight randomly located 4-bit symbol errors to be detected and corrected, hence allowing a single 4-bit chip failure (where all eight of the accesses to that chip output symbols with errors) to be corrected. The problem with this approach is that it requires more chips and a smaller symbol size. One implication of a smaller symbol size is that it can restrict the coding of the Galois field, and hence the code word formation. For example, in Reed Solomon codes the maximum code word is 2^(symbol size).
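The difference between the nine-chip and eighteen-chip arrangements can be checked with a short calculation. This is a sketch only, assuming one symbol per chip per beat and that a failed chip corrupts every symbol it supplies during the burst; the function name `tolerates_chip_failure` is hypothetical.

```python
def tolerates_chip_failure(ecc_chips, beats, failed_chips=1):
    """True if all symbols from `failed_chips` whole chips can be corrected.

    Assumptions: one symbol per chip per beat, and a failed chip corrupts
    all of its symbols in the burst.
    """
    ecc_symbols = ecc_chips * beats
    correctable = ecc_symbols // 2          # up to m/2 symbol errors
    symbols_in_error = failed_chips * beats  # one bad symbol per beat per chip
    return symbols_in_error <= correctable


# Nine 8-bit chips: one ECC chip, eight beats (8 errors vs. 4 correctable).
x8 = tolerates_chip_failure(ecc_chips=1, beats=8)
# Eighteen 4-bit chips: two ECC chips, eight beats (8 errors vs. 8 correctable).
x4 = tolerates_chip_failure(ecc_chips=2, beats=8)
print(x8, x4)  # False True
```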
Another known approach is to effectively double the size of the burst transfer, and access two DIMMs in parallel. Considering the earlier example where each DIMM consisted of nine 8-bit DRAM chips, such an approach would yield a 144-bit DRAM interface that included two ECC symbols (where the symbol size is 8 bits) per beat. With this configuration, it is apparent that the ECC scheme can be used to protect against a single 8-bit chip failure appearing in one of the two DIMMs, again assuming a burst transfer of eight beats. Alternatively, if each DIMM included 4-bit chips, such an approach could protect against two 4-bit chip failures occurring, since the use of 4-bit chips produces four ECC symbols per beat, or 32 ECC symbols in total (assuming a burst transfer of eight beats), allowing sixteen random symbol failures, equivalent to two symbols per beat. The disadvantage of such an approach is that it requires data to be spread across multiple DIMMs, hence requiring additional memory resource and also increasing the minimum memory access requirements. Further, it assumes that the Mean Time to Board Failure (MTBF) will differ between the two DIMMs, since otherwise, if a chip failure were likely to occur at the same time in both DIMMs, such an approach would provide no benefit.
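The figures quoted for the two-DIMM arrangement can be reproduced as follows. This is an illustrative sketch, assuming one symbol per chip per beat and whole-chip failures that corrupt one symbol per beat; the function name `parallel_dimm_summary` and its parameters are hypothetical.

```python
def parallel_dimm_summary(chips_per_dimm, ecc_chips_per_dimm, chip_width,
                          dimms, beats):
    """Return (interface_bits, ecc_symbols, correctable, chip_failures).

    Assumption: one symbol per chip per beat, symbol width == chip width.
    """
    interface_bits = chips_per_dimm * chip_width * dimms
    ecc_symbols = ecc_chips_per_dimm * dimms * beats
    correctable = ecc_symbols // 2
    # Each failed chip contributes one erroneous symbol per beat.
    chip_failures = correctable // beats
    return interface_bits, ecc_symbols, correctable, chip_failures


# Two DIMMs of nine 8-bit chips each, eight beat burst.
x8_pair = parallel_dimm_summary(9, 1, 8, dimms=2, beats=8)
# Two DIMMs of eighteen 4-bit chips each, eight beat burst.
x4_pair = parallel_dimm_summary(18, 2, 4, dimms=2, beats=8)
print(x8_pair)  # (144, 16, 8, 1)
print(x4_pair)  # (144, 32, 16, 2)
```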
Accordingly, it would be desirable to provide an improved mechanism for increasing resilience to failure within a region of a memory device.