Memory content errors can be classified as either persistent (or permanent) errors or transient (or soft) errors. Persistent errors are typically caused by physical malfunctions such as the failure of a memory device or the failure of a socket contact. Transient errors, on the other hand, are usually caused by energetic particles (e.g., neutrons) passing through a semiconductor device, or by signaling errors that generate faulty bits at the receiver. These errors are called transient (or soft) errors because they do not reflect a permanent failure. A “faulty bit” refers to a bit that has been corrupted by a memory content error or a signaling error.
A soft error does not always affect the outcome of a program. For example, a memory system may not read a faulty bit. Also, many memory systems include error detection and/or error correction mechanisms that can detect and/or correct a faulty bit (or bits). These mechanisms typically involve adding redundant information to data to protect it against faults. One example of an error detection mechanism is a cyclic redundancy code (CRC). An example of an error correction mechanism is an error correction code (ECC).
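As an illustration of how redundant information exposes a faulty bit, the following minimal sketch computes a CRC over a packet, inverts one bit, and recomputes the CRC at the receiver. The 8-bit polynomial (x^8 + x^2 + x + 1), the helper names crc8 and flip_bit, and the payload are assumptions chosen for brevity. They are not taken from any particular memory system.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with polynomial x^8 + x^2 + x + 1 and a zero initial value.

    The polynomial is an illustrative assumption, not one used by a specific
    memory channel.
    """
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (a single faulty bit)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


payload = b"memory packet"            # hypothetical packet contents
sent_crc = crc8(payload)              # redundant bits added by the sender

received = flip_bit(payload, 5)       # soft error on one bit in transit
error_detected = crc8(received) != sent_crc
print("error detected:", error_detected)
```

A CRC of this kind only detects the fault. Correcting it would require either a retry or a separate error correction code.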
Some soft errors, however, can affect the outcome of a program. A faulty bit that is detected by a CRC or an ECC may still affect the outcome of a program if the error cannot be corrected. A more insidious type of soft error is one that is not detected by the memory system. A soft error may escape detection if the system does not have error detection hardware that covers a specific faulty bit, so that the faulty bit is then used by the system. Also, some errors have a weight (that is, a number of faulty bits) that exceeds what the error protection mechanism is specified to detect. The term “silent data corruption” (SDC) refers to an undetected error that affects program outcome.
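A small sketch can show how an error whose weight exceeds the capability of the protection mechanism goes unnoticed. The example assumes a single even-parity bit as the protection mechanism, which is a deliberately weak scheme chosen for clarity. The stored value and the function names are hypothetical.

```python
def even_parity(word: int) -> int:
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_check_passes(word: int, parity_bit: int) -> bool:
    """Return True when the word read back is consistent with its parity bit."""
    return even_parity(word) == parity_bit


stored = 0b1011_0010                  # hypothetical 8-bit memory word
parity = even_parity(stored)          # redundant bit stored with the word

one_fault = stored ^ 0b0000_0100      # error of weight 1
two_faults = stored ^ 0b0001_0100     # error of weight 2

# A weight-1 error changes the parity and is detected.
single_detected = not parity_check_passes(one_fault, parity)

# A weight-2 error leaves the parity unchanged, so corrupted data passes the
# check. If a program consumes it, the result is silent data corruption.
double_escapes = parity_check_passes(two_faults, parity) and two_faults != stored

print("single-bit error detected:", single_detected)
print("double-bit error escapes:", double_escapes)
```

Stronger codes raise the error weight that is guaranteed to be detected, but every code with a finite number of check bits has some error patterns that it cannot detect.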
Memory channels allocate some number of signaling bit-lanes to transfer data bits, and some number of bit-lanes to transfer error detection and correction bits. In general, a reduction in the number of bit-lanes in a memory channel leads to an increase in the exposure to silent data corruption. The reason for this is that the loss of a bit-lane causes a reduction in the amount of correction data that can be added to a packet of data sent through the memory channel. Typically, the amount of correction data added to a packet sent over a memory channel cannot be increased to compensate for a failed bit-lane because memory channels are designed to maintain short and precise round-trip times for packets.
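The relationship between bit-lanes and available correction data can be sketched with simple arithmetic. The figures below (14 lanes, 12 transfers per packet, 144 data bits) are illustrative assumptions, and the sketch assumes that the number of transfers per packet stays fixed so that the round-trip time is preserved.

```python
def check_bits_available(lanes: int, transfers: int, data_bits: int) -> int:
    """Bits left for error detection/correction in one fixed-length packet.

    A packet occupies `lanes` bit-lanes for `transfers` unit intervals, so it
    carries lanes * transfers bits in total. Whatever is not needed for data
    is available for check bits.
    """
    total_bits = lanes * transfers
    if total_bits < data_bits:
        raise ValueError("packet cannot carry the requested data bits")
    return total_bits - data_bits


# Hypothetical channel parameters, chosen only for illustration.
TRANSFERS_PER_PACKET = 12
DATA_BITS = 144

healthy = check_bits_available(14, TRANSFERS_PER_PACKET, DATA_BITS)
degraded = check_bits_available(13, TRANSFERS_PER_PACKET, DATA_BITS)

print("check bits with 14 lanes:", healthy)
print("check bits with 13 lanes:", degraded)
```

Under these assumptions, losing one lane removes 12 of the 24 check bits while the data payload stays the same size, which is why exposure to silent data corruption increases.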
Conventional memory systems (e.g., fully-buffered dual inline memory systems) use a 12-bit CRC (e.g., CRC-12) to detect a link signaling fault on a memory channel having 14 bit-lanes. These conventional memory systems also separately use an ECC to detect (and possibly correct) memory content errors. The ECCs in conventional memory systems are optimized to get a target level of functionality with the lowest latency over the smallest number of memory bits. Conventional ECCs, however, are not optimized to provide signaling fault detection.
Memory systems exhibit latency for reasons related to the input/output (I/O) rate of the memory channel and the access time of the memory devices. This latency is frequently important when designing a memory system. For example, conventional memory systems are typically designed to provide high reliability at the lowest possible latency. To meet these design goals, a minimum packet size is typically selected for packets transmitted over the memory channel. The minimum packet size typically includes K data bits protected by the minimum number of J correction bits needed to achieve a targeted level of reliability.
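To make the trade-off between K data bits and J correction bits concrete, the following sketch computes the smallest J that permits single-error correction, using the Hamming bound 2^J >= K + J + 1. Single-error correction is only an assumed reliability target. A real memory system may aim for a different level of protection, and the function name is hypothetical.

```python
def min_check_bits_sec(k: int) -> int:
    """Smallest J such that 2**J >= K + J + 1 (single-error correction).

    Each of the K + J bit positions, plus the "no error" case, needs its own
    distinct syndrome value, which gives the inequality above.
    """
    if k < 1:
        raise ValueError("k must be a positive number of data bits")
    j = 1
    while 2 ** j < k + j + 1:
        j += 1
    return j


for k in (8, 16, 32, 64, 128):
    j = min_check_bits_sec(k)
    overhead = j / (k + j)
    print(f"K={k:3d}  J={j}  overhead={overhead:.1%}")
```

The relative overhead J / (K + J) shrinks as K grows, so larger packets protect data more efficiently at the cost of taking longer to transfer.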
Recently, the I/O rate of dynamic random access memory (DRAM) has increased at a much faster rate than the access time for DRAM. Thus, the share of latency due to the I/O rate is decreasing in comparison to the share of latency due to access time. Many conventional memory systems do not, however, take full advantage of the increase in I/O rates.
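The shift in latency shares can be illustrated with a simplified model in which total latency is the packet transfer time plus a fixed device access time. All numbers below (packet size, lane count, per-lane rates, and access time) are made-up values for illustration, not measurements of any DRAM device.

```python
def io_latency_share(packet_bits: int, lanes: int,
                     gbps_per_lane: float, access_ns: float) -> float:
    """Fraction of total latency spent transferring one packet.

    Simplified model: latency = transfer time + access time. A rate of
    1 Gb/s corresponds to one bit per nanosecond per lane.
    """
    transfer_ns = packet_bits / (lanes * gbps_per_lane)
    return transfer_ns / (transfer_ns + access_ns)


# Hypothetical parameters: same packet and access time, faster I/O.
PACKET_BITS = 168
LANES = 14
ACCESS_NS = 40.0

slow_io = io_latency_share(PACKET_BITS, LANES, 1.0, ACCESS_NS)
fast_io = io_latency_share(PACKET_BITS, LANES, 4.0, ACCESS_NS)

print(f"I/O share at 1 Gb/s per lane: {slow_io:.1%}")
print(f"I/O share at 4 Gb/s per lane: {fast_io:.1%}")
```

In this model, a fourfold increase in I/O rate with an unchanged access time shrinks the transfer time from 12 ns to 3 ns, so the time spent on the channel becomes a much smaller part of the total.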