Memory content errors can be classified as either persistent (or permanent) errors or transient (or soft) errors. Persistent errors are typically caused by physical malfunctions such as the failure of a memory device or the failure of a socket contact. Transient errors, on the other hand, are usually caused by energetic particles (e.g., neutrons) passing through a semiconductor device, or by signaling errors that generate faulty bits at the receiver. These errors are called transient (or soft) errors because they do not reflect a permanent failure. A “faulty bit” refers to a bit that has been corrupted by a memory content or signaling error.
A soft error does not always affect the outcome of a program. For example, a memory system may not read a faulty bit. Also, many memory systems include error detection and/or error correction mechanisms that can detect and/or correct a faulty bit (or bits). These mechanisms typically involve adding redundant information to data to protect it against faults. One example of an error detection mechanism is a cyclic redundancy code (CRC). An example of an error correction mechanism is an error correction code (ECC).
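As a minimal sketch of how redundant information exposes a faulty bit, the Python snippet below computes a CRC over a small payload, flips one bit, and recomputes the check. The CRC-8 polynomial (0x07), the payload, and the helper names crc8 and flip_bit are illustrative assumptions and are not taken from any particular memory system.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with an assumed polynomial x^8 + x^2 + x + 1 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, modeling a single faulty bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


payload = b"memory line"          # hypothetical data protected by the CRC
stored_crc = crc8(payload)        # redundant information kept alongside the data

faulty = flip_bit(payload, 13)    # a soft error corrupts one bit
error_detected = crc8(faulty) != stored_crc
print("error detected:", error_detected)
```

A CRC of this kind only detects the fault. Recovering the original data requires either a retry or an error correction code.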
Some soft errors, however, can affect the outcome of a program. A faulty bit that is detected by a CRC or an ECC may still affect the outcome of a program if the error cannot be corrected. A more insidious type of soft error is one that is not detected by the memory system. A soft error may escape detection if the system does not have error detection hardware that covers a specific faulty bit, and then that data bit is used by the system. Also, some error patterns have a weight (that is, a number of faulty bits) beyond what the error protection mechanism is specified to detect. The term “silent data corruption” (SDC) refers to an error that is not detected and affects program outcome.
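The following sketch illustrates how a multi-bit error can slip past a detection code. It assumes a simple CRC-8 (polynomial 0x07, initial value 0) appended to two data bytes; the polynomial, the data values, and the error pattern are hypothetical choices made for illustration. Because the four flipped bits form a multiple of the generator polynomial, the corrupted packet still passes the check.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with an assumed polynomial x^8 + x^2 + x + 1 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


data = bytes([0x3C, 0xA5])                 # hypothetical payload
sent = data + bytes([crc8(data)])          # payload followed by its CRC

# Error pattern 0x000107: one flipped data bit plus three flipped CRC bits.
# Read as a polynomial, it equals the generator x^8 + x^2 + x + 1.
error = bytes([0x00, 0x01, 0x07])
received = bytes(a ^ b for a, b in zip(sent, error))

rx_data, rx_crc = received[:-1], received[-1]
check_passes = crc8(rx_data) == rx_crc     # receiver sees a valid CRC
data_corrupted = rx_data != data           # yet the payload has changed
print("check passes:", check_passes, "| data corrupted:", data_corrupted)
```

If the receiver then consumes rx_data, the result is an instance of silent data corruption in this toy model.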
The frequency with which a system exhibits soft errors (e.g., the soft error rate (SER)) is typically expressed in failures in time (FIT). One FIT signifies one error in a billion hours of operation. Memory systems are designed to operate within a specified FIT budget. There are a number of factors that can potentially impact a system's FIT budget.
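To make the FIT unit concrete, the sketch below converts a FIT rate into an expected number of failures and a mean time between failures. All FIT values, the fleet size, and the function names are hypothetical, and summing component FITs assumes independent failures at a constant rate.

```python
HOURS_PER_FIT_WINDOW = 1_000_000_000  # one FIT = one failure per 10^9 hours


def expected_failures(fit_per_system: float, system_count: int, hours: float) -> float:
    """Expected failures across a fleet, assuming a constant failure rate."""
    return fit_per_system * system_count * hours / HOURS_PER_FIT_WINDOW


def mean_hours_between_failures(total_fit: float) -> float:
    """Mean time between failures, in hours, for a given FIT rate."""
    return HOURS_PER_FIT_WINDOW / total_fit


# Hypothetical per-component contributions to one system's FIT budget.
component_fits = {"dram_devices": 400.0, "memory_channel": 100.0}
system_fit = sum(component_fits.values())

failures_per_year = expected_failures(system_fit, system_count=10_000, hours=24 * 365)
print(f"system FIT: {system_fit:.0f}")
print(f"expected failures per year across 10,000 systems: {failures_per_year:.1f}")
print(f"mean hours between failures per system: {mean_hours_between_failures(system_fit):,.0f}")
```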
Memory channels allocate some number of signaling bit-lanes to transfer data bits, and some number of bit-lanes to transfer error detection and correction bits. In general, a reduction in the number of bit-lanes in a memory channel leads to an increase in the exposure to silent data corruption. The reason for this is that the loss of a bit-lane causes a reduction in the amount of correction data that can be added to a packet of data sent through the memory channel. Typically, the amount of correction data added to a packet sent over a memory channel cannot be increased to compensate for a failed bit-lane because memory channels are designed to maintain short and precise round-trip times for packets.
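The relationship between lane count and correction data can be sketched with simple arithmetic. The example below assumes a packet that occupies a fixed number of transfers per lane and carries a fixed data payload, so any bits left over are available for check data. The figures of 12 transfers per packet and 144 payload bits are assumptions chosen for illustration, as is the function name.

```python
def check_bits_per_packet(active_lanes: int,
                          transfers_per_packet: int = 12,
                          payload_bits: int = 144) -> int:
    """Bits left for error detection/correction after the payload is placed.

    The packet duration (transfers_per_packet) is held fixed, reflecting the
    constraint that round-trip time cannot grow to make room for more check bits.
    """
    total_bits = active_lanes * transfers_per_packet
    if total_bits < payload_bits:
        raise ValueError("too few lanes to carry the payload in one packet")
    return total_bits - payload_bits


for lanes in (14, 13, 12):
    print(f"{lanes} lanes -> {check_bits_per_packet(lanes)} check bits per packet")
```

Under these assumed numbers, losing one lane halves the check bits and losing two leaves none, which is why a failed lane increases the exposure to undetected errors.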
One approach to maintaining an SER budget, despite the loss of a bit-lane, is to add a spare bit-lane to the memory channel. This spare bit-lane can be held in reserve and used for correction data if another bit-lane fails. For example, a fifteenth bit-lane can be added to a memory channel that normally includes fourteen bit-lanes. This fifteenth bit-lane can be used for correction data (such as CRC data) should one of the original fourteen bit-lanes fail.
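A simplified model of the spare-lane approach is sketched below: the channel needs fourteen working lanes, and a fifteenth physical lane lets it keep that width after a single failure. The function name and the remapping policy (use the lowest-numbered healthy lanes) are hypothetical and are not drawn from any specific channel design.

```python
from typing import List, Optional, Set


def assign_lanes(physical_lanes: int, failed: Set[int], required: int = 14) -> Optional[List[int]]:
    """Pick the physical lanes to use, or None if full width cannot be kept."""
    healthy = [lane for lane in range(physical_lanes) if lane not in failed]
    if len(healthy) < required:
        return None  # not enough lanes: check data would have to be reduced
    return healthy[:required]


# With a spare (15 physical lanes), one failure still leaves 14 usable lanes.
with_spare = assign_lanes(15, failed={3})
# Without a spare (14 physical lanes), the same failure cannot be absorbed.
without_spare = assign_lanes(14, failed={3})

print("with spare:   ", with_spare)
print("without spare:", without_spare)
```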
The spare bit-lane approach, however, has a number of disadvantages. Additional bit-lanes add complexity to a memory system and also increase its cost and power consumption. Hence, alternative solutions that can maintain memory channel reliability without requiring spare bit-lanes are very desirable.