This disclosure relates generally to memory devices, and more specifically, to selectively reporting memory errors.
A memory error is an event that results in a logical state in which the values read from one or more bits differ from the values that were last written. Memory errors, such as Dynamic Random Access Memory (DRAM) errors, are a concern because they can cause a machine to crash or cause applications to use corrupted data. Memory errors can be caused by electrical or magnetic interference (e.g., due to cosmic rays), by problems with the hardware (e.g., a bit being permanently damaged), or by corruption along a data path between the memories and the processing elements. Memory errors can be classified as either soft errors, which randomly corrupt bits but do not leave physical damage, or hard errors, which corrupt bits in a repeatable manner due to physical defects in a memory device or data path.
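The distinction between soft and hard errors can be illustrated with a minimal sketch. It assumes a purely simulated 8-bit memory word; the names `SimulatedWord`, `stuck_mask`, `upset`, and `classify_error` are hypothetical and are not part of this disclosure. The sketch treats an error as hard if it reappears after the expected value is rewritten, and as soft otherwise.

```python
class SimulatedWord:
    """Hypothetical 8-bit memory word used only for illustration.

    Bits set in stuck_mask model a permanent physical defect: they always
    read back as the corresponding bits of stuck_value.
    """

    def __init__(self, stuck_mask=0, stuck_value=0):
        self.stuck_mask = stuck_mask & 0xFF
        self.stuck_value = stuck_value & self.stuck_mask
        self.value = self.stuck_value

    def write(self, value):
        # Healthy bits take the written value; stuck bits ignore it.
        healthy = value & ~self.stuck_mask & 0xFF
        self.value = healthy | self.stuck_value

    def read(self):
        return self.value

    def upset(self, bit):
        # Model a transient bit flip (e.g., a particle strike).
        self.value ^= 1 << bit


def classify_error(word, expected):
    """Return 'none', 'soft', or 'hard' for the given word."""
    if word.read() == expected:
        return "none"
    # Rewrite the expected value and read it back.
    word.write(expected)
    if word.read() != expected:
        return "hard"  # the error is repeatable, so a defect is assumed
    return "soft"      # the rewrite cleared the error


# A transient flip disappears after the rewrite.
healthy = SimulatedWord()
healthy.write(0xA5)
healthy.upset(2)
print(classify_error(healthy, 0xA5))

# A stuck-at-0 bit keeps corrupting the same position.
defective = SimulatedWord(stuck_mask=0x01, stuck_value=0x00)
defective.write(0xA5)
print(classify_error(defective, 0xA5))
```

A single rewrite is a simplification: a stuck-at fault is only visible when the written value conflicts with the stuck value, so a practical test would likely try more than one data pattern.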
Memory systems in server machines may employ error detecting and correcting mechanisms, including, for example, error correcting codes (ECC). ECC memories allow the detection and correction of single-bit or multiple-bit errors. An error is a correctable error (CE) when the system can reliably detect and correct the erroneous bit or bits. For example, single-error correcting and double-error detecting (SECDED) systems can detect double-bit errors but can correct only single-bit errors.
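As an illustration of SECDED behavior, the following is a minimal sketch using an extended Hamming (8,4) code, which protects 4 data bits with 3 Hamming parity bits plus one overall parity bit. This code is chosen here only because it is small; it is an assumption for illustration, and practical ECC memories typically protect wider words. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode a 4-bit value as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); nibble is None if uncorrectable."""
    # The syndrome is the XOR of the positions of all set bits; it is 0
    # for a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd overall parity indicates a single-bit error (assuming at most
        # two flips); syndrome 0 means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: double-bit error,
        # which is detected but cannot be corrected.
        return "uncorrectable", None

    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1               # inject a single-bit error
print(decode(word))        # correctable error (CE)
word[2] ^= 1               # inject a second error
print(decode(word))        # detected but not correctable
```

The overall parity bit is what separates the two cases: a single flip changes the total parity, while a double flip leaves it even and produces a nonzero syndrome.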