The present invention relates to computer memory and, more specifically, to memory error recovery in a computer system.
In some applications, writing to memory in a computer system includes writing to one of multiple memory devices. For example, memory in a server comprises a number of memory devices such as dynamic random-access memory (DRAM) chips. Writing data to memory of the server typically involves writing to multiple DRAM chips. To ensure that data are correctly written and retrieved, error-correcting code (ECC) bits are generally written along with the data so that the ECC bits may be verified in the read data. The ECC bits are included with stored data through an encoding process and are verified in read data through a decoding process.
Processing of the ECC bits by a decoder may lead to the inclusion of a chip mark. The chip mark identifies one of the DRAMs and indicates that all data from that DRAM must be corrected. Processing of the ECC bits may also lead to the inclusion of a symbol mark. A symbol is a subset of the addresses of one DRAM, and the number of addresses in the range defined as a symbol may differ based on the memory device. Thus, the symbol mark indicates that data from a subset of addresses of one of the DRAMs must be corrected. When a symbol mark or chip mark is used for a soft error (e.g., a temporary error), the mark remains occupied, so that the marking feature is unavailable for a hard error (e.g., a persistent error).
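The marking limitation described above can be illustrated with a minimal sketch. It is not the decoder of any particular system: the names `Mark` and `MarkStore` are hypothetical, and the capacity of one chip mark and one symbol mark is an assumption chosen only to make the exhaustion visible.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Mark:
    """A hypothetical record of a mark produced while decoding ECC bits."""
    kind: str                      # "chip" or "symbol"
    dram: int                      # index of the marked DRAM
    symbol: Optional[int] = None   # symbol index within the DRAM; None for a chip mark


@dataclass
class MarkStore:
    """Holds the marks currently in use (capacities are assumed, not from the source)."""
    capacity: dict = field(default_factory=lambda: {"chip": 1, "symbol": 1})
    marks: list = field(default_factory=list)

    def place(self, mark: Mark) -> bool:
        """Place a mark if one of that kind is still available."""
        used = sum(1 for m in self.marks if m.kind == mark.kind)
        if used >= self.capacity[mark.kind]:
            return False           # marking feature already consumed
        self.marks.append(mark)
        return True

    def must_correct(self, dram: int, symbol: int) -> bool:
        """Return True if data read from this DRAM and symbol is covered by a mark."""
        for m in self.marks:
            if m.dram != dram:
                continue
            # A chip mark covers the whole DRAM; a symbol mark covers one symbol.
            if m.kind == "chip" or m.symbol == symbol:
                return True
        return False


store = MarkStore()
# A chip mark is placed for a soft (temporary) error on DRAM 3.
soft_placed = store.place(Mark("chip", dram=3))
# A later hard (persistent) error on DRAM 7 finds no chip mark available.
hard_placed = store.place(Mark("chip", dram=7))
print(soft_placed, hard_placed)
```

In this sketch the second call to `place` fails because the only assumed chip mark is still held by the earlier soft error, which is the situation the background describes.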