A computer system is only as reliable as its underlying hardware. Faults in the Random Access Memory (RAM) of a computer system, whether permanent or transient, often manifest themselves as software instability and crashes. When applications or the Operating System (OS) crash, the user may assume the cause is software related and therefore blame the software developer for the instability of the computer system. This not only hurts the reputation of the software developer in the marketplace, but also requires the developer to provide customer service to help users resolve problems that actually arise from RAM failures.
Aside from the harm to the software developer caused by faults in RAM, there is the possibility that a fault could result in corruption of the user's data or other undesired and unforeseen consequences.
To reduce memory errors, it is known to perform diagnostic tests on RAM at start-up. These tests are performed before the operating system begins executing because they alter the contents of the memory, which would interfere with an executing operating system or other software components. For this purpose, some memory chips include built-in self-test (BIST) circuitry and can provide information identifying faulty pages in the memory to a memory manager in the operating system. Alternatively, some operating systems incorporate memory tests so that the operating system itself can identify faulty pages in memory. The memory manager can then maintain stored information about the faulty pages. When an application requests that memory be allocated to it, pages that have been identified as faulty are not allocated.
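The flow described above (a destructive start-up test, a record of faulty pages, and an allocator that skips them) can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the names `SimulatedRam`, `pattern_test`, and `PageAllocator`, the 16-byte page size, and the four test patterns are assumptions, and the stuck-at-zero bit merely simulates a hardware fault. Real BIST circuitry and operating system memory managers work differently in detail.

```python
class SimulatedRam:
    """Byte-addressable toy memory; selected bits can be stuck at zero (assumed fault model)."""

    def __init__(self, size, stuck_at_zero=None):
        self.cells = bytearray(size)
        # Maps address -> bit mask of bits that always read back as 0.
        self.stuck = dict(stuck_at_zero or {})

    def __len__(self):
        return len(self.cells)

    def __setitem__(self, addr, value):
        # A stuck-at-zero bit silently drops any 1 written to it.
        self.cells[addr] = value & ~self.stuck.get(addr, 0) & 0xFF

    def __getitem__(self, addr):
        return self.cells[addr]


def pattern_test(memory, page_size, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Destructive test: overwrite all of memory with each pattern and read it back.

    Returns the set of page numbers that contain at least one mismatching byte.
    Because the test overwrites memory, it must run before anything else uses it.
    """
    faulty = set()
    for pattern in patterns:
        for addr in range(len(memory)):
            memory[addr] = pattern
        for addr in range(len(memory)):
            if memory[addr] != pattern:
                faulty.add(addr // page_size)
    return faulty


class PageAllocator:
    """Toy memory manager that never hands out pages recorded as faulty."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.faulty_pages = set()

    def mark_faulty(self, page):
        self.faulty_pages.add(page)
        if page in self.free_pages:
            self.free_pages.remove(page)

    def allocate(self, count):
        if count > len(self.free_pages):
            raise MemoryError("not enough healthy pages")
        granted = self.free_pages[:count]
        self.free_pages = self.free_pages[count:]
        return granted


if __name__ == "__main__":
    PAGE_SIZE = 16                                   # assumed, unrealistically small
    ram = SimulatedRam(4 * PAGE_SIZE, {37: 0x04})    # one stuck bit at address 37
    allocator = PageAllocator(num_pages=4)
    for page in pattern_test(ram, PAGE_SIZE):
        allocator.mark_faulty(page)
    print("faulty pages:", sorted(allocator.faulty_pages))
    print("allocated:", allocator.allocate(3))
```

In this sketch the fault at address 37 falls in page 2, so the allocator satisfies a three-page request from the remaining healthy pages and refuses any further request.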
It is also possible for memory tests to be implemented as application programs. These implementations generally do not have as much access to the memory and system resources as memory tests that are integrated into the OS, particularly into the kernel of the OS.
Some computer systems use memory that can correct errors through the use of an error-correcting code (ECC). Each ECC has a strength that indicates the number of bit errors in a unit of data read from memory that the code can correct. When a unit of data contains more errors than the code can correct, the errors cannot be corrected. The ECC may nonetheless reveal that an error occurred, so that additional faulty pages may be identified while the operating system is running.
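The distinction between correcting and merely detecting errors can be made concrete with a small sketch. The code below uses an extended Hamming (8,4) code, chosen here only as a convenient illustration and not taken from the text above: it has a strength of one (any single bit error in a codeword is corrected), while two bit errors exceed that strength and are only detected. The function names `encode` and `decode` and the 4-bit data unit are assumptions; ECC memory typically protects much wider words in hardware.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of bits).

    Index 0 holds an overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with parity bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword;
    # after a single flip it equals the position of the flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd total parity is treated as one flipped bit, located by the
        # syndrome (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped, which is
        # beyond the strength of this code. Detected but not corrected.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_error = list(word)
    one_error[5] ^= 1
    two_errors = list(one_error)
    two_errors[2] ^= 1
    print(decode(word))
    print(decode(one_error))
    print(decode(two_errors))
```

An operating system that receives the "uncorrectable" result for a read could, under the scheme described above, record the affected page as faulty even though the data itself cannot be recovered.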