This invention relates to computer operation, and more particularly to a method of handling memory errors in computers.
Semiconductor memory devices used in computers, particularly dynamic RAMs or DRAMs, are manufactured using cell sizes and cell densities such that the devices are susceptible to alpha particle failure. Packaging for the memory devices inevitably contains radioactive elements which upon decay emit alpha particles that penetrate the silicon die. An alpha particle hit can cause a cell to switch state. To circumvent this type of error, ECC (error checking and correcting) circuits are used in computer memory systems. An ECC circuit adds a check code to each block of data as it is stored, calculated on the basis of the data being stored. When this data is later read, the stored code is checked against the read data, and if there is a single erroneous bit within this data the bit is corrected before the data is sent to the CPU. If more than one bit is bad, however, the data is uncorrectable, and an error fault is signalled. Thus, transient single-cell errors prevalent in DRAMs due to alpha particle hits can be tolerated in a computer system, since the occurrence of this type of error is virtually transparent to the executing application.
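The store-time check code and read-time correction described above can be illustrated with a minimal sketch. It assumes a Hamming(7,4) code extended with an overall parity bit, which corrects one bad bit and detects two; the text does not say which code an actual memory system uses, and the 4-bit block size and the names `encode` and `decode` are chosen only for illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit word: Hamming(7,4) plus overall parity."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # check bit covering positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # check bit covering positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # check bit covering positions 4,5,6,7
    for bit in code[1:]:
        code[0] ^= bit
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1             # single-bit error: syndrome names the bad position
        status = "corrected"
    else:
        return None, "uncorrectable"    # two bits bad: detected but not correctable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one element of the list returned by `encode` still yields the original value from `decode`, with status `corrected`; flipping two elements yields `uncorrectable`, which corresponds to the error fault mentioned above.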
It has been the practice in operating computer systems to record and report errors occurring during operation. This recording takes the form of an error log. Errors corrected by ECC circuits used in memory and at other points in the system do not prevent the system from continuing to function properly, but such errors may nevertheless indicate components that are likely to fail catastrophically and thus should be replaced. However, correctable errors caused by alpha particle hits do not provide useful information from this standpoint, since replacement of the memory devices showing alpha hits would not cure or reduce the hits in the future. Among correctable read errors occurring in DRAMs, alpha hits are estimated to be one hundred times more prevalent than hard errors. Thus, it is desirable to distinguish correctable read errors caused by alpha particle hits from other types of errors in recording and reporting on the operation of a computer system.
It has previously been the practice to "scrub" memory locations that show correctable errors such as those produced by alpha particle hits in DRAMs. By scrubbing is meant that the memory location is copied onto itself, with any errors being corrected by ECC circuits, so a transient error is eliminated. Memory subsystems are commonly used which have the ability to perform the memory scrubbing operation independently, transparent to the CPU. In such case, the memory subsystem contains the hardware necessary to detect that an ECC error has occurred, to note the address, and to generate a read and write operation to this location. Alternatively, a scrub operation can be implemented by the CPU as part of its operating system. When implemented by the memory subsystem this is referred to as hardware scrubbing, and when implemented by the CPU via its operating system this is referred to as software scrubbing.
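A scrub operation of the kind described, in which a location is read through the correcting logic and the corrected value written back, might be sketched as follows. This is an illustrative model only: a triple-copy majority vote stands in for a real ECC circuit, and `ToyMemory` and `scrub` are hypothetical names, not part of any actual memory subsystem or operating system.

```python
class ToyMemory:
    """Each word is stored three times; a majority vote stands in for ECC."""

    def __init__(self, size):
        self.cells = [[0, 0, 0] for _ in range(size)]

    def write(self, addr, value):
        self.cells[addr] = [value, value, value]

    def read(self, addr):
        a, b, c = self.cells[addr]
        voted = (a & b) | (a & c) | (b & c)   # bitwise majority of the three copies
        corrected = not (a == b == c)          # any disagreement means a correction occurred
        return voted, corrected


def scrub(mem, addr):
    """Copy a location onto itself so a transient error does not persist."""
    value, corrected = mem.read(addr)
    if corrected:
        mem.write(addr, value)                 # the write-back stores the corrected data
    return corrected
```

After a simulated hit on one copy, `scrub` returns `True` and a subsequent `read` of the same address reports no correction. The same read-then-write sequence applies whether it is issued by the memory subsystem or by the operating system.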
A virtual memory type of operating system such as VMS™ or UNIX™ has the ability to replace a page frame number when a particular page of physical memory produces error faults. When a read is attempted and is unsuccessful due to parity errors or the like, the operating system can copy the page frame to another page frame number in physical memory (another physical location) and put the old page on a "bad memory" list so it will not thereafter be used by the computer.
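The page replacement just described can be sketched in a few lines. The structures below (a page table held as a dictionary, a free list and a bad list) and the names `PageManager`, `map_page` and `replace_frame` are assumptions made for illustration and do not reflect how VMS or UNIX actually organize these data. How the contents of a failing location are recovered is outside the sketch.

```python
class PageManager:
    """Toy model of physical page frames, a page table, and a bad-memory list."""

    def __init__(self, num_frames, page_size=4):
        self.frames = [[0] * page_size for _ in range(num_frames)]
        self.free = list(range(num_frames))   # page frame numbers available for use
        self.bad = []                          # page frame numbers retired from use
        self.page_table = {}                   # virtual page number -> page frame number

    def map_page(self, vpn):
        pfn = self.free.pop(0)
        self.page_table[vpn] = pfn
        return pfn

    def replace_frame(self, vpn):
        """Move a virtual page to a fresh frame and retire the faulty one."""
        old = self.page_table[vpn]
        new = self.free.pop(0)
        self.frames[new] = list(self.frames[old])  # copy contents to the new location
        self.page_table[vpn] = new                  # remap the virtual page
        self.bad.append(old)                        # the old frame is never reallocated
        return new
```

Because the retired frame is placed on `bad` and never returned to `free`, later calls to `map_page` cannot hand it out again.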
The previous ways of operating computer systems with regard to memory errors have thus included scrubbing memory locations which exhibit read errors correctable by ECC circuitry, and replacing page frame numbers for memory locations that report non-correctable errors. In both cases the occurrence of errors is logged so that field replacement of faulty components, or those likely to fail, is facilitated.