A typical computer system includes main memory hardware in which programs and data are stored. While the computer system is running, a module (such as a chip or a dual inline memory module, DIMM) within main memory may become defective. Since this memory module forms part of the system's overall memory address space, such a memory module failure will most likely result in a loss of system data. Various methods and algorithms for detecting and possibly repairing data corruption due to hardware failure are known, such as ECC (Error Correcting Code) and CRC (Cyclic Redundancy Check).
Once a hardware failure in one of the memory modules is detected, a high-level system exception (interrupt) is issued indicating the need for attention. Such a hardware interrupt causes the computer system's processor to transfer control to an exception handler. The function of an exception handler depends on the reason the interrupt was generated. The exception handler is accessed via an exception vector which is specific to the error encountered. Depending on the computer system's basic architecture, this exception vector corresponds either to the memory address of the exception handler or to an index into an array called the exception vector table, which contains the memory addresses of exception handlers.
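The two ways of interpreting an exception vector described above can be illustrated with a minimal simulation. This is only a sketch: the addresses, handler names, and the dictionary standing in for memory are hypothetical and do not correspond to any particular architecture.

```python
from typing import Callable, Dict, List

Handler = Callable[[], str]


def memory_error_handler() -> str:
    return "handled memory error"


def timer_handler() -> str:
    return "handled timer interrupt"


# Hypothetical memory image: address -> handler code stored at that address.
MEMORY: Dict[int, Handler] = {
    0x1000: memory_error_handler,
    0x2000: timer_handler,
}

# Hypothetical exception vector table: index -> address of the handler.
VECTOR_TABLE: List[int] = [0x1000, 0x2000]


def dispatch_direct(vector: int) -> str:
    """Scheme 1: the vector is itself the memory address of the handler."""
    return MEMORY[vector]()


def dispatch_indexed(vector: int) -> str:
    """Scheme 2: the vector is an index into the exception vector table."""
    handler_address = VECTOR_TABLE[vector]
    return MEMORY[handler_address]()
```

In this sketch, `dispatch_direct(0x1000)` and `dispatch_indexed(0)` reach the same handler; only the meaning of the vector differs.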
The exception handler corresponds to a piece of code which is installed and stored in memory during the computer system's startup procedure. This (standard) mechanism for exception handling jeopardizes the reliability of the system, for the following reasons:
First, the exception handler code that is to be used for handling a given memory failure is stored in a region of memory which is itself subject to errors. If the exception handler resides in an address range of the memory module which exhibited the error, and if this memory module error is uncorrectable, the corresponding exception cannot be handled. In such a case, the computer system will detect a condition that cannot be resolved and which prevents normal operation. As a consequence, the computer system will typically shut down all processor clocks immediately, stop executing instructions, stop responding to interrupts, etc. This (clearly undesirable) state is commonly referred to as a checkstop.
Second, while this problem could in principle be solved by storing the exception handler in a memory region which is regarded as more reliable (such as on-chip SRAM (static random access memory), Flash ROM (read only memory) or cache), such memory is very expensive, and areas of safe memory are therefore very limited in size. For exception handling in a computer, the memory area typically reserved for handling a given exception type accommodates only small pieces of code and is immediately adjacent to an area corresponding to a different exception type. On the other hand, exception handlers should comprise a set of routines that provide for a graceful termination of the computer system (such as collecting checkpoint information, securing the most vital system data, collecting debug and analysis data, etc.). This requires a larger storage space which is usually only available in general (unsafe) memory. Thus, the code stored in the safe memory area pertaining to a given exception type is generally no more than a branch to another (unsafe) region in memory in which the exception handler is stored. This brings about the risks described above.
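The failure mode described above can be sketched as a small simulation. It assumes a hypothetical system with two memory modules, `dimm0` and `dimm1`, where the safe exception slot holds only a branch target pointing at a full handler stored in `dimm0`. All names and addresses are illustrative assumptions, not part of any real platform.

```python
class Checkstop(Exception):
    """Simulated checkstop: the exception cannot be handled."""


class MemoryModule:
    """Hypothetical memory module holding code at given addresses."""

    def __init__(self, name):
        self.name = name
        self.failed = False
        self.code = {}

    def fetch(self, address):
        # Code cannot be fetched from a module with an uncorrectable error.
        if self.failed:
            raise Checkstop(f"handler unreachable in failed module {self.name}")
        return self.code[address]


def full_handler():
    # Stands in for the larger graceful-termination routines.
    return "checkpoint saved, debug data collected"


def build_system():
    modules = {"dimm0": MemoryModule("dimm0"), "dimm1": MemoryModule("dimm1")}
    # The full handler is too large for safe memory, so it lives in dimm0.
    modules["dimm0"].code[0x4000] = full_handler
    return modules


# The safe slot contains nothing but a branch to (module, address).
SAFE_SLOT_BRANCH = ("dimm0", 0x4000)


def handle_memory_failure(modules, failed_module):
    """Mark a module as failed, then follow the branch from the safe slot."""
    modules[failed_module].failed = True
    module_name, address = SAFE_SLOT_BRANCH
    handler = modules[module_name].fetch(address)
    return handler()
```

In this sketch, a failure in `dimm1` is handled normally, whereas a failure in `dimm0` (the module that holds the handler) raises `Checkstop`, even though the branch instruction itself sits in safe memory.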
There is therefore a need to make exception handling more reliable. U.S. Pat. No. 7,321,990 B2 describes a method of improving system reliability by self-migrating system software away from a faulty memory location at the time of failure. However, the migration handler itself may reside in a faulty memory location, in which case self-migration will fail for the reasons explained above. Moreover, the failing memory module may already be too corrupt to provide a copy for migration. Also, the method described in U.S. Pat. No. 7,321,990 B2 relies heavily on the concept of the x86 SMRAM and can thus only be applied to a limited range of computer architectures.