A fatal error can be defined as an error that causes a loss of service due to the failure of a hardware component. The loss of service may be temporary, for example when it results from transient errors in the system. A fatal error can also be defined as an uncorrectable hardware error, meaning an error that cannot be corrected at the time it occurs. Fatal errors often cause the entire system to reboot, which gives the user no opportunity to recover from the error.
Known systems provide various levels of error handling. For example, in the HP-UX operating system running on the HP PA-RISC platform, errors that need immediate attention are handled by the High Priority Machine Check (HPMC) handler. Program code referred to as the OS_HPMC routine can be called to perform error logging and recovery. The purpose of this routine is essentially to perform a system dump and to carry out whatever recovery is possible.
However, existing systems do not demonstrate a full self-healing ability in the presence of fatal errors.