Mainframes and server systems used in mission-critical environments are often designed to run for long periods without faults that cause erroneous output or system downtime. The need for more robust systems is increasing as system architectures become more complex. Even desktop systems are now built with complex system interconnects and multiple processor cores.
System traits that keep systems running for lengthy periods with minimal downtime include reliability, availability, and serviceability (collectively, “RAS”). Many features, called RAS features, are built into systems to improve these traits. Among these are parity checks for memory components and buses, redundant system resources and components, parts that are more resistant to failure, temperature sensors that detect and respond to increased processor temperatures in real time, the ability to hot swap components, and many other features.
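As an illustration of one of the RAS features listed above, the following is a minimal sketch of an even-parity check of the kind that may protect a memory word or a bus transfer. The function names (`even_parity_bit`, `store_with_parity`, `parity_ok`) are hypothetical and chosen only for this example; actual hardware computes parity in logic rather than software.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total count of 1 bits (word + parity) even."""
    return bin(word).count("1") % 2


def store_with_parity(word: int) -> tuple[int, int]:
    """Model a write: keep the data word together with its parity bit."""
    return word, even_parity_bit(word)


def parity_ok(word: int, parity: int) -> bool:
    """Model a read: recompute parity and compare it with the stored bit."""
    return even_parity_bit(word) == parity
```

A single parity bit detects any odd number of flipped bits in the word but cannot detect an even number of flips and cannot correct either case, which is one reason it is typically combined with other RAS features.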
One set of RAS features increases a system's ability to detect and respond to an imminent failure of a system component without a system crash or a need to shut down the system. Early detection of an imminent failure may give the system sufficient time for a response that avoids downtime.
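The text does not specify how an imminent failure is detected. One possible approach, shown here purely as an assumption, is to count correctable errors reported for a component within a sliding time window and to flag the component once the count reaches a threshold. The class name `CorrectableErrorMonitor` and its parameters are hypothetical.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Flags a component when correctable errors cluster in time (assumed heuristic)."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold            # errors needed to raise a warning
        self.window_seconds = window_seconds  # how far back errors are counted
        self._timestamps: deque[float] = deque()

    def record_error(self, now: float) -> bool:
        """Record one correctable error at time `now`; return True if failure looks imminent."""
        self._timestamps.append(now)
        # Discard errors that have fallen out of the sliding window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

Under this sketch, isolated errors spread far apart never trigger a warning, while a burst of errors does, which may leave the system time to respond before the component actually fails.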
Once an imminent failure of a system component is detected, an important RAS feature is the system's ability to respond without system downtime or the generation of faulty output. In some systems, this ability is provided by system firmware. At times, the firmware may temporarily assume control of the system to respond to a threat.