1. Field of the Invention
The present invention relates to techniques for enhancing availability and reliability within computer systems. More specifically, the present invention relates to a method and an apparatus for replacing a signal from a failed sensor in a computer system with an estimated signal derived from correlations with other instrumentation signals in the computer system.
2. Related Art
As electronic commerce becomes increasingly prevalent, businesses are relying more heavily on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure high availability in such enterprise computing systems.
To achieve high availability in enterprise computing systems, it is necessary to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. Some high-end servers, which cost over a million dollars each, contain hundreds of physical sensors that measure temperatures, voltages and currents throughout the system. These sensors protect the system by detecting when a parameter is out of bounds and, if necessary, shutting down a component, a system board, a domain, or the entire system. This is typically accomplished by applying threshold limits to signals received from the physical sensors. In this way, if a temperature, a current or a voltage strays outside of an allowable range, an alarm can be activated and protective measures can be taken.
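The threshold-limit approach described above can be illustrated with a minimal sketch. The names `Limits` and `check_threshold`, and the temperature range shown, are hypothetical and chosen only for illustration; they do not describe any particular server's monitoring software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Allowable range for one sensor signal (hypothetical structure)."""
    low: float
    high: float


def check_threshold(reading: float, limits: Limits) -> str:
    """Return 'ok' if the reading lies inside [low, high], else 'alarm'."""
    if limits.low <= reading <= limits.high:
        return "ok"
    return "alarm"


# Assumed limits for a temperature sensor, in degrees Celsius.
cpu_temp_limits = Limits(low=5.0, high=85.0)

# Three sample readings; only the last strays outside the allowable range.
readings = [42.0, 61.5, 91.2]
statuses = [check_threshold(r, cpu_temp_limits) for r in readings]
print(statuses)
```

In a real system an 'alarm' result would trigger a protective measure, such as shutting down a component or a domain; the sketch stops at classifying each reading.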
Unfortunately, sensors occasionally fail in high-end servers. In fact, it is often the case that the physical sensors have a shorter mean-time-between-failure (MTBF) than the assets they are supposed to protect. Degrading sensors can cause domains or entire systems to shut down unnecessarily, which adversely affects system availability, as well as the customer quality index (CQI) and the customer loyalty index (CLI). An even worse scenario arises when a sensor fails “stupid,” a term used to describe failure modes in which a sensor gets stuck at or near its mean value reading, but no longer responds to the physical variable it is supposed to measure. Because the stuck reading remains within the allowable range, no threshold limit test can detect this type of failure mode. Furthermore, if there is a thermal event or other system upset, the dead sensor provides no protection, and significant damage may occur to an expensive asset, followed by a serious system outage.
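The following sketch illustrates why a threshold limit test cannot detect a sensor that is stuck near its mean value. All values and names (`within_limits`, `LOW`, `HIGH`, the temperature sequence) are assumptions made for illustration only.

```python
def within_limits(reading: float, low: float, high: float) -> bool:
    """Simple threshold limit test on a single reading."""
    return low <= reading <= high


# Assumed allowable temperature range, in degrees Celsius.
LOW, HIGH = 5.0, 85.0

# Hypothetical true temperatures during a thermal event.
true_temps = [50.0, 60.0, 75.0, 90.0, 105.0]

# A sensor that has failed "stupid" keeps reporting its mean value
# regardless of the physical variable it is supposed to measure.
stuck_value = 50.0
stuck_readings = [stuck_value for _ in true_temps]

# A healthy sensor tracks the true temperature and eventually alarms.
healthy_alarms = [not within_limits(t, LOW, HIGH) for t in true_temps]

# The stuck sensor never leaves the allowable range, so it never alarms.
stuck_alarms = [not within_limits(r, LOW, HIGH) for r in stuck_readings]

print(healthy_alarms)
print(stuck_alarms)
```

Under these assumed values, the healthy sensor raises an alarm once the temperature exceeds the upper limit, whereas the stuck sensor passes every test even as the true temperature climbs well beyond the allowable range.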
Hence, what is needed is a method and an apparatus that handles a failed sensor in a computer system without unnecessarily shutting down the computer system, and without exposing the computer system to the risk of damage.