1. Field of the Invention
The present invention relates to techniques for enhancing availability and reliability within computer systems. More specifically, the present invention relates to a method and an apparatus for detecting thermal anomalies in computer systems based on correlations between instrumentation signals.
2. Related Art
Large eCommerce servers are increasingly being used in business-critical applications where data center outages can cost hundreds of thousands of dollars per minute. Unfortunately, large servers have large power appetites. For example, some next-generation servers can consume up to 40 kilowatts of power. This power is ultimately converted to heat, which must be removed efficiently by continuous cooling. If internal components within the server are not kept sufficiently cool, failure mechanisms can accelerate, thereby degrading long-term system reliability and availability.
Most existing high-end servers are air cooled. One cause of problems in such servers is air-flow disturbances, which may be caused by a number of factors, including: obstructions at the inlet of the cooling-air intake; local obstructions inside the machine; a machine being moved slightly to an off-center position above a raised-floor cooling channel output (this output is supposed to mate approximately to the server's inlet channel); obstructions inside the raised-floor AC channel (for example, caused by someone routing new cables through the raised-floor channels); long-term fouling of air filters; or problems with individual fans, which are deployed to pull cold air into and through the server.
Some high-end servers include numerous temperature sensors to protect the servers from over-temperature events. These sensors are configured to shut down system boards, domains, or the entire machine if temperatures exceed a threshold value, such as 80 C. This type of temperature protection mechanism can effectively protect systems from acute over-temperature events. However, such mechanisms are considerably less effective in protecting against the cumulative effects of lower-level temperature variations, which can significantly degrade long-term system reliability.
Existing thermal protection mechanisms lack the sensitivity to detect local airflow perturbations. For example, a common source of problems for high-end servers is a piece of scrap paper being sucked against the cooling-air intake grille at the bottom of a server. This type of airflow obstruction can cause reliability problems, but will generally not be detected by existing thermal protection mechanisms, which are configured to have high threshold values.
Current environmental protection circuits are configured with high thresholds for a reason: when dealing with noisy process variables, thresholds that are set too low generate a large number of false alarms from spurious data values. Note that data center ambient temperatures can vary by as much as 10 C just from normal HVAC cycling, and internal temperatures can vary even more as system load patterns vary. To avoid the possibility of nuisance shutdowns from false alarms, environmental sensors are typically configured with high threshold values that will protect the server from significant over-temperature events, but will be insensitive to more subtle perturbations from obstructing mechanisms, such as those described above. These latter perturbations, although insufficient to shut down a server, can nevertheless diminish the long-term reliability of the server because of cumulative thermal stresses.
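The threshold trade-off described above can be illustrated with a minimal sketch. It is a hypothetical simulation, not a description of any actual protection circuit. The 80 C shutdown threshold and the 10 C cycling swing are taken from the discussion above. The 50 C baseline, the sinusoidal shape of the cycling, the 4 C rise attributed to an airflow obstruction, the 54 C "low" threshold, and the function names are all assumptions chosen only for illustration.

```python
import math


def simulate_temperatures(n_samples=200, baseline_c=50.0, swing_c=10.0,
                          period=40, offset_c=4.0, offset_start=100):
    """Return a synthetic temperature trace in degrees C.

    All parameter values are hypothetical. The trace is a sinusoid with a
    peak-to-peak swing of swing_c (standing in for HVAC cycling), plus a
    constant rise of offset_c from sample offset_start onward (standing in
    for a partial airflow obstruction).
    """
    trace = []
    for i in range(n_samples):
        cycling = (swing_c / 2.0) * math.sin(2.0 * math.pi * i / period)
        obstruction = offset_c if i >= offset_start else 0.0
        trace.append(baseline_c + cycling + obstruction)
    return trace


def count_alarms(trace, threshold_c, start=0, stop=None):
    """Count the samples in trace[start:stop] that exceed threshold_c."""
    return sum(1 for t in trace[start:stop] if t > threshold_c)


trace = simulate_temperatures()

# High threshold: the obstruction never trips an alarm.
high_alarms = count_alarms(trace, 80.0)

# Low threshold: alarms fire even before the obstruction begins,
# purely because of normal cycling.
false_alarms = count_alarms(trace, 54.0, stop=100)

print("alarms at 80 C over the whole trace:", high_alarms)
print("alarms at 54 C before the obstruction:", false_alarms)
```

Under these assumed values, the obstructed trace peaks well below 80 C, so the high threshold remains silent, while the 54 C threshold is already exceeded by normal cycling alone. Neither fixed threshold separates the obstruction from ordinary variation.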
Hence, what is needed is a method and an apparatus that detects a thermal anomaly in a computer system without unnecessarily shutting down the computer system, and without subjecting the computer system to cumulative thermal stress.