1. Field of the Invention
The present invention relates to techniques for enhancing the reliability of computer systems. More specifically, the present invention relates to a method and an apparatus for large-scale, simultaneous validation of sensor operability within a computer system to enhance availability, quality of service and/or security.
2. Related Art
As electronic commerce grows increasingly prevalent, businesses are relying more heavily on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure high availability in such enterprise computing systems.
To achieve high availability in enterprise computing systems, it is necessary to be able to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. If a system has too little event monitoring, service engineers may be unable to quickly identify the source of a problem when it arises at a customer site. This can lead to increased down time, which can adversely impact customer satisfaction and loyalty.
Fortunately, high-end computer servers, such as those manufactured by Sun Microsystems, Inc. of Santa Clara, Calif., are now equipped with over 1000 sensors that measure variables such as temperature, voltage, current, vibration, and acoustics. These sensors are accurately calibrated during manufacturing and operational testing. However, after a server is shipped to a customer, there is presently no method to validate that these sensors are still operating within their desired specifications. Furthermore, the sensors often have a shorter mean time between failures (MTBF) than the computer systems they protect.
The effect of sensor failures or sensor degradation can be very costly in enterprise computing centers. If a sensor fails (i.e., it is no longer sensing the variable it monitors), the server protected by that sensor becomes potentially susceptible to severe failure modes (for example, over-temperature or over-voltage events). On the other hand, if sensors gradually drift out of calibration, or if they lose their dynamic response capability, then the resultant drift in the monitored value of the corresponding variable can cause system boards, components, or entire servers to shut down prematurely from “false alarm” events.
Manual recalibration of sensors is a labor-intensive and costly process that requires the server to be brought down, and consequently reduces datacenter availability. Moreover, large-scale recalibration of sensors in large servers is impractical because of the difficulty in removing all system boards and components from the chassis. Finally, if the chassis needs to be opened for an engineer to calibrate even one sensor, doing so is likely to cause “maintenance induced failures,” i.e., the maintenance procedures can disturb other interconnects or components and increase the probability that they subsequently fail.
Hence, what is needed is a method and an apparatus for validating the operability of sensors in a computer system without the need to open the chassis.