1. Field of the Invention
Embodiments of the present invention relate to techniques for enhancing the availability and reliability of computer systems. More specifically, embodiments of the present invention relate to a technique for using a length-of-the-curve stress metric to characterize computer system reliability.
2. Related Art
Components in a computer system commonly experience dynamic fluctuations in temperature during system operation. These fluctuations can be caused by: changes in load; fluctuations in ambient air temperature (e.g., HVAC cycling in a datacenter); changes in fan speed; or reconfiguration of components in the computer system that affects air distribution patterns inside the computer system.
To ensure reliability, computer system designers typically qualify new components over an expected operational profile for the anticipated life of the computer system (e.g., 5 to 7 years). In addition, designers usually specify a maximum operating temperature for a given component, and some systems include shutdown actuators to prevent the components from exceeding the maximum operating temperature as a result of system upset conditions (e.g., failure of a fan motor, air conditioning failure, air filter fouling, etc.).
However, merely preventing excessive temperatures is not sufficient. It is well known that components may also experience accelerated degradation as a result of thermal cycling within an acceptable temperature range. Unfortunately, there are currently no effective techniques for monitoring the cumulative stress from thermal cycling during the life of a system in the field. Some computer systems monitor simple parameters such as power-on hours (POH) and the maximum temperature achieved. However, these metrics are of limited use in predicting the degradation of computer system components. For example, a monitoring system using these metrics alone may assign equal failure probabilities to a component that was operated for 1000 hrs. at a constant temperature of 25° C. with a single spike to 85° C., and to another component that was cycled hourly between 25° C. and 85° C. for 1000 hrs. Reliability studies show that the latter component will have a much higher probability of failure.
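The limitation described above can be illustrated with a minimal sketch. It assumes hourly temperature samples, places the single spike at an arbitrary hour, and interprets "cycled hourly" as alternating between the two temperatures on successive samples. The function names are hypothetical, and `cumulative_swing` (the sum of absolute sample-to-sample temperature changes) is used here only as an illustrative stand-in for thermal-cycling exposure, not as the metric defined by any particular embodiment.

```python
def power_on_hours(samples, hours_per_sample=1.0):
    """Power-on hours (POH), assuming one temperature sample per hour."""
    return len(samples) * hours_per_sample


def max_temperature(samples):
    """Maximum temperature observed over the profile, in degrees C."""
    return max(samples)


def cumulative_swing(samples):
    """Sum of absolute sample-to-sample temperature changes, in degrees C.

    Illustrative proxy for thermal-cycling exposure (an assumption of
    this sketch).
    """
    return sum(abs(b - a) for a, b in zip(samples, samples[1:]))


# Component A: 1000 hours at a constant 25 C with a single spike to 85 C.
# The position of the spike (hour 500) is arbitrary.
profile_a = [25.0] * 1000
profile_a[500] = 85.0

# Component B: 1000 hours alternating between 25 C and 85 C every hour.
profile_b = [25.0 if hour % 2 == 0 else 85.0 for hour in range(1000)]

# The two simple metrics cannot tell the components apart.
summary_a = (power_on_hours(profile_a), max_temperature(profile_a))
summary_b = (power_on_hours(profile_b), max_temperature(profile_b))

# A cycling-sensitive quantity separates them clearly.
swing_a = cumulative_swing(profile_a)
swing_b = cumulative_swing(profile_b)

print("POH and max temperature identical:", summary_a == summary_b)
print("Cumulative swing, component A:", swing_a)
print("Cumulative swing, component B:", swing_b)
```

Under these assumptions both components report 1000 power-on hours and a maximum temperature of 85° C., while the cumulative temperature swing differs by more than two orders of magnitude, which is the distinction that POH and maximum temperature alone fail to capture.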
Hence, what is needed is a method and apparatus for characterizing computer system reliability without the above-described problems.