1. Field of the Invention
The present invention relates to techniques for diagnosing causes of problems within computer systems. More specifically, the present invention relates to a method and an apparatus that facilitate determining the effects of temperature variations within a computer system while the computer system is operating.
2. Related Art
As electronic commerce grows more prevalent, businesses are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business.
When enterprise computing systems fail, the cause is often a system hardware failure. During such failures, it is common for components, subsystems, or entire servers to indicate they have failed by either “crashing” or otherwise halting processing, with or without writing failure messages to a system log file. “No-Trouble-Found” (NTF) events arise when a service engineer is dispatched to repair a failed server (or the failed server is returned to the manufacturer) and the server runs normally with no indication of a problem. NTF events are very costly because system boards (possibly costing hundreds of thousands of dollars) may need to be replaced. Furthermore, it is embarrassing not to be able to determine the root cause of a problem, and customers are generally happier when a root cause can be determined.
In many cases, NTF events arise through intermittent failure mechanisms in hardware components. Some of these intermittent hardware faults coincide with small variations in the internal temperature of the servers. There are several theoretical explanations for such behavior, including changes in mechanical stresses, delamination of bonded components, thermal expansion effects on interconnects and soldered joints, exacerbation of microscopic electrostatic discharge effects, and other component reliability phenomena that are affected by temperatures, temperature gradients, and temperature cycling.
Hence, what is needed is a method and an apparatus that facilitate determining the causes of problems that arise from or are accelerated by temperature variations in a computer system.