1. Field of the Invention
The present invention relates to techniques for detecting and diagnosing the causes of anomalies within computer systems. More specifically, the present invention relates to a method and an apparatus that facilitates identifying the mechanisms responsible for “no-trouble-found” (NTF) events in computer systems.
2. Related Art
As electronic commerce becomes increasingly prevalent, businesses are relying more heavily on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business.
When enterprise computing systems fail, it is often due to an intermittent failure. During such failures, it is common for components, subsystems, or entire servers to indicate they have failed by either “crashing” or otherwise halting processing, with or without writing failure messages to a system log file. “No-Trouble-Found” (NTF) events arise when a service engineer is dispatched to repair a failed server (or the failed server is returned to the manufacturer) and the server runs normally with no indication of a problem. NTF events constitute a huge cost because large components, such as system boards (possibly costing in excess of a hundred thousand dollars), may need to be replaced. Furthermore, it is embarrassing not to be able to determine the root cause of a problem, and customers are generally happier when a root cause can be determined. This gives the customer some assurance that the root cause has been corrected, and is therefore not likely to cause a further disruption in the customer's business.
In high-end computing servers there is an extremely complex interplay of dynamical performance parameters that characterize the state of the system. For example, in high-end servers, these dynamical performance parameters can include system performance parameters, such as parameters having to do with throughput, transaction latencies, queue lengths, load on the CPU and memories, I/O traffic, bus-saturation metrics, and FIFO overflow statistics. They can also include physical parameters, such as distributed internal temperatures, environmental variables, currents, voltages, and time-domain reflectometry readings. They can additionally include “canary variables” associated with synthetic user transactions periodically generated for performance measuring purposes. Although it is possible to sample all of these performance parameters, it is by no means obvious what signal characteristic, “signature,” or pattern among multiple performance parameters may accompany or precede NTF events.
Existing systems sometimes place “threshold limits” on specific performance parameters. However, placing a threshold limit on a specific performance parameter does not help in determining a more complex pattern among multiple performance parameters that may be associated with an NTF event. Furthermore, threshold limits are not effective in capturing errors that are caused by a stuck sensor, because a stuck sensor reports an unchanging value that remains within its limits and therefore does not trigger a threshold limit.
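By way of illustration only, the following minimal Python sketch shows why a fixed threshold limit can miss a stuck sensor. The function names, the sample temperature readings, and the limits of 10 and 85 are hypothetical and are assumed here purely for the example; they are not taken from any existing system. The flat-line check is likewise only one simple way such a condition might be noticed, not the technique of the present invention.

```python
def exceeds_threshold(samples, low, high):
    """Return True if any sample falls outside the limits [low, high]."""
    return any(s < low or s > high for s in samples)


def looks_stuck(samples, window=5, tolerance=1e-9):
    """Return True if the last `window` samples are essentially identical.

    This is a hypothetical flat-line check, assumed for illustration only.
    """
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return max(recent) - min(recent) <= tolerance


# Hypothetical temperature readings (degrees C) from two sensors.
healthy = [41.2, 41.9, 42.4, 41.7, 42.1, 41.5]
stuck = [41.2, 41.9, 42.0, 42.0, 42.0, 42.0, 42.0, 42.0]

# Both signals stay inside the assumed limits, so a threshold check
# cannot tell the stuck sensor apart from the healthy one.
healthy_alarm = exceeds_threshold(healthy, low=10.0, high=85.0)
stuck_alarm = exceeds_threshold(stuck, low=10.0, high=85.0)

# Only the flat-line check distinguishes the two signals.
healthy_flat = looks_stuck(healthy)
stuck_flat = looks_stuck(stuck)
```

In this sketch, the stuck sensor never raises a threshold alarm because its frozen reading lies well within the limits; only a check on how the signal behaves over time separates it from the healthy sensor.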
Hence, what is needed is a method and an apparatus that facilitates detecting and diagnosing the causes of anomalies within computer systems based upon patterns in dynamic performance parameters.