1. Field
The present embodiments relate to techniques for analyzing telemetry data. More specifically, the present embodiments relate to a method and system for analyzing telemetry data using a multivariate sequential probability ratio test.
2. Related Art
As electronic commerce becomes more prevalent, businesses are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is important to ensure high availability in such enterprise computing systems.
To achieve high availability, it is necessary to capture unambiguous diagnostic information that can be used to quickly locate faults in hardware or software. If a system performs too little event monitoring, service engineers may be unable to quickly identify the source of a problem that arises at a customer site. This can lead to increased downtime.
Fortunately, high-end computer servers are now equipped with a large number of sensors that measure physical performance parameters such as temperature, voltage, current, vibration, and acoustics. Software-based monitoring mechanisms also monitor software-related telemetry parameters, such as processor load, memory and cache usage, system throughput, queue lengths, I/O traffic, and quality of service. Typically, special software analyzes the collected telemetry data and issues alerts when there is an anomaly. In addition, it is important to archive historical telemetry data to allow long-term monitoring and to facilitate detection of slow system degradation.
A threshold-based monitoring system is commonly used to detect anomalies in the telemetry data by determining whether each telemetry parameter is operating within a specified range. If the value of a telemetry parameter goes out of range, the threshold-based monitoring system generates an alert. However, threshold-based monitoring systems have a number of drawbacks. In particular, the accuracy of a threshold-based monitoring system may depend heavily on the accuracy of the sensors used to measure system parameters. If a sensor is imperfect and returns noisy signals, the sensor may cause the threshold-based monitoring system to malfunction. Moreover, variations in the sensor manufacturing process may cause measurement differences between sensors. These measurement differences can also cause the threshold-based monitoring system to malfunction.
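As an illustration only, the range check performed by a threshold-based monitoring system can be sketched as follows. The parameter names, limits, and sample values are hypothetical and are not taken from any particular monitoring product; the sketch merely shows how a single noisy reading can trigger an alert even though the underlying component is healthy.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Allowed operating range for one telemetry parameter (hypothetical)."""
    low: float
    high: float


def check_thresholds(sample, limits):
    """Return the names of parameters whose values fall outside their range."""
    alerts = []
    for name, value in sample.items():
        limit = limits[name]
        if not (limit.low <= value <= limit.high):
            alerts.append(name)
    return alerts


# Hypothetical limits for two telemetry parameters.
limits = {
    "cpu_temp_c": Threshold(low=10.0, high=85.0),
    "core_voltage_v": Threshold(low=0.95, high=1.05),
}

# A healthy reading: both parameters are in range, so no alert is raised.
healthy_sample = {"cpu_temp_c": 62.0, "core_voltage_v": 1.00}

# The same operating point, but the temperature sensor adds a noise spike.
# The component has not degraded, yet the check reports an anomaly.
noisy_sample = {"cpu_temp_c": 62.0 + 25.0, "core_voltage_v": 1.00}

healthy_alerts = check_thresholds(healthy_sample, limits)
noisy_alerts = check_thresholds(noisy_sample, limits)
```

In this sketch the alert on the noisy sample is a false alarm: the check has no way to distinguish sensor noise from an actual out-of-range condition.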
Such drawbacks may be partially overcome by setting wide thresholds, which reduce the incidence of false alarms in threshold-based monitoring systems. Unfortunately, wide thresholds may delay the detection of failure conditions until degradation has reached an advanced stage. Late detection of degradation may further preclude the use of preventive maintenance and force a shutdown of the computer system for maintenance, resulting in lost productivity and business.
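The trade-off between false alarms and detection delay can be illustrated with a small sketch. The signal below is a hypothetical, noise-free temperature that drifts upward by a fixed amount per sample, and the two upper limits are arbitrary values chosen only to contrast a narrow threshold with a wide one.

```python
def first_alert_index(values, high):
    """Return the index of the first sample above `high`, or None if none is."""
    for index, value in enumerate(values):
        if value > high:
            return index
    return None


# Hypothetical slow degradation: starts at 60.0 and rises 0.5 per sample.
drifting_signal = [60.0 + 0.5 * i for i in range(100)]

# Arbitrary upper limits for illustration.
narrow_high = 65.0  # sensitive, but more prone to false alarms on noisy data
wide_high = 85.0    # tolerant of noise, but slow to react to drift

narrow_alert = first_alert_index(drifting_signal, narrow_high)  # 11
wide_alert = first_alert_index(drifting_signal, wide_high)      # 51
detection_delay = wide_alert - narrow_alert                     # 40 samples
```

Under these assumptions the wide threshold raises its first alert 40 samples later than the narrow one, by which point the drift is already well advanced.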
Hence, what is needed is a technique for early detection of degradation in monitored computer systems.