1. Field
The present embodiments relate to techniques for analyzing telemetry data. More specifically, the present embodiments relate to a method and system for filtering telemetry data through sequential analysis.
2. Related Art
As electronic commerce becomes more prevalent, businesses are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is important to ensure high availability in such enterprise computing systems.
To achieve high availability, it is necessary to be able to capture unambiguous diagnostic information that can quickly locate faults in hardware or software. If systems perform too little event monitoring, service engineers may be unable to quickly identify the source of a problem when it arises at a customer site. This can lead to increased downtime.
Fortunately, high-end computer servers are now equipped with a large number of sensors that measure physical performance parameters such as temperature, voltage, current, vibration, and acoustics. Software-based monitoring mechanisms also monitor software-related performance parameters, such as processor load, memory and cache usage, system throughput, queue lengths, I/O traffic, and quality of service. Typically, special software analyzes the collected telemetry data and issues alerts when there is an anomaly. In addition, it is important to archive historical telemetry data to allow long-term monitoring and to facilitate detection of slow system degradation.
However, the increased collection of telemetry data from computer servers has resulted in higher computational costs associated with analyzing the telemetry data. Such computational costs typically arise from the application of statistical-analysis techniques, including regression analysis and/or estimation techniques, to the telemetry data. While statistical-analysis techniques may allow anomalies in the computer servers to be identified and diagnosed, the computational costs may become unmanageable as increasing numbers of servers and components are deployed and monitored in production, and as an increasing density of sensors is used to monitor the components in each server.
On the other hand, highly available systems may only experience disturbances in performance a small fraction of the time. The vast majority of telemetry data collected from modern computer servers may thus represent normal functioning of the computer servers and display the same statistical quality indicators over time. As a result, constant application of computationally intensive statistical-analysis techniques to identify anomalies in the telemetry data may be both unnecessary and wasteful.
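The reasoning above can be illustrated with a sketch of the general approach. One classical sequential-analysis technique is Wald's sequential probability ratio test (SPRT), which accumulates a log-likelihood ratio over successive telemetry samples and reaches a decision with very little computation per sample; a computationally intensive analysis routine then needs to run only on the rare windows flagged as anomalous. The class below is a minimal, hypothetical sketch, not the embodiments' actual mechanism; the hypothesized means (mu0, mu1), noise level (sigma), and error rates (alpha, beta) are illustrative assumptions.

```python
import math

class SPRTFilter:
    """Cheap sequential gate for a telemetry stream.

    H0: samples ~ N(mu0, sigma^2)  (normal operation)
    H1: samples ~ N(mu1, sigma^2)  (degraded operation)
    Only windows that the test resolves in favor of H1 would be
    handed to an expensive statistical-analysis routine.
    """

    def __init__(self, mu0, mu1, sigma, alpha=0.01, beta=0.01):
        self.mu0, self.mu1, self.sigma = mu0, mu1, sigma
        # Wald's decision boundaries on the cumulative log-likelihood ratio,
        # set from the desired false-alarm (alpha) and miss (beta) rates.
        self.upper = math.log((1 - beta) / alpha)   # decide H1 (anomaly)
        self.lower = math.log(beta / (1 - alpha))   # decide H0 (normal)
        self.llr = 0.0

    def observe(self, x):
        """Feed one sample; return 'anomaly', 'normal', or 'continue'."""
        # Per-sample log-likelihood-ratio increment for Gaussian data:
        # only one multiply-add, so the gate is cheap to run constantly.
        self.llr += (self.mu1 - self.mu0) / self.sigma ** 2 \
                    * (x - (self.mu0 + self.mu1) / 2)
        if self.llr >= self.upper:
            self.llr = 0.0            # reset for the next window
            return "anomaly"          # escalate to expensive analysis
        if self.llr <= self.lower:
            self.llr = 0.0
            return "normal"           # archive cheaply, no further analysis
        return "continue"             # not enough evidence yet


# Illustrative usage: a temperature channel nominally at 50 units.
gate = SPRTFilter(mu0=50.0, mu1=60.0, sigma=5.0)
for sample in (50.2, 49.7, 50.1, 61.0, 60.4, 59.8, 60.9):
    verdict = gate.observe(sample)
    if verdict == "anomaly":
        pass  # here the costly regression/estimation step would run
```

Because samples matching normal operation drive the cumulative ratio toward the lower boundary, the vast majority of the stream is resolved after a handful of single multiply-add updates, which is the computational saving motivating the sequential-filtering approach.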
Hence, what is needed is a mechanism for reducing computational costs associated with analyzing telemetry data from computer servers.