Virtualized infrastructures are widely used to provide large-scale services, which typically involve executing multi-tier applications. Automation is key to managing these large-scale services, where human handling of tasks such as deployment, upgrades, and recovery becomes infeasible. To maintain a smooth flow of services performed by virtual machines running in a virtualized infrastructure where failures are not uncommon, detecting performance anomalies is both a critical and a challenging task. While many routine tasks related to normal operation of a service can be automated, detecting abnormal behavior is complicated because such behavior has no fixed definition.
Performance anomalies fall into two broad categories: complete unavailability and poor quality-of-service (QoS). Many techniques exist for handling the former, in which dead (either crashed or isolated) hosts or virtual machines are detected through network and storage heartbeat-based mechanisms. These techniques work well because unavailable hosts and virtual machines can be easily detected by their lack of response. However, poorly performing virtual machines are more difficult to detect because poor performance of a virtual machine depends on many factors and is not easily definable. Various techniques have been proposed to detect such anomalous virtual machines using reference/prediction mechanisms. These techniques typically use an application model or signature, either developed offline or learned online. Based on this model, the state of the application is determined as either healthy or unhealthy. However, such methods have several drawbacks. Application models are very specific to the application and platform configuration, so they need to be adapted for various execution environments. Further, developing an accurate model of the application may involve a large number of metrics, which may require specialized support from the monitoring infrastructure.
Another technique for handling poorly performing virtual machines involves the notion of “health checking,” in which an agent monitors the health of the virtual machines based on user-specified configurations and marks any virtual machine that does not meet the healthy condition criteria as unhealthy. However, this functionality is quite limited because it requires the users to define what constitutes poorly performing behavior of the virtual machines.
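By way of illustration only, the user-configured health checking described above can be sketched as follows. This is a minimal sketch and does not represent any particular product's agent: the `HealthRule` type, the `unhealthy_vms` function, the metric names (`cpu_ready_pct`, `latency_ms`), the threshold values, and the sample data are all hypothetical and chosen solely for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthRule:
    """A user-specified criterion (hypothetical): metric must not exceed max_value."""
    metric: str
    max_value: float


def unhealthy_vms(samples, rules):
    """Return the sorted names of VMs that violate at least one user rule.

    samples maps a VM name to a dict of metric name -> latest observed value.
    A metric that is missing for a VM is ignored rather than treated as a failure.
    """
    flagged = set()
    for vm, metrics in samples.items():
        for rule in rules:
            value = metrics.get(rule.metric)
            if value is not None and value > rule.max_value:
                flagged.add(vm)
                break  # one violated rule is enough to mark the VM unhealthy
    return sorted(flagged)


# Hypothetical thresholds that a user would have to choose in advance.
rules = [HealthRule("cpu_ready_pct", 10.0), HealthRule("latency_ms", 200.0)]

# Hypothetical latest samples for three VMs.
samples = {
    "vm-a": {"cpu_ready_pct": 2.5, "latency_ms": 40.0},
    "vm-b": {"cpu_ready_pct": 14.0, "latency_ms": 55.0},
    "vm-c": {"cpu_ready_pct": 3.0, "latency_ms": 180.0},
}

print(unhealthy_vms(samples, rules))  # ['vm-b']
```

In this sketch, `vm-c` is not flagged even though its latency is several times that of its peers, because it remains under the user-chosen limit of 200 ms. The example thus reflects the limitation noted above: the outcome depends entirely on thresholds that the user must define beforehand.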