In recent years, cloud computing has emerged as a preferred place for organizations to deploy their applications, store data, and enable remotely located employees and customers to access applications and data storage via the Internet. Cloud computing has also enabled independent cloud computing providers to sell cloud computing services, which allows organizations that purchase these services to decrease time to market while avoiding heavy investment in information technology (“IT”) resources and the associated operating expenses. For example, organizations that choose to run their applications and store data in a cloud computing infrastructure maintained by a cloud computing provider may scale resources according to changing computing and data-storage demands and reduce costs by paying only for the resources and workloads they use.
Physical and virtual cloud computing resources are typically monitored to determine how certain resources perform with respect to different operations. The physical resources include server computers, data-storage devices, networks, and load balancers, and the virtual resources include virtual machines (“VMs”), virtual data-storage devices, and virtual resource pools, such as a specific combination of VMs and virtual data-storage devices. Each resource generates one or more metrics that indicate how often, or how much of, the resource is used over time. For example, typical metrics collected over time include the number of buffer accesses, physical and virtual CPU usage, physical and virtual memory usage, physical and virtual data-storage availability, and electrical power consumption. After multiple metrics have been collected, the metrics may be evaluated to assess and track resource performance. Of particular interest to system administrators is the ability to use the metrics to identify anomalies that occur within the cloud infrastructure. When a metric exceeds or falls below an associated threshold, an alert is typically generated. However, the system administrator may not be able to identify when the problem started or which resource, or group of resources, is responsible for the problem, and therefore may be unable to isolate and terminate that resource or group of resources before catastrophic problems occur. For example, a threshold violation by a metric associated with a server computer may be a good indicator of server computer failure, slowdown, or other problems with the server computer. However, the system administrator does not know whether the problem lies with the server computer itself or is caused by one or more of the VMs running on the server computer.
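The conventional threshold-based alerting described above can be illustrated with a minimal sketch. It assumes, for illustration only, that a metric is a sequence of (timestamp, value) pairs and that each metric has a fixed upper and/or lower threshold; the names `Alert` and `threshold_alerts` are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Alert:
    """Hypothetical alert record: when the violation occurred and its direction."""
    timestamp: float
    value: float
    kind: str  # "above" (exceeds upper threshold) or "below" (falls below lower threshold)


def threshold_alerts(
    samples: Sequence[Tuple[float, float]],
    upper: Optional[float] = None,
    lower: Optional[float] = None,
) -> List[Alert]:
    """Return one alert per sample that exceeds `upper` or falls below `lower`.

    Assumes static thresholds; a threshold left as None is not checked.
    """
    alerts: List[Alert] = []
    for ts, value in samples:
        if upper is not None and value > upper:
            alerts.append(Alert(ts, value, "above"))
        elif lower is not None and value < lower:
            alerts.append(Alert(ts, value, "below"))
    return alerts


if __name__ == "__main__":
    # Hypothetical CPU-usage samples (percent) for one server computer.
    cpu = [(0, 40.0), (1, 55.0), (2, 93.5), (3, 97.0), (4, 60.0)]
    for alert in threshold_alerts(cpu, upper=90.0):
        print(alert)
```

Each alert records only which metric crossed its threshold and when. Nothing in it indicates whether the server computer itself or one of the VMs running on it caused the violation, which is the limitation noted above.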