Large data centers can experience frequent faults, which are a prominent contributor to data center management costs. Data centers can generate considerable monitoring data, but detecting faults and their corresponding root causes from this data is difficult. Typically, this task is performed manually, by observing the data and comparing it against pre-defined thresholds.
Also, a virtualized cloud environment introduces new challenges, including shared resources, dynamic operating environments with migration and resizing of virtual machines (VMs), varying workloads, etc. Because VMs compete for shared cache, disk and network resources, it is challenging to differentiate performance issues due to contention from application faults across several resources. Because the environment is dynamic, it is likewise challenging to differentiate performance anomalies from workload changes.
Virtualization is increasingly used in emerging cloud systems. Occurrences of various anomalies or faults in such large-scale setups contribute significantly to total management costs and degrade the performance of the underlying applications. Accordingly, a need exists for efficient fault management techniques for dynamic shared clouds that can distinguish cloud-related anomalies from application faults.