The present invention relates to unsupervised behavior learning and anomaly prediction in a distributed computing infrastructure. In one particular example, the invention is implemented in an Infrastructure-as-a-Service (“IaaS”) cloud system.
IaaS cloud systems allow users to lease computing resources in a pay-as-you-go fashion. In general, a cloud system includes a large number of computers that are connected through a real-time communication network (e.g., the Internet or a local area network) and can run numerous physical machines (“PMs”) or virtual machines (“VMs”) simultaneously. A VM is a software-based implementation of a computer that emulates the architecture and functions of a real-world computer. Because of their inherent complexity and shared nature, cloud systems are prone to performance anomalies arising from causes such as resource contention, software bugs, and hardware failures. It is difficult for system administrators to manually keep track of the execution status of tens of thousands of PMs or VMs. Moreover, delayed anomaly detection can cause long service delays, which are often associated with large financial penalties.
Predicting and detecting anomalies, faults, and failures is of interest to the computing community because the unexpected failure of a computing system can have numerous negative impacts. Broadly, approaches to failure prediction include supervised learning methods and unsupervised learning methods. Supervised learning methods rely on labeled training data to accurately identify previously known anomalies. Unsupervised learning methods do not require labeled training data to identify anomalies. These methods can be further divided into anomaly detection schemes and anomaly prediction schemes. Anomaly detection schemes identify a failure at the moment of failure, while anomaly prediction schemes try to predict a failure before it occurs.
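By way of illustration only, the distinction between an anomaly detection scheme and an anomaly prediction scheme can be sketched as follows. The sketch is not the method of the invention. It assumes a single hypothetical metric (CPU utilization in percent), a normal-behavior profile learned from unlabeled samples as a mean and standard deviation, and a simple linear extrapolation as the forecasting step. The function names, the threshold factor `k`, and the `lookahead` parameter are all assumptions introduced for the example.

```python
from statistics import mean, stdev


def fit_normal_profile(samples):
    """Learn a profile (mean, standard deviation) from unlabeled samples.

    Assumption: the unlabeled samples mostly reflect normal behavior.
    """
    return mean(samples), stdev(samples)


def is_anomalous(value, profile, k=3.0):
    """Flag a value lying more than k standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma


def detect(series, profile):
    """Detection: index of the first sample that is already anomalous."""
    for i, value in enumerate(series):
        if is_anomalous(value, profile):
            return i
    return None


def predict(series, profile, lookahead=3):
    """Prediction: index of the first sample at which a linear extrapolation
    `lookahead` steps ahead would be anomalous (illustrative forecast only).
    """
    for i in range(1, len(series)):
        slope = series[i] - series[i - 1]
        forecast = series[i] + lookahead * slope
        if is_anomalous(forecast, profile):
            return i
    return None


# Hypothetical unlabeled CPU utilization samples (percent).
training = [50, 52, 48, 51, 49, 50, 53, 47]
profile = fit_normal_profile(training)

# Hypothetical live samples drifting upward toward a failure.
live = [50, 51, 52, 54, 56, 58, 60]

print("detection alarm at sample", detect(live, profile))
print("prediction alarm at sample", predict(live, profile))
```

With these hypothetical values, the prediction alarm is raised at an earlier sample than the detection alarm, which is the lead time an anomaly prediction scheme aims to provide. No labeled failure data is used at any point, which is what makes the sketch unsupervised.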