The present invention relates to online error detection for cloud infrastructure.
With the fast growth of the global cloud-computing market, a number of dedicated software services have emerged that manage various aspects (e.g., computing, storage, and networking) of the cloud. For instance, Amazon Elastic Compute Cloud (EC2) and Microsoft Azure are two widely used public cloud services that enable users to easily set up computing platforms in the cloud with multiple servers and configurable storage and networking (Infrastructure-as-a-service, IaaS). OpenStack is another popular platform providing IaaS for public and private clouds. OpenStack is open-source, and has been gaining popularity steadily in recent years.
Cloud services and platforms manage and provide convenient access to computing, storage, and networking resources in the cloud. For example, they provide tasks that let a user spawn, stop, or delete virtual machines (VMs). These tasks often involve coordination and communication among multiple processes (e.g., authentication, scheduling to assign VMs to machines, and booting up VMs) running on different machines. The complexity and non-determinism of these tasks can often result in subtle errors and performance issues that are hard to detect.
Conventional systems perform an offline analysis using log messages for different instances of a task. Such systems assume that the log messages carry identifiers that distinguish between different instances of a task, and they group the log messages based on those identifiers. They then create a model (e.g., a vector or an automaton) for each group and determine which of the models are anomalous.
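By way of illustration only, the conventional offline approach described above can be sketched in Python as follows. The log format (an identifier of the form `req-…` followed by the event text), the function names, the use of an event-count vector as the model, and the majority-based anomaly criterion are all assumptions made for this sketch. They are not taken from any particular conventional system.

```python
import re
from collections import Counter, defaultdict

# Hypothetical log format: "<instance-id> <event text>". Real logs will differ.
LINE_RE = re.compile(r"^(?P<instance>req-\w+)\s+(?P<event>.+)$")


def group_by_instance(lines):
    """Group event strings by the task-instance identifier found in each line."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines without an identifier cannot be grouped
            groups[match.group("instance")].append(match.group("event"))
    return dict(groups)


def build_models(groups):
    """Model each task instance as an event-count vector (a Counter)."""
    return {instance: Counter(events) for instance, events in groups.items()}


def find_anomalous(models):
    """Flag instances whose model differs from the most common model.

    Assumption for this sketch: the majority model represents normal behavior.
    """
    if not models:
        return []
    signatures = {inst: frozenset(model.items()) for inst, model in models.items()}
    normal_signature, _ = Counter(signatures.values()).most_common(1)[0]
    return sorted(inst for inst, sig in signatures.items() if sig != normal_signature)


# Interleaved messages from three instances of a hypothetical "spawn VM" task.
SAMPLE_LOG = [
    "req-1 authenticate user",
    "req-2 authenticate user",
    "req-1 schedule vm",
    "req-3 authenticate user",
    "req-2 schedule vm",
    "req-1 boot vm",
    "req-3 schedule vm",
    "req-2 boot vm",
]

anomalous = find_anomalous(build_models(group_by_instance(SAMPLE_LOG)))
print(anomalous)  # ['req-3'], the instance that never logged "boot vm"
```

The sketch depends entirely on every log message carrying an instance identifier. A message without one is silently dropped during grouping, which is the assumption attributed to conventional systems above.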