Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services). The services provided or accessed through cloud computing, such as via a network, can be referred to as cloud services. A cloud service provider must perform a considerable amount of processing to make cloud services available to a subscribing customer. Due to its complexity, much of this processing is still done manually. For example, provisioning resources for providing such cloud services can be a very labor-intensive process.
Data centers supporting cloud computing systems tend to be very large, comprising thousands of compute and storage servers and hundreds of network and other devices. For example, recent statistics suggest that there are 80,000 or more virtual machines with 540 PB or more of storage utilized for cloud computing systems provided globally by Oracle Corporation. There are at least 19 Tier 4 data centers and 62 million or more active users, resulting in 30 billion or more transactions daily. Manual administration of the cloud data centers, even using command tools such as MCollective or Chef and monitoring tools such as Graphite, can increase the cost of cloud services and can reduce the quality of services. Such tools may not react to and correct, in a timely manner, potential anomalies in system behavior, such as security breaches and anomalies affecting service level agreements (SLAs).
Some cloud computing system providers have implemented systems to diagnose and correct problems detected in their cloud computing systems; however, the details as to how such systems are configured to detect problems have not been defined for the entire cloud computing system. Some have implemented machine learning algorithms to assess log files and/or developed training data to establish what constitutes normal system behavior. The log files and/or the data may be compared to normal patterns, and any significant deviation is reported as an anomaly. Multi-variate analysis techniques (e.g., MSET) can compare multiple log files at the same time. Inferring normal behavior from the log files alone via unsupervised machine learning techniques can be prone to errors. Identifying computing issues based solely on log files, without regard to the system topology, processing flows, or log relationships, can introduce a lot of noise, because irrelevant combinations of log files may be analyzed, which may adversely affect the diagnosis of issues. The possible errors detected and reported by such systems are so broad that they are not amenable to programmatic corrective action. Human beings may need to be involved to address the problems.
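For illustration only, the comparison of observed data against a normal pattern described above can be sketched as follows. This is a minimal sketch and not any provider's actual implementation: the function name flag_deviations, the use of a mean and standard deviation as the "normal pattern", and the threshold of three standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def flag_deviations(baseline, observed, threshold=3.0):
    """Return indices of observed values that deviate significantly from the baseline.

    baseline  -- samples assumed to represent normal behavior (hypothetical input)
    observed  -- new samples of the same metric to check
    threshold -- number of baseline standard deviations treated as "significant"
                 (an assumed cutoff, not a value taken from any real system)
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [i for i, x in enumerate(observed) if x != mu]
    return [i for i, x in enumerate(observed) if abs(x - mu) / sigma > threshold]


# Hypothetical error counts per minute parsed from a log file.
normal_counts = [10, 12, 11, 9, 10, 11, 10, 9]
new_counts = [10, 11, 50, 9]
anomalous = flag_deviations(normal_counts, new_counts)
```

Such a check looks at a single metric in isolation. It has no knowledge of system topology, processing flows, or relationships between logs, which is the limitation noted above.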
Anomalies in a cloud computing system or an enterprise computing system can be caused by many factors, including load spikes, component failures, and/or malicious use of the system, and they are manifested in increased resource usage, deteriorating key performance indicators (KPIs), and spikes in errors in one or more containers. As a result of the challenges described above, quality of service (QoS) guarantees for service-level agreements (SLAs) may often not be met. At any given time, millions of hardware and software components can fail in a cloud computing system or an enterprise computing system. Users and operators alike can contribute human errors and unexpected loads that cause anomalies. Malicious users can cause outages affecting millions of users. These circumstances can lead to unsatisfactory QoS, resulting in violation of SLAs for cloud computing environments.
To deal with anomalies, some have attempted to monitor anomalies in near real time. These approaches involve collecting the state (metrics, logs, etc.) of the environment in centralized storage and programmatically analyzing the state for anomalies. Collection of the state of the environment may incur latency due to communication and aggregation of such data. The analysis takes additional time, and the result has to be communicated to the operations staff for manual correction of the anomaly following guidelines and scripts. Such corrective action may result in long latencies between the time the anomaly occurred and the time corrective action is taken. Collection and analysis of all log entries and metrics may be an inefficient use of resources, as most data in the log files corresponds to normal conditions. The data may provide a low signal-to-noise ratio, since anomalies are the signal to be identified. Further, because anomalies relate to infrequently occurring cases, such as crashes, deadlocks, long response times, etc., analysis of data for normal conditions may provide minimal value. Fine-grained detection of anomalies is sought to identify precursor events so as to avoid conditions resulting in violation of SLAs in the first place.