1. Field of the Invention
The present invention relates generally to cloud computing systems. More particularly, the present invention relates to collection and analysis of system events and correlation of such events with application behavior.
2. Description of Related Art
Cloud computing is a popular trend in modern computing. It disassociates applications and services from the underlying infrastructure resources such as servers, storage and networks. This permits a scalable and reliable architecture to run a large number of applications with flexible resource utilization. However, cloud computing comes at the cost of increased complexity and interdependency between the various parts of the computing system.
For instance, when running an application or a service in a cloud environment, the behavior of the application or service depends on a large number of variables that are typically not visible to the application or service owner. These include, for example, hardware malfunctions, network problems, power issues, system overload, system maintenance periods, peak traffic in adjacent applications, storage cells getting full, and so on. As a result, when an application or service owner notices anomalies in the behavior of the application or service, there is typically limited visibility into the various events that caused the anomalies. Thus, the owner may not have a clear direction on how to rectify or avoid the anomalies.
Various aspects of the cloud environment may be monitored separately. In this case, the application owner must have a comprehensive understanding of the entire environment and the other services hosted by it in order to understand the causes of anomalies in the system. Unfortunately, it is highly challenging to identify all the events and interactions that affect a certain application X (e.g., an adjacent service Y had a large data push that caused an overload on a disk array that is also used by a file system used by application X). As a result, the application owner is often at a loss to determine what caused degradation in the application. This may result in reduced system performance and may impact growth of the system in the future.