Automated monitoring systems observe information technology infrastructure along with complex software deployments, such as deployments within a cloud computing environment. They do so using metrics that represent the load, state, health, and behavior of each component in the infrastructure and of each component of the software deployed in that infrastructure.
The software deployments can include application instances executing on nodes within the cloud computing environment. These applications can have many components. The system topology, which reflects the relationships and dependencies between these components, can be complex.
Separate monitoring systems can monitor separate components. These monitoring systems can generate events over time as a product of the monitoring that they perform relative to the components. The events can indicate changes in component state (e.g., a system component shutdown), changes in component behavior (e.g., a change in system component performance), problems affecting the component, etc.
System administrators may analyze the events generated by monitoring systems in order to discover issues in the subsystems of a cloud deployment that they manage. A monitoring system can indicate a state or status of a component through an availability metric. A monitoring system can indicate a behavior of a component through values of a metric. A monitoring system can generate a problem event in response to detecting the violation of an administrator-specified rule. Such a rule can specify, for example, that a value of a specified metric is not to exceed a specified threshold value.
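The threshold rule described above can be illustrated with a minimal sketch. All names here (`ThresholdRule`, `ProblemEvent`, `evaluate`, and the metric name `disk_used_pct`) are hypothetical and chosen only for illustration; they are not taken from any particular monitoring system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical administrator-specified rule: the metric must not exceed the threshold."""
    metric: str
    threshold: float


@dataclass(frozen=True)
class ProblemEvent:
    """Hypothetical problem event emitted when a rule is violated."""
    component: str
    metric: str
    value: float
    threshold: float


def evaluate(rule: ThresholdRule, component: str, metric: str, value: float) -> Optional[ProblemEvent]:
    """Return a ProblemEvent if the observed value violates the rule, else None."""
    # The rule applies only to its own metric, and only a value strictly
    # above the threshold counts as a violation ("not to exceed").
    if metric == rule.metric and value > rule.threshold:
        return ProblemEvent(component, metric, value, rule.threshold)
    return None


if __name__ == "__main__":
    rule = ThresholdRule(metric="disk_used_pct", threshold=90.0)
    print(evaluate(rule, "db-1", "disk_used_pct", 95.0))  # a violation
    print(evaluate(rule, "db-1", "disk_used_pct", 42.0))  # no event
```

In this sketch a value equal to the threshold does not produce an event, which is one reasonable reading of "not to exceed"; a real system might define the boundary differently.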
A cloud computing environment can be highly dynamic. Applications deployed within the cloud computing environment can be heavily utilized. Heavy utilization can result in the consumption of system resources (e.g., storage devices such as hard disk drives can fill up). Heavy utilization also can cause hardware problems and changes in application performance. The dynamic nature of the cloud computing environment also can be due to system operators making changes to the environment. For example, system operators can deploy additional servers, shut down servers for maintenance, deploy or undeploy applications, etc. All of these situations can cause monitoring systems to produce events pertaining to affected components.
The vastness and topological complexity of a cloud deployment's infrastructure and applications contribute to the generation of huge volumes of events arising from situations that can impact numerous different subsystems within the cloud deployment. A change in one system component's state or behavior can influence several other system components, potentially for the worse. For example, when a storage device becomes nearly full, many separate components that store data to the storage device can be impacted.
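As a rough illustration of how a change in one component can reach many others, the sketch below walks a dependency graph breadth-first to collect every component that directly or indirectly depends on an affected one. The topology, the component names, and the function `impacted_components` are assumptions made up for this example.

```python
from collections import deque
from typing import Dict, Iterable, Set


def impacted_components(dependents: Dict[str, Iterable[str]], source: str) -> Set[str]:
    """Return all components that directly or transitively depend on `source`.

    `dependents` maps a component to the components that rely on it.
    The source itself is not included in the result.
    """
    seen: Set[str] = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            # Skip components already visited so cycles cannot loop forever.
            if nxt != source and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical topology: a storage device used by two services,
# one of which is in turn used by a web application.
DEPENDENTS = {
    "storage-1": ["database", "log-service"],
    "database": ["web-app"],
    "log-service": [],
    "web-app": [],
}

if __name__ == "__main__":
    print(sorted(impacted_components(DEPENDENTS, "storage-1")))
```

Each impacted component may be watched by its own monitoring system, so a single nearly full storage device could plausibly surface as separate events for every component in the returned set.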
Although events are designed to help system administrators analyze and remedy problems within a cloud deployment, when very large numbers of events are generated in relation to a great many different components, the deluge of information can be difficult to comprehend. The difficulty is compounded by the dynamic nature of a cloud deployment as described above; a change in system topology or in system behavior can cause different events to be generated even in response to the same recurring problem.