As is known in the art, cloud computing systems, even in pre-architected and pre-qualified environments, contain a relatively large number of hardware devices and components and software applications, modules, and components. In the presence of a fault, alert, or other condition needing attention, it can be difficult to identify the source of the fault or alert, since there are many complex components, possibly provided by multiple vendors, which may make it difficult to correlate information in an efficient manner.
For example, in a cloud computing environment, alerts and events from the various event sources in a platform normally contain limited information that may not be meaningful and may seem unrelated to the environment from which they originate. It is challenging for IT personnel to extract actionable data from the alerts and take appropriate action.
With large volumes of alerts/events constantly coming from various sources, it is time consuming to troubleshoot all of them without prioritization. It is challenging to prioritize the alerts/events and take appropriate actions without correlating them and knowing which of the alerts or events are root causes and which are just symptoms. In addition, many IT resources are managed in silos by IT personnel specialized in certain technology domains. For example, when a blade in the Cisco Unified Computing System (UCS) fails or has performance issues, its impact propagates to the ESX server deployed on the blade, to the virtual machines deployed on the ESX server, to the applications or critical services running on those virtual machines, and to the critical business that relies on those services. It may take hours or even days to sort through those alerts or events, which may result in significant detrimental impact on an enterprise.
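By way of illustration only, the propagation path described above can be sketched as a walk over a dependency graph. The entity names and the `DEPENDENTS` mapping are hypothetical and are not taken from any particular product; the sketch assumes that the topology is already known and is acyclic in practice.

```python
from collections import deque

# Hypothetical topology: each key hosts or supports the entries in its list.
DEPENDENTS = {
    "ucs-blade-1": ["esx-server-1"],
    "esx-server-1": ["vm-1", "vm-2"],
    "vm-1": ["app-billing"],
    "vm-2": ["app-billing", "app-reports"],
    "app-billing": ["business-invoicing"],
    "app-reports": [],
    "business-invoicing": [],
}


def impacted_by(failed, dependents):
    """Return every entity reachable from `failed`, in breadth-first order."""
    seen = {failed}
    order = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:  # visit each entity once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # Lists the ESX server, virtual machines, applications, and business
    # service that depend on the failed blade.
    print(impacted_by("ucs-blade-1", DEPENDENTS))
```

A walk of this kind identifies which entities may be affected, but by itself it does not distinguish root causes from symptoms among the alerts those entities raise.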
Some existing products do not correlate events from external sources. They poll status directly from the sources and generate their own events. There is no correlation between the events they generate and the events from the sources.
Some other products in the market, such as VMware vCenter Operations, perform only loose correlation based on topology relationships. They do not account for the fact that different events on the same object may have different causality. For example, two events on a blade, “inoperable” and “unreachable”, have different implications. The former implies that the ESX server on the blade is definitely not functioning, while the latter simply means that the blade cannot be reached, although the ESX server may still be functioning.
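The distinction can be illustrated with a minimal sketch that contrasts topology-only correlation with a lookup keyed on both the object type and the event name. The `CAUSALITY` table, the function names, and the labels "certain", "possible", and "related" are assumptions made for illustration and do not describe the behavior of any named product.

```python
# Hypothetical causality table keyed by (object type, event name).
# "certain": the ESX server on the blade is definitely not functioning.
# "possible": the blade cannot be reached, but the ESX server may still run.
CAUSALITY = {
    ("blade", "inoperable"): "certain",
    ("blade", "unreachable"): "possible",
}


def esx_impact(object_type, event_name):
    """Event-aware correlation: the implication depends on the specific event."""
    return CAUSALITY.get((object_type, event_name), "unknown")


def topology_only_impact(object_type, event_name):
    """Topology-only correlation: every blade event is treated alike."""
    return "related" if object_type == "blade" else "unknown"


if __name__ == "__main__":
    for event in ("inoperable", "unreachable"):
        print(event, topology_only_impact("blade", event), esx_impact("blade", event))
```

Under the topology-only function both blade events yield the same result, whereas the event-aware lookup separates a definite ESX server failure from a mere loss of reachability.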