As computer systems and their applications grow larger, monitoring all of the various applications on a system becomes increasingly difficult. In particular, some systems may be distributed geographically (for example, in cloud computing), and multiple applications may run on multiple processors within a single computer system.
Further, these computer systems may be dynamically configured, with applications moving between processors as necessary. Additionally, the physical computer system may be dynamically configured, with additional processors brought online as needed by the various applications. Monitoring such systems is extremely complex, and it is difficult to configure monitoring systems so that they sufficiently monitor all of the various applications, provide a user with sufficient and easily understandable alerts, and possibly repair some application problems automatically.
Overview
In an embodiment, an application performance management system including a communication interface and a processing system is provided. The communication interface is configured to communicate with an agent deployed within a target computing system. The agent is configured to monitor a plurality of hierarchical operational elements that are executed within the target computing system.
The processing system is coupled with the communication interface, and is configured to receive first metrics associated with a first operational element from the agent, and to receive second metrics associated with a second operational element from the agent, wherein the second operational element is at a different hierarchical level than the first operational element.
The processing system is also configured to process the first and second metrics to determine at least one operational fault within the target computing system, and to determine one or more hierarchical levels of the at least one operational fault to identify a related operational element associated with a lowest hierarchical level of the at least one operational fault. The processing system is further configured to issue a status report to a user indicating the at least one operational fault and an identity of the related operational element.
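As an illustration only, the fault localization described above might be sketched as follows. The class `OperationalElement`, the `depth` field (0 at the top of the hierarchy, larger values at lower levels), the threshold-based fault test, and the metric name `error_rate` are all assumptions introduced for this sketch and are not taken from the embodiments themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OperationalElement:
    """Hypothetical record for one monitored element and its reported metrics."""
    name: str
    depth: int  # assumed convention: 0 = top of hierarchy, larger = lower level
    metrics: Dict[str, float] = field(default_factory=dict)


def is_faulty(element: OperationalElement, limits: Dict[str, float]) -> bool:
    """Assumed fault test: any metric above its configured limit."""
    return any(element.metrics.get(name, 0.0) > limit
               for name, limit in limits.items())


def locate_fault(elements: List[OperationalElement],
                 limits: Dict[str, float]) -> Optional[OperationalElement]:
    """Return the faulty element at the lowest hierarchical level, if any."""
    faulty = [e for e in elements if is_faulty(e, limits)]
    if not faulty:
        return None
    return max(faulty, key=lambda e: e.depth)


def status_report(elements: List[OperationalElement],
                  limits: Dict[str, float]) -> str:
    """Build a one-line report naming the fault and the related element."""
    culprit = locate_fault(elements, limits)
    if culprit is None:
        return "OK: no operational fault detected"
    return (f"FAULT: lowest affected element is {culprit.name} "
            f"(depth {culprit.depth})")
```

In this sketch, a fault observed at both a host and an application running on that host would be reported against the application, since it sits at the lower level of the two faulty elements.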
In another embodiment, a method of managing a plurality of hierarchical operational elements executing within a target computing system is provided. The method includes receiving first metrics associated with a first operational element from an agent deployed within the target computing system, and receiving second metrics associated with a second operational element from the agent, wherein the second operational element is at a different hierarchical level than the first operational element.
The method also includes processing the first and second metrics to determine at least one operational fault within the target computing system, and determining one or more hierarchical levels of the at least one operational fault to identify a related operational element associated with a lowest hierarchical level of the at least one operational fault. The method further includes issuing a status report to a user indicating the at least one operational fault and an identity of the related operational element.
In a further embodiment, one or more non-transitory computer-readable media having stored thereon program instructions to operate an application performance management system are provided. The program instructions, when executed by processing circuitry, direct the processing circuitry to at least receive first metrics associated with a first operational element from an agent deployed within a target computing system, the target computing system executing a plurality of hierarchical operational elements.
The program instructions also direct the processing circuitry to at least receive second metrics associated with a second operational element from the agent, wherein the second operational element is at a different hierarchical level than the first operational element, and to process the first and second metrics to determine at least one operational fault within the target computing system.
The program instructions further direct the processing circuitry to at least determine one or more hierarchical levels of the at least one operational fault to identify a related operational element associated with a lowest hierarchical level of the at least one operational fault, and to issue a status report to a user indicating the at least one operational fault and an identity of the related operational element.
In another embodiment, a method of identifying a status for an operational element includes collecting a first plurality of metrics associated with a first operational element. A second plurality of metrics associated with a second operational element is also collected. An expert rule based on the first plurality of metrics and the second plurality of metrics is applied to determine a selected status for the first operational element.
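One possible shape for such an expert rule is sketched below. The function name `expert_rule`, the choice of latency and CPU utilization as the two metric sets, the limits, and the status labels are hypothetical; the embodiment does not specify them.

```python
from statistics import mean
from typing import Sequence


def expert_rule(first_latency_ms: Sequence[float],
                second_cpu_pct: Sequence[float],
                latency_limit_ms: float = 500.0,
                cpu_limit_pct: float = 90.0) -> str:
    """Select a status for the first element using metrics from both elements.

    Assumed rule: a slow first element is only a "warning" when the second
    element (for example, its host) is overloaded, because the likely cause
    lies there; otherwise the slowness is attributed to the first element.
    """
    slow = mean(first_latency_ms) > latency_limit_ms
    busy = mean(second_cpu_pct) > cpu_limit_pct
    if slow and busy:
        return "warning"
    if slow:
        return "critical"
    return "normal"
```

The point of combining both metric sets is that the same observation on the first element can lead to different statuses depending on what the second element reports.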