A first type of existing application performance management (APM) solution relies on deep inspection, i.e. at code level, of the processes running on the monitored IT infrastructure. The inspection is performed by an agent that is installed on the monitored server or IT infrastructure and runs inside the monitored applications. The agent, for instance, identifies individual queries of a database access application that are running slowly.
The APM solution from New Relic relies on an agent that identifies the performance impact of specific code segments or Structured Query Language (SQL) queries. This solution is described, for example, in the description of application monitoring on New Relic's website: newrelic.com.
The APM solution from AppDynamics likewise provides code-level visibility into performance problems. Information on this solution is available, for example, in the description of application performance management on AppDynamics's website: AppDynamics.com.
The agent that inspects performance at code level in this first type of APM solution consumes substantial resources, e.g. CPU time, of the monitored server or IT infrastructure, and consequently degrades its performance.
Further, agent-based APM solutions enable only bottom-up analysis. The agent will, for instance, detect that certain database queries are slow and report this. The slow database queries, however, may result from a back-up process running in parallel, and the agent is unable to link the slow queries to that back-up process. Such a link can only be established through manual inspection of the data received from the agent(s).
A second type of existing APM solution relies on a lightweight agent installed on the monitored server or IT infrastructure. The lightweight agent does not run inside the monitored application but reports higher-level server performance metrics, e.g. CPU usage, disk occupancy, memory usage, etc. An example of this second type of APM solution is known from Boundary and is described, for example, in the product descriptions on Boundary's website: boundary.com.
A lightweight agent that does not run in the monitored application(s) has a smaller impact on the resource usage of the monitored server or IT infrastructure. The lightweight agent, however, requires intelligence elsewhere to pinpoint the source of problems efficiently and with minimal false alarms.
Existing solutions based on a lightweight agent, such as the one from Boundary, visualize the metrics obtained from the lightweight agent, but pinpointing the source of problems remains a manual process.
A third type of existing APM solution operates without agents. Such agentless solutions do not impact the resource usage of the monitored server(s) or IT infrastructure.
Agentless APM solutions, however, are less accurate, i.e. they generate more false alarms, since no performance metrics obtained on the monitored server or IT infrastructure can be taken into account.
United States Patent Application US 2013/0110761, entitled “System and Method for Ranking Anomalies”, describes an APM system wherein lightweight agents collect performance metrics, e.g. CPU idle time, and a central processor detects anomalies associated with one or more data metrics. To tackle the problem of growing data centres and a growing number of anomalies, the anomalies are ranked or categorized by severity or criticality.
Although the system known from US 2013/0110761 relies on lightweight agents that are non-intrusive for the monitored servers and applications, it lacks the ability to identify the source of problems. Anomalies are ranked or prioritized to tackle the scalability problem, but the system known from US 2013/0110761 fails to exploit synergies or relations between detected anomalies.
United States Patent Application US 2008/0235365, entitled “Automatic Root Cause Analysis of Performance Problems Using Auto-Baselining on Aggregated Performance Metrics”, describes a system for application performance monitoring with agents that report metrics, e.g. response time, error count and CPU load, together with transaction identifiers or other transaction context data identifying execution paths and/or calling relationships between components of the system (see FIG. 9: 900 and 925). The transaction information is used to correlate the performance metrics (see FIG. 9: 905 and 930) in an attempt to deal with the scalability problem caused by ever-growing data centres and the massive volumes of performance metrics reported from them. Thereafter, anomalies are detected by comparing metrics to (adaptive) baselines, and a drill-down procedure is used to pinpoint the root cause and identify anomalous components of the system.
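To make the notion of comparing a metric against an adaptive baseline concrete, the following is a minimal sketch under stated assumptions. It assumes an exponentially weighted moving average (EWMA) of the mean and variance of a single metric and flags a sample as anomalous when it deviates from the baseline mean by more than k standard deviations. The class name `AdaptiveBaseline` and the parameters `alpha`, `k` and `warmup` are hypothetical illustrations; this generic technique is not asserted to be the baselining method of US 2008/0235365.

```python
import math
from dataclasses import dataclass


@dataclass
class AdaptiveBaseline:
    """Hypothetical EWMA baseline for one metric, e.g. response time in ms.

    Assumption: a sample is anomalous when it deviates from the current
    baseline mean by more than k baseline standard deviations.
    """
    alpha: float = 0.1   # smoothing factor: weight given to the newest sample
    k: float = 3.0       # deviation threshold, in standard deviations
    warmup: int = 5      # number of initial samples that are never flagged
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def update(self, value: float) -> bool:
        """Compare `value` to the baseline, then fold it in. Returns True if anomalous."""
        self.count += 1
        if self.count == 1:
            # The first sample initialises the baseline; nothing to compare against yet.
            self.mean = value
            return False
        deviation = value - self.mean
        std = math.sqrt(self.var)
        anomalous = self.count > self.warmup and abs(deviation) > self.k * std
        # Incremental EWMA update of mean and variance; the baseline adapts to
        # every sample, including anomalous ones (a simplifying assumption).
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


if __name__ == "__main__":
    baseline = AdaptiveBaseline()
    samples = [100, 101, 99, 100, 101, 99, 100, 101, 250]
    flags = [baseline.update(s) for s in samples]
    print(flags)  # only the final sample (250) is flagged
```

Because the baseline keeps absorbing new samples, a sustained shift in the metric gradually becomes the new normal, which is the "adaptive" property; the trade-off is that a slowly developing problem may never cross the threshold.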
The approach followed in US 2008/0235365 is disadvantageous for two reasons. Firstly, the central manager in the system of US 2008/0235365 relies on transaction information, i.e. information that specifies the sequence of hosts called for each transaction, as illustrated for instance by FIG. 3a to FIG. 3d of US 2008/0235365. Such detailed information can only be collected and reported at application level, which requires agents on the hosts that are very intrusive in terms of their impact on the performance of the monitored servers or hosts. Secondly, correlating massive volumes of performance metrics using massive volumes of transaction information is demanding for the central manager. In other words, the excessive information exchanged between the agents and the central manager results in overhead, both on the agent side and on the central manager side.