Real-time monitoring of the performance of production software applications, especially of high-transaction-volume, revenue-generating applications such as e-commerce applications, has become crucial for the successful operation of such applications, because even short-term performance or functionality issues can have a considerable impact on the customer base and the revenue of such applications.
However, the monitoring and alerting demands of application operation teams, which are responsible for unobstructed, outage-minimizing operation of such applications and require information about the overall situation of the application, deviate from the demands of software architects and programmers, who are responsible for fast identification and elimination of the program code causing performance or functionality problems and require high-detail transaction execution performance information, down to the granularity of individual method executions.
Existing monitoring systems that aim to trace and monitor individual transactions at the granularity level required by software architects and programmers have reached a level of unobtrusiveness in terms of operation, and of efficiency in terms of monitoring-caused overhead, that allows such monitoring systems to be employed in the day-to-day operation of large, high-volume production software applications. A detailed description of such a monitoring system can be found in U.S. Pat. No. 8,234,631 “Method and system for tracing individual transactions at the granularity level of method calls throughout distributed heterogeneous applications without source code modifications” by Bernd Greifeneder et al., which is incorporated herein by reference in its entirety.
Although such systems provide the data required by software engineers to identify and fix isolated performance problems, the granularity of the provided data is far too fine to allow operation teams a fast and precise judgment of the overall situation of a monitored application.
Especially in high-load scenarios where applications receive hundreds or even thousands of requests per minute, resulting in the execution of hundreds or thousands of complex transactions per minute, a situation which is typical for modern e-commerce applications, conventional threshold-based alerting systems are inadequate due to the large number of false-positive alerts they generate. The reason for this is the large number of transactions, which increases the probability of performance outliers that reflect only a negligible fraction of the performed transactions. Such outliers are also negligible from an application-operational and financial point of view, but would still trigger undesired alerts. Even baseline-oriented alerting systems, which use historic performance measurements to establish expected values for current and future measurements, run into the same problem because they use the baseline threshold to create alerts based on single measurements.
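The false-positive effect described above can be illustrated with a minimal sketch. The function names, the window size, the response times and the threshold are hypothetical values chosen for illustration only and are not taken from any particular monitoring system.

```python
from typing import List


def single_measurement_alerts(response_times_ms: List[float],
                              threshold_ms: float) -> List[float]:
    """Conventional alerting: one alert per measurement above the threshold."""
    return [t for t in response_times_ms if t > threshold_ms]


def outlier_fraction(response_times_ms: List[float],
                     threshold_ms: float) -> float:
    """Fraction of all measurements in the window that exceed the threshold."""
    if not response_times_ms:
        return 0.0
    alerts = single_measurement_alerts(response_times_ms, threshold_ms)
    return len(alerts) / len(response_times_ms)


# Hypothetical one-minute window: 2000 transactions, 5 of them slow outliers.
window = [100.0] * 1995 + [5000.0] * 5
THRESHOLD_MS = 1000.0  # assumed static or baseline-derived threshold

alerts = single_measurement_alerts(window, THRESHOLD_MS)
fraction = outlier_fraction(window, THRESHOLD_MS)

print(f"alerts raised: {len(alerts)}")
print(f"fraction of affected transactions: {fraction:.4f}")
```

In this assumed window, every one of the five outliers raises its own alert although they represent only a quarter of a percent of the executed transactions.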
Application operation teams mostly rely on infrastructure monitoring systems, which monitor the utilization of infrastructure resources, such as CPU, disk or memory usage of the host computers running the monitored applications, to determine the health state of an application and to decide on appropriate countermeasures for detected or anticipated performance problems. As an example, the memory consumption of a process running an application may be monitored. In case the memory consumption exceeds a specific limit, the application process is restarted. Although this approach fulfills the needs of application operation, and may, in case of an existing clustering/failover system, cause no lost or failed transactions, it does not provide analysis data that helps to identify and fix the root cause (e.g. a memory leak) of the problem.
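The restart policy from the example above may be sketched as follows. The names, the sample values and the limit are hypothetical and serve only to illustrate the approach.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RestartEvent:
    """Records only that the limit was exceeded, not why."""
    sample_index: int
    memory_mb: float


def restart_decisions(memory_samples_mb: List[float],
                      limit_mb: float) -> List[RestartEvent]:
    """Return a restart event for every sample exceeding the memory limit."""
    return [
        RestartEvent(index, memory)
        for index, memory in enumerate(memory_samples_mb)
        if memory > limit_mb
    ]


# Hypothetical memory samples of one application process, in megabytes.
samples = [512.0, 768.0, 1100.0, 300.0, 900.0, 1250.0]
events = restart_decisions(samples, limit_mb=1024.0)

for event in events:
    print(f"restart after sample {event.sample_index}: {event.memory_mb} MB")
```

The resulting events contain only the observed memory consumption. They carry no information about the transactions or the code that allocated the memory, which corresponds to the missing root cause data noted above.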
The tendency to outsource and concentrate operation of such applications to external data-centers or even to cloud computing systems adds another dimension of complexity to the problem of identifying the root cause of performance or functionality problems of production applications, because it may blur the relationship between an application and the computing environment used to execute the application. In such environments, computing infrastructure, such as host computer systems or processes running virtual machines, may be dynamically assigned to applications depending on the current load situation.
As a consequence, a monitoring and alerting system is required that fulfills the needs of both software development and maintenance teams and of application operation teams. It should, on the one hand, provide transaction tracing and monitoring data at the finest granularity level, allowing the detection of the code causing a performance problem, and, on the other hand, produce reliable alerts that are resistant to outliers and false positives, as required by application operation teams.
The desired solution should also be able to cope with outsourced applications or multi-application data-centers, where the monitoring system should be capable of identifying and monitoring multiple applications or application components, such as a product search component or a product purchase component. Additionally, the desired solution should reduce the configuration effort required to identify and monitor applications or application components to a minimum.
This section provides background information related to the present disclosure which is not necessarily prior art.