Object oriented and bytecode based software development platforms, including the SUN MICROSYSTEMS® JAVA™ platform and the MICROSOFT®.NET platform, have gained wide acceptance for developing Internet and Enterprise class software applications. Bytecode based software provides cross-platform and cross-language compatibility and eases the networked integration of software applications.
Remote method invocation mechanisms available for these platforms, such as SUN MICROSYSTEMS® Remote Method Invocation (RMI) and the MICROSOFT®.NET Remoting system, and messaging services such as the SUN MICROSYSTEMS® JAVA™ Messaging Service (JMS) or the MICROSOFT® Messaging Queue, ease the creation of distributed and loosely coupled architectures.
Approaches like service oriented architecture (SOA) use these features to provide flexible application architectures which can be adapted to rapidly changing market demands.
Although this flexibility eases building and updating the functionality of applications, it constitutes a challenge for conventional performance monitoring and tracing tools, which traditionally consider the scope of an isolated application. Most existing tools are not prepared to trace transactions over the borders of threads or different virtual machines.
Following the execution path of a transaction over the borders of threads, processes or different host systems is essential for tracing complete end-to-end transactions, which may be processed by a chain of different application servers that may communicate in various ways.
Information that depicts the different processing stages on different application servers and provides specific performance information for the processing stages is a precondition for performance analysis of distributed applications. To provide such information, it is required to correlate isolated trace information acquired from the different servers participating in a transaction, to depict a consistent end-to-end transaction.
Some available systems provide mechanisms for tracing distributed transactions, but they either depend on specific properties of the monitored system, such as synchronized clocks on the servers involved in distributed transactions, or generate insufficient correlation information. In the latter case, the generated correlation information is sufficient to reconstruct parent-child relationships between parts of a distributed transaction executed on different servers, but not the exact sequence in which child parts of the transaction were activated. Other systems only provide post-mortem analysis of transactions, or are not able to analyze blocked or stalled transactions.
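The difference between parent-child correlation and sequence correlation can be made concrete with a small sketch. The record layout and the field names `id`, `parent_id` and `fork_index` are hypothetical and chosen only for illustration; they are not taken from any of the systems discussed here. With `parent_id` alone, the set of children of a transaction part can be recovered, but not the order in which they were activated. Recovering the order needs additional information, here assumed to be a per-parent activation counter.

```python
def children_of(parts, parent_id):
    """Return the ids of all child parts that reference the given parent part."""
    return {p["id"] for p in parts if p["parent_id"] == parent_id}


def activation_order(parts, parent_id):
    """Return child ids in activation order.

    This only works because each record carries fork_index, an assumed
    counter that the parent increments on every activation.
    """
    kids = [p for p in parts if p["parent_id"] == parent_id]
    return [p["id"] for p in sorted(kids, key=lambda p: p["fork_index"])]


# Trace parts as they might arrive at a correlating instance, in arbitrary order.
parts = [
    {"id": "B2", "parent_id": "A", "fork_index": 2},
    {"id": "B1", "parent_id": "A", "fork_index": 1},
    {"id": "A", "parent_id": None, "fork_index": 0},
]

unordered = children_of(parts, "A")      # parent-child relation only
ordered = activation_order(parts, "A")   # exact activation sequence
```

Arrival order at the correlating instance is not a substitute for such a counter, because parts produced on different servers may arrive in any order.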
Some existing monitoring systems have memory requirements that depend on the nesting depth of executed methods. This makes the memory overhead caused by the monitoring system impossible to predict and may crash the monitored system in case of deeply nested method executions, which may occur, for example, in recursive methods.
JaViz [1] is a monitoring system developed by IBM which allows tracing of distributed transactions running on JAVA™ platforms, using the SUN MICROSYSTEMS® RMI framework for communication. The system modifies the virtual machines in use so that they generate tracing data for each executed method, which is written to trace files. The trace files contain statistical information about local method calls, outbound remote method calls and inbound remote method service requests. The trace files also contain correlation data which makes it possible to match outbound remote method calls invoked on one virtual machine with the corresponding inbound remote method service requests received on another virtual machine. After a distributed transaction has terminated, a merging tool is executed, which evaluates the correlation data stored in the trace files generated by the involved virtual machines. The merging tool generates an overall trace file which describes the whole distributed transaction. The resulting trace file is interpreted by a visualization tool which provides a tree-like view of the transaction. Although JaViz provides useful information for analyzing distributed transactions, the restriction to post-mortem analysis and the relatively complex handling of the different trace files exclude this approach from usage in production environments.
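The matching step performed by such a merging tool can be illustrated with a minimal sketch. The record layout, the field names `kind`, `vm`, `method` and `call_id`, and the function `merge_traces` are assumptions made for illustration; they do not reproduce the JaViz trace file format.

```python
def merge_traces(trace_files):
    """Pair each outbound remote call with the inbound request of the same call_id.

    trace_files: list of per-VM record lists (hypothetical layout).
    Returns a list of (caller_record, callee_record) pairs.
    """
    outbound = {}
    inbound = {}
    for records in trace_files:
        for rec in records:
            if rec["kind"] == "outbound":
                outbound[rec["call_id"]] = rec
            elif rec["kind"] == "inbound":
                inbound[rec["call_id"]] = rec
    # Only calls that were recorded on both sides can be linked.
    return [(outbound[cid], inbound[cid]) for cid in sorted(outbound) if cid in inbound]


# Trace records written by two virtual machines, A (caller) and B (callee).
vm_a = [
    {"kind": "local", "vm": "A", "method": "placeOrder"},
    {"kind": "outbound", "vm": "A", "method": "checkStock", "call_id": 17},
]
vm_b = [
    {"kind": "inbound", "vm": "B", "method": "checkStock", "call_id": 17},
]

links = merge_traces([vm_a, vm_b])
```

In this sketch the merge needs the complete trace files of all participating virtual machines as input, which reflects why an approach of this kind is limited to analysis after the transaction has finished.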
The Application Response Measurement framework (ARM) [2], a standard for monitoring application performance, created by Hewlett-Packard and Tivoli WebSites, provides infrastructure for real-time monitoring of distributed transactions. To trace transactions with ARM, calls to ARM methods are inserted at the entry point and all exit points of each method to be monitored. This requires access to the source code of the monitored application and the ability to rebuild the application after the ARM calls have been added. Access to the application source is often difficult or even impossible. Additionally, the need to adapt the application source code and rebuild the application whenever the set of monitored methods changes makes this approach inflexible with respect to adapting the monitoring configuration.
The system described in [3] and [4] combines the ARM framework with bytecode instrumentation, and thus removes the requirement to adapt the application source code to install monitoring code. The described system creates a stack data structure in thread local storage which mirrors the current method call stack. The stack data structure is used to correlate method calls to the method execution sequence performed in the local thread. The memory consumption of the stack data structure grows with the nesting depth of the instrumented methods and can become a severe problem if the nesting level becomes high or unpredictable, as is possible, for example, in recursive method calls. The system places instrumentation code at entries and exits of monitored methods. Entry instrumentations create and initialize a record for storing performance data, and exit instrumentations update the record and send it to an instance which analyzes and visualizes the record. This approach keeps network traffic low because only one data record is sent for each monitored method call, but it causes problems in handling blocked or stalled transactions. If a transaction is blocked or stalled, in the worst case no method is exited and thus no monitoring data of the blocked transaction is generated. If a monitored method activates another thread, either via an explicit thread switch or by invoking a remote method, the system generates correlation information which identifies the activating method, the thread that executes it and the server which is hosting the virtual machine. The correlation information is transferred to the activated thread and allows correlating the activated thread with the activating instrumented method. However, if one monitored method performs multiple thread activations, the provided correlation information is not sufficient to reconstruct the sequence of these activations. Knowing the sequence of the activations would be very useful for analyzing problems caused by race conditions between the activated threads.
The system described in [5] provides tracing mechanisms which can be dynamically enabled and disabled. The system uses bytecode instrumentation to place entry interceptors and internal interceptors in the monitored application. If a transaction invokes an entry interceptor, the entry interceptor first evaluates a set of rules to decide whether the transaction should be traced. An interceptor consists of monitoring code placed at the entry and at each exit of instrumented methods. Interceptors produce and send measurement data when executing the code placed at method exits. This leads to problems with blocked transactions, as described before. The system allows tracing transactions which span multiple servers, but it uses timestamp information for correlation, and thus requires synchronized clocks at the involved servers, which is a requirement that is often hard to fulfill.
The system described in [6] also aims to trace distributed transactions, but like the approaches discussed before, it does not address blocked transactions. Additionally, it uses timestamp data for correlation and thus requires synchronized clocks.
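Why timestamp-based correlation depends on synchronized clocks can be shown with a short worked example. The function name, its parameters and the numbers are illustrative assumptions; server A's clock is taken as the reference.

```python
def observed_start_gap_ms(call_sent_ms, network_delay_ms, skew_b_ms):
    """Gap between the callee start (read on B's clock) and the call (read on A's clock).

    skew_b_ms is how far server B's clock runs ahead (+) or behind (-) server A's.
    """
    call_sent_on_a = call_sent_ms
    callee_start_on_b = call_sent_ms + network_delay_ms + skew_b_ms
    return callee_start_on_b - call_sent_on_a


gap_synced = observed_start_gap_ms(1000, 5, 0)    # clocks agree
gap_skewed = observed_start_gap_ms(1000, 5, -20)  # B's clock is 20 ms behind
```

With synchronized clocks the callee appears to start 5 ms after the call was sent. With B's clock 20 ms behind, the gap becomes -15 ms, so the callee appears to start before the call that activated it, and an ordering derived from timestamps alone is wrong.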
Consequently, there is a need for a monitoring system that allows tracing of distributed end-to-end transactions, which overcomes the shortcomings of currently existing approaches.
Additionally, visibility of the resources used by a transaction is required, for example heap memory allocations for the creation of new objects performed during transaction execution, or time spent synchronizing with other, concurrent transactions.
Another important missing feature is the ability to distinguish between the time a transaction actually spent executing and the time it was suspended due to maintenance tasks of the underlying virtual machine, such as running the garbage collector or re-compiling bytecode.
[1] Kazi et al., “JaViz: A client/server Java profiling tool”, IBM SYSTEMS JOURNAL, VOL 39, NO 1, 2000
[2] “Monitoring and Diagnosing Applications with ARM 4.0”, http://www.opengroup.org/tech/management/arm, 2004
[3] Rees et al., “Synthesizing Application Response Measurement (ARM) Instrumentation”, Hewlett-Packard, US2005/0039172 A1
[4] Avakian et al., “Using Interceptors and Out-of-Band Data to monitor the Performance of Java 2 Enterprise Edition (J2EE) Applications”, Hewlett-Packard, US2005/0039171 A1
[5] Fung et al., “Method for Tracing Application Execution Path in a Distributed Data Processing System”, Poon Fung, Cupertino, Calif. (US), US200717194664 B1
[6] Maccabee et al., “Application End-to-End Response Time Measurement and Decomposition”, International Business Machines Corporation, US2000/6108700