Object-oriented, bytecode-based software development platforms, including Sun Microsystems' Java and Microsoft's .NET platform, have gained wide acceptance for developing Internet and enterprise-class software applications. Bytecode-based software provides cross-platform and cross-language compatibility and eases the networked integration of software applications.
Remote method invocation mechanisms available for these platforms, such as Sun Microsystems' RMI and Microsoft's .NET Remoting, and messaging services such as Sun Microsystems' Java Message Service (JMS) or Microsoft's Message Queuing (MSMQ), ease the creation of distributed and loosely coupled application architectures.
Approaches like service-oriented architecture (SOA) use these capabilities to provide flexible application architectures which can be adapted to rapidly changing market demands.
Although this flexibility eases building and updating the functionality of applications, it constitutes a challenge for conventional performance monitoring and tracing tools which traditionally perform monitoring only within the scope of an isolated application. Most existing tools are not prepared to trace transactions over the borders of threads or different virtual machines.
Following the execution path of a transaction over the borders of different threads, processes, or host systems is essential for tracing complete end-to-end transactions, which may be processed by a chain of different application servers that may communicate in various ways.
Information that depicts the different processing stages on different application servers and provides specific performance information for those stages is a precondition for performance analysis of distributed applications. To provide such information, isolated trace information acquired from the different servers participating in a transaction must be correlated to depict a consistent end-to-end transaction.
Increasing requirements in transaction visibility and monitoring overhead reduction create demand for more lightweight transaction tracing systems which cause less overhead in terms of processing time and memory usage.
Some available systems provide mechanisms for tracing distributed transactions, but they either depend on specific properties of the monitored system, such as synchronized clocks of the servers involved in distributed transactions, or generate insufficient correlation information. In the latter case, the generated correlation information is sufficient to reconstruct parent-child relationships between parts of a distributed transaction executed on different servers, but it is not sufficient to reconstruct the exact sequence in which child parts of the transaction were activated. Other systems only provide post-mortem analysis of transactions, or are not able to analyze blocked or stalled transactions.
Some of the existing monitoring systems have memory requirements which depend on the nesting depth of executed methods. This makes it impossible to predict the memory overhead caused by the monitoring system, and may crash the monitored system in case of deeply nested method executions, which may occur, e.g., in recursive methods.
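The dependency on nesting depth can be illustrated with a minimal sketch. It assumes a hypothetical monitoring agent that mirrors the call stack in thread-local storage; the class ShadowStackDemo and its methods enter, exit, countdown and peakEntriesFor are illustrative names and are not taken from any of the cited systems.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: monitoring code that keeps one entry per active
// monitored call in thread-local storage. All names are illustrative.
final class ShadowStackDemo {
    // One shadow stack per thread.
    private static final ThreadLocal<Deque<String>> SHADOW =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Largest shadow-stack size observed on the current thread.
    private static final ThreadLocal<Integer> PEAK =
            ThreadLocal.withInitial(() -> 0);

    // Called at method entry: pushes an entry and tracks the peak size.
    static void enter(String method) {
        Deque<String> stack = SHADOW.get();
        stack.push(method);
        PEAK.set(Math.max(PEAK.get(), stack.size()));
    }

    // Called at every method exit: removes the entry pushed at entry.
    static void exit() {
        SHADOW.get().pop();
    }

    // A monitored recursive method: each recursion level adds one entry.
    static int countdown(int n) {
        enter("countdown");
        try {
            return n == 0 ? 0 : 1 + countdown(n - 1);
        } finally {
            exit();
        }
    }

    // Runs the recursion and returns the peak number of monitoring entries.
    static int peakEntriesFor(int depth) {
        PEAK.set(0);
        countdown(depth);
        return PEAK.get();
    }

    public static void main(String[] args) {
        System.out.println(peakEntriesFor(10));
        System.out.println(peakEntriesFor(1000));
    }
}
```

In this sketch the peak number of monitoring entries is always the recursion depth plus one, so the memory used by the monitoring code is bounded only by how deep the monitored application happens to recurse.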
JaViz [1] is a monitoring system developed by IBM which allows tracing of distributed transactions running on Java platforms, using Sun Microsystems' RMI framework for communication. The system modifies existing virtual machines to make them generate tracing data for each executed method, which is written to trace files. Said trace files contain statistical information about local method calls, outbound remote method calls and inbound remote method service requests. The trace files also contain correlation data which can be used to match outbound remote method calls invoked on one virtual machine with the corresponding inbound remote method service request received on another virtual machine. After a distributed transaction has terminated, a merging tool is executed, which evaluates the correlation data stored in the trace files generated by the involved virtual machines. The merging tool generates an overall trace file which describes the whole distributed transaction. The resulting trace file is interpreted by a visualization tool which provides a tree-like view of the transaction. Although JaViz provides useful information for analyzing distributed transactions, the restriction to post-mortem analysis and the relatively complex handling of the different trace files make this approach unsuitable for production environments.
The Application Response Measurement framework (ARM) [2], a standard for monitoring application performance created by Hewlett-Packard and Tivoli WebSites, provides infrastructure for real-time monitoring of distributed transactions. To trace transactions with ARM, calls to ARM methods are inserted at the entry point and at all exit points of each method to be monitored. This requires access to the source code of the monitored application and the ability to rebuild the application after the ARM calls have been added. Accessing the application source is often difficult or even impossible. Additionally, every change to the set of monitored methods requires adapting the application source code and rebuilding the application, which makes this approach inflexible in terms of adapting the monitoring configuration.
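The effort of source-level instrumentation can be seen in a small sketch. The start and stop methods below are hypothetical stand-ins for calls into a monitoring library and are not the actual ARM interface; the class ManualInstrumentationDemo and the method parseQuantity are likewise assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of hand-written monitoring calls in application source.
// start/stop stand in for a monitoring API; they are NOT the real ARM calls.
final class ManualInstrumentationDemo {
    // Recorded monitoring events (not thread-safe; sufficient for a sketch).
    static final List<String> EVENTS = new ArrayList<>();

    static void start(String name) {
        EVENTS.add("start:" + name);
    }

    static void stop(String name, String status) {
        EVENTS.add("stop:" + name + ":" + status);
    }

    // Monitored method with three exit points. The monitoring calls have to
    // be written into the source, so changing which methods are monitored
    // means editing and rebuilding the application.
    static int parseQuantity(String text) {
        start("parseQuantity");
        String status = "failed";
        try {
            if (text == null || text.isEmpty()) {
                status = "good";
                return 0;                        // exit 1: early return
            }
            int value = Integer.parseInt(text);  // exit 2: may throw
            status = "good";
            return value;                        // exit 3: normal return
        } finally {
            stop("parseQuantity", status);       // runs on every exit path
        }
    }

    public static void main(String[] args) {
        parseQuantity("42");
        System.out.println(EVENTS);
    }
}
```

The try/finally block is one way to cover all exit points with a single call. Without it, a separate stop call would be needed before each return statement and in each exception handler.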
The systems described in [3] and [4] combine the ARM framework with bytecode instrumentation, and thus remove the requirement to adapt the application source code to install monitoring code. These systems create a stack data structure in thread-local storage which mirrors the current method call stack. The stack data structure is used to correlate method calls with the method execution sequence performed in the local thread. Memory consumption of said stack data structure grows in proportion to the nesting depth of the instrumented methods and can become a severe problem if the level of said nesting becomes high or unpredictable, as is possible, e.g., in recursive method calls. The systems place instrumentation code at entries and exits of monitored methods. Entry instrumentations create and initialize a record for storing performance data, and exit instrumentations update said record and send it to an instance which analyzes and visualizes the record. This approach keeps network traffic low because only one data record is sent for each monitored method call, but it causes problems in handling blocked or stalled transactions. In case of a blocked or stalled transaction, in the worst case no method is exited and thus no monitoring data of said blocked transaction is generated. If a monitored method activates another thread, either via an explicit thread switch or by invoking a remote method, the systems generate correlation information which identifies the activating method, the thread that executes it and the server which is hosting the virtual machine. Said correlation information is transferred to the activated thread and allows correlating the activated thread with the activating instrumented method, but in case of multiple thread activations performed by one monitored method, the provided correlation information is not sufficient to reconstruct the sequence of said multiple thread activations.
Knowing the sequence of said activations would be very useful to analyze problems caused by race conditions between the activated threads.
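The missing sequence information can be made concrete with a short sketch. It assumes, for illustration only, that the correlation information consists of exactly the three items named above (server, thread and activating method); the classes ParentInfo and ChildTrace and the method activateTwo are hypothetical names, and the thread activations are simulated rather than performed by real threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch: correlation data that identifies only the activating
// method, its thread and its server. All names are illustrative.
final class CorrelationAmbiguityDemo {
    static final class ParentInfo {
        final String server;
        final long threadId;
        final String method;

        ParentInfo(String server, long threadId, String method) {
            this.server = server;
            this.threadId = threadId;
            this.method = method;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ParentInfo)) return false;
            ParentInfo p = (ParentInfo) o;
            return server.equals(p.server) && threadId == p.threadId
                    && method.equals(p.method);
        }

        @Override
        public int hashCode() {
            return Objects.hash(server, threadId, method);
        }
    }

    // Trace fragment reported by an activated thread.
    static final class ChildTrace {
        final String name;
        final ParentInfo parent;

        ChildTrace(String name, ParentInfo parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    // Simulates a monitored method that activates thread "A" first and
    // thread "B" second. The fragments reach the analyzer in a different
    // order, here with B arriving before A.
    static List<ChildTrace> activateTwo(ParentInfo self) {
        ChildTrace a = new ChildTrace("A", self);
        ChildTrace b = new ChildTrace("B", self);
        List<ChildTrace> arrived = new ArrayList<>();
        arrived.add(b);
        arrived.add(a);
        return arrived;
    }

    // True if the correlation data of both fragments is identical, i.e. the
    // data itself gives no hint which activation happened first.
    static boolean indistinguishable(ChildTrace x, ChildTrace y) {
        return x.parent.equals(y.parent);
    }

    // Names of the fragments in the order in which they arrived.
    static List<String> arrivalOrder(List<ChildTrace> traces) {
        List<String> names = new ArrayList<>();
        for (ChildTrace t : traces) names.add(t.name);
        return names;
    }
}
```

Both fragments can be attached to the same parent, which recovers the parent-child relationship. Nothing in the data, however, tells the analyzer that A was activated before B, and the arrival order of the fragments is no reliable substitute.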
The system described in [5] provides tracing mechanisms which can be dynamically enabled and disabled. The system uses bytecode instrumentation to place entry interceptors and internal interceptors in the monitored application. If a transaction invokes an entry interceptor, the entry interceptor first evaluates a set of rules to decide if the transaction should be traced and initiates tracing according to the result of the rule evaluation. An interceptor consists of monitoring code placed at the entry and at each exit of instrumented methods. Interceptors produce and send measurement data when executing the code placed at method exits. This leads to the same problems with blocked transactions as described above. The system allows tracing transactions which span multiple servers, but it uses timestamp information for correlation, and thus requires synchronized clocks at the involved servers, a requirement that is often hard to fulfill.
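Why timestamp-based correlation depends on synchronized clocks can be shown with a small simulation. The sketch below is not the mechanism of the system in [5]; the class ClockSkewDemo, the 5 ms network latency and the skew values are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: ordering events from two servers by their locally
// recorded timestamps. All names and numbers are illustrative.
final class ClockSkewDemo {
    static final class Event {
        final String description;
        final long localTimestampMs;

        Event(String description, long localTimestampMs) {
            this.description = description;
            this.localTimestampMs = localTimestampMs;
        }
    }

    // Returns the event descriptions sorted by local timestamp.
    static List<String> orderByTimestamp(List<Event> events) {
        List<Event> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong((Event e) -> e.localTimestampMs));
        List<String> result = new ArrayList<>();
        for (Event e : sorted) result.add(e.description);
        return result;
    }

    // Server A sends a request at true time 1000 ms; server B receives it
    // 5 ms later. Server B's clock deviates from A's by serverBSkewMs.
    static List<String> simulate(long serverBSkewMs) {
        long trueSendMs = 1_000;
        long trueReceiveMs = 1_005;
        List<Event> events = new ArrayList<>();
        events.add(new Event("A sends request", trueSendMs));
        events.add(new Event("B receives request", trueReceiveMs + serverBSkewMs));
        return orderByTimestamp(events);
    }

    public static void main(String[] args) {
        System.out.println(simulate(0));    // clocks synchronized
        System.out.println(simulate(-50));  // B's clock runs 50 ms behind
    }
}
```

With synchronized clocks the send event sorts before the receive event. With server B's clock 50 ms behind, the receive event is stamped 955 ms and sorts before the send event, so the reconstructed order contradicts the actual causal order.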
The system described in [6] also aims to trace distributed transactions, but, like the approaches discussed before, it does not address blocked transactions. Additionally, it uses timestamp data for correlation and thus requires synchronized clocks.
Consequently, there is a need for a monitoring system that allows tracing of distributed end-to-end transactions and that overcomes the shortcomings of currently existing approaches.