This invention relates to the field of troubleshooting application software problems, analyzing application operations, and gaining insight into an application environment in a distributed data processing system with cooperating applications, and specifically to an efficient method for tracing the path of application execution in such a complex software environment.
With the advent of packaged software applications, such as Web servers, database servers, and application servers, it has become easier and quicker to put together an application to serve a specific data processing need, as work done by others can be readily leveraged and best-of-breed applications can be utilized. As the use of computer networks becomes widespread, interacting and cooperating software applications can be distributed more easily across all types of networks. Concurrently, improvements in software integration technology, such as the latest Web Services technology, enable all kinds of software applications, modern and legacy alike, to be integrated into an application environment without a great deal of effort. Together, these technical factors result in an increasingly complex application environment in which cooperating software applications or application components are distributed over a multitude of computers that are in turn distributed over a large geographic area.
As the complexity of computer networks and application environments grows, it becomes increasingly difficult to understand the operational behavior of the application environment and to troubleshoot the functional, performance, availability, and security problems that turn up when many applications are integrated. Specifically, it is difficult to trace the execution path (that is, all applications or application components involved in a software task) through many applications in the distributed data processing environment. The difficulty shows up in all phases of a system's lifecycle, including development, integration, testing, and production. It is particularly acute when a transient production problem is encountered.
Many methods and procedures have been designed to help with this analysis and troubleshooting need, and many products are sold in the market to address various aspects of it. Most approaches suffer from two drawbacks. First, they collect a tremendous amount of measurement and monitoring data, and in so doing they consume a high percentage of the computer processing, storage, and communications resources in the data processing system. Second, pinpointing the actual execution path or identifying the root cause of a detected problem takes a relatively long time, usually requiring a time-consuming manual effort to pore over the massive collected data in order to discover relevant data and to relate pieces of information to one another.
Log files are a simple and common approach to obtaining operational application data. Most applications write to their log files on a continuous basis, so the files contain massive amounts of data. In addition, log file formats vary widely across applications, making it a big challenge even to relate information from different log files. While voluminous log files consume a high percentage of system resources, their value for quickly pinpointing applications or components thereof, or for locating problem sources, is marginal.
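To illustrate the difficulty of relating entries across log formats, the following minimal Python sketch parses two log lines into a common (timestamp, message) form. Both sample lines, the two formats, the function names `parse_web` and `parse_db`, and the assumption that the second log records UTC times are hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
import re
from datetime import datetime, timezone

# Two hypothetical log lines about the same request, written by different
# applications in different formats (both formats are assumptions).
WEB_LINE = '10.0.0.5 - - [12/Mar/2004:10:15:32 +0000] "GET /orders/42 HTTP/1.1" 500 128'
DB_LINE = "2004-03-12 10:15:32,417 ERROR query timeout on table orders"

def parse_web(line):
    """Extract a UTC timestamp and a short message from an access-log style line."""
    m = re.search(r'\[([^\]]+)\] "([^"]+)" (\d{3})', line)
    ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
    return ts.astimezone(timezone.utc), f"{m.group(2)} -> {m.group(3)}"

def parse_db(line):
    """Extract a timestamp and message; this log is assumed to record UTC times."""
    m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\w+) (.*)", line)
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S").replace(
        microsecond=int(m.group(2)) * 1000, tzinfo=timezone.utc
    )
    return ts, f"{m.group(3)}: {m.group(4)}"

# Only after each format has its own parser can the entries be ordered together.
events = sorted([parse_web(WEB_LINE), parse_db(DB_LINE)])
for ts, message in events:
    print(ts.isoformat(), message)
```

Each additional application typically needs its own parser of this kind, which is part of why correlating log files by hand scales poorly.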
Another technique is software instrumentation, in which existing programs are modified in order to collect additional data during a program's execution. The basic method of program instrumentation is to insert program code at various points in the original program so that it is executed together with the original program code. Instrumentation may be done at the program source code level, at the object code level, in software libraries, or at the executable program level. To use software instrumentation to track down the source of a problem, one may instrument selected points in the software application in the hope that the collected data will lead to the root cause. This is not easy, as it requires an analyst to guess correctly in advance where the likely causes lie. To avoid missing critical points of interest, the analyst may instead turn on instrumentation more indiscriminately at many points. That approach has the same limitations as log files, as it generally results in an enormous amount of collected data. The massive data requires tedious manual analysis to produce useful information, and its collection consumes a high percentage of system resources.
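As a rough illustration of source-level instrumentation, the following Python sketch wraps a function so that inserted data-collection code runs together with the original code. The names `instrument`, `TRACE_LOG`, and `lookup_order` are hypothetical, and the in-memory list stands in for whatever sink a real system would use; this is one simple way to insert code at a chosen point, not the method of any particular tool.

```python
import functools
import time

# Hypothetical in-memory sink for the collected data; a real system would
# more likely write to a file or send records to a monitoring service.
TRACE_LOG = []

def instrument(func):
    """Return a wrapper that records the name and duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            # The original program code runs unchanged.
            return func(*args, **kwargs)
        finally:
            # The inserted code runs on every call, even if func raises.
            elapsed = time.perf_counter() - start
            TRACE_LOG.append((func.__name__, elapsed))
    return wrapper

@instrument
def lookup_order(order_id):
    # Stand-in for original application logic.
    return {"order_id": order_id, "status": "shipped"}

result = lookup_order(42)
print(result, TRACE_LOG)
```

Every instrumented point adds a record per call, which shows how instrumenting many points indiscriminately quickly produces the data volume described above.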
One form of software instrumentation is profiling. With profiling, one can determine which parts of a program run, how often they run, and how much time is spent in each. The information a profiler collects generally includes CPU usage, memory allocation, method calls, and various timestamps on method calls. Profiler information can generally be used to identify performance bottlenecks. But profilers typically generate even more data than log files, and are normally used for a single application or components thereof. They are ill-suited to global use across many applications in a distributed data processing environment, and their overhead makes them too slow for use in a production environment.
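For a concrete sense of what a profiler reports, the following sketch uses Python's standard `cProfile` and `pstats` modules on two small functions. The functions `parse_request` and `handle_request` are hypothetical stand-ins for application code, and the workload sizes are arbitrary assumptions.

```python
import cProfile
import io
import pstats

def parse_request(n):
    # Stand-in for CPU-bound application work.
    return sum(i * i for i in range(n))

def handle_request():
    # Calls parse_request repeatedly so that call counts appear in the report.
    return [parse_request(1000) for _ in range(50)]

profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Render the collected statistics (call counts, total and cumulative time)
# into a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

Even this tiny program yields a row for every function, generator expression, and built-in that was called, which hints at the volume of data a profiler produces for a full application.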
Another technique for collecting application information is to extend network management systems based on SNMP (Simple Network Management Protocol) to cover software applications. SNMP is a simple protocol designed for managing device attributes and connectivity of network elements, and it supports only a limited number of data types. As such, the SNMP-based network management model is unsuitable for software applications, as it lacks the capability to model complex relationships among applications.
Some APIs (Application Programming Interfaces) have been designed to enable application programs to pass application data to an SNMP network management system. Notable examples include ARM (Application Response Measurement) by HP and Tivoli, and JMX (Java Management Extensions) of the Java J2EE application server platform. But the API technique is still limited by the network element model of SNMP and thus provides no direct means for pinpointing applications or for identifying application-level problem sources.
U.S. Pat. No. 6,108,700, entitled “Application end-to-end response time measurement and decomposition”, describes an elaborate method for measuring the response time of an end-user request and decomposing the overall response time into segments representing contributions from participating applications in a distributed data processing environment. While the method enables the identification of transaction components that introduce delays or faults, its use will likely incur significant system resource overhead due to its complexity.
Thus there is a need for an efficient method that provides direct information on the operation of software applications and on the location of problems within them, while incurring minimal system overhead.