As the complexity of computer systems and networks of computer systems increases, tracing and resolving problems becomes more difficult and time-consuming. This is especially true in large distributed systems where multiple computer programs run concurrently on multiple computer systems.
Typically, experienced software developers are employed to monitor each of these systems and to combine the individual analyses in order to obtain a coherent, global view of the operation of the distributed data processing system.
Under current methodologies this is a largely manual and labor-intensive process that requires specialized skills in each of the various computer operating environments that make up the distributed system. Furthermore, the inputs to the analysis, such as event and message tracing data, are not in common formats across the various systems. These factors combine to make correlating the disparate data traces into a coherent model of the operation of the distributed data processing system a tedious, error-prone, slow and costly process.
In addition, traditional error diagnosis processes typically employ either a debugger, which is intrusive, or an embedded error logging facility, which normally requires modifications to the source code.
The deficiencies of the prior-art approach to problem identification and resolution have become more prominent as large-scale distributed business enterprise systems have been developed. In such systems a plurality of different applications, running on different hosts and under different operating systems, cooperate via message passing techniques to process input data related to independent and asynchronous transactions. A type of management software known as “middleware” has been developed to control and manage the message flow and processing, and employs message queues to temporally isolate the various applications from one another. In such a system several thousand transactions may be in process simultaneously, resulting in corresponding thousands of Application Program Interface (API) calls and messages being concurrently generated and routed through the system.
As can be appreciated, identifying the cause of a failure or error condition occurring in one or a few of these transactions can be very complex and time-consuming and, because of the significant amount of human operator analysis required, error-prone.