1. Field of the Invention
The present invention relates generally to the data processing field and, more particularly, to a method and apparatus for identifying the cause of a transaction response time problem in a distributed computing system.
2. Description of the Related Art
Distributed computing is a type of computing in which a plurality of separate computing entities interconnected by a telecommunications network operate concurrently to run a single transaction in a transparent and coherent manner, so that the plurality of entities appear as a single, centralized system. In distributed computing, a central server divides a single transaction into a plurality of work packages which are passed to a plurality of subsystems. Each subsystem performs the particular sub-transaction detailed in the work package passed to it, and when it is finished, the completed work package is passed back to the central server. The central server compiles the completed work packages from the various subsystems and presents the results of the transaction to an end user.
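The division of a transaction into work packages described above can be illustrated with a minimal sketch. It assumes a thread pool stands in for the network of subsystems, and the names run_subsystem and run_transaction, the round-robin split, and the summing workload are hypothetical choices made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subsystem(name, work_package):
    # Hypothetical subsystem: performs its sub-transaction (here, a sum)
    # and returns the completed work package.
    return {"subsystem": name, "result": sum(work_package)}


def run_transaction(items, subsystems):
    # Central server: divide the transaction into one work package per
    # subsystem (round-robin split, an assumption for this sketch).
    packages = {
        name: items[i::len(subsystems)] for i, name in enumerate(subsystems)
    }
    # Pass the work packages to the subsystems, which operate concurrently.
    with ThreadPoolExecutor(max_workers=len(subsystems)) as pool:
        futures = [
            pool.submit(run_subsystem, name, package)
            for name, package in packages.items()
        ]
        completed = [future.result() for future in futures]
    # Compile the completed work packages into a single result for the user.
    return sum(package["result"] for package in completed)
```

A caller sees only the single compiled result, for example run_transaction(list(range(10)), ["a", "b", "c"]), and not the individual subsystems that produced it.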
In a distributed computing system, it is important to monitor the operation of each subsystem so that the root cause of any transaction response time problem that may occur can be detected and identified. A current technique for identifying the root cause of a transaction response time problem is to attach an Application Response Measurement (ARM) correlator to the transaction so that response time information can be gathered at each subsystem and then correlated at the central server.
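The role of a correlator can be sketched as follows. This is not the ARM API; the Correlator class, the timed_call and correlate functions, and the record fields are hypothetical names assumed for illustration. The sketch only shows the idea of a token that travels with the transaction so that timings gathered at separate subsystems can later be grouped by transaction.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Correlator:
    # Hypothetical token attached to a transaction; not the ARM format.
    transaction_id: str
    parent: str


def timed_call(correlator, subsystem, work, records):
    # Run one sub-transaction and record its response time together with
    # the identifiers carried by the correlator.
    start = time.perf_counter()
    result = work()
    elapsed = time.perf_counter() - start
    records.append({
        "transaction_id": correlator.transaction_id,
        "parent": correlator.parent,
        "subsystem": subsystem,
        "seconds": elapsed,
    })
    return result


def correlate(records):
    # Central server: group per-subsystem timings by transaction identifier.
    by_transaction = {}
    for record in records:
        timings = by_transaction.setdefault(record["transaction_id"], {})
        timings[record["subsystem"]] = record["seconds"]
    return by_transaction
```

In this sketch the records list stands in for the data that each subsystem would store locally and later send to the central server.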
In order to reduce the amount of data that is stored locally at each subsystem and sent over the network to the central server for correlation, the data collected on a subsystem for each run of a transaction is aggregated over a one-hour period. The locally stored aggregated data is sent to the central server on an hourly basis and, after being sent, is normally discarded at the subsystem. Upon completion of a transaction, if a monitor located on the monitored server where the transaction originated determines that the transaction exceeded a response time threshold, the monitor turns on a flag in the ARM correlator so that instance data is saved for subsequent runs of the transaction (Second Failure Data Capture). This instance data is also needed to perform a Root-Cause Analysis. A Root-Cause Analysis cannot be performed on aggregate data alone because aggregate data is too coarse-grained and, thus, may hide the problem. The Root-Cause Analysis must instead be performed using both the aggregate data and the instance data of the specific transaction in question. The instance data is compared to an average of the aggregate data to determine the sub-transaction that is operating outside the norm represented in the aggregate data.
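The comparison of instance data against an average of the aggregate data can be sketched as follows. The HourlyAggregate class and find_outlier function are hypothetical names, and the selection rule (the largest ratio of instance time to average time) is an assumption; the description above states only that the instance data is compared to an average.

```python
from collections import defaultdict


class HourlyAggregate:
    # Keeps only a count and a running total per sub-transaction, which is
    # why individual slow runs are no longer visible in the aggregate.
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def record(self, sub_transaction, seconds):
        self.count[sub_transaction] += 1
        self.total[sub_transaction] += seconds

    def average(self, sub_transaction):
        return self.total[sub_transaction] / self.count[sub_transaction]


def find_outlier(aggregate, instance):
    # Compare each sub-transaction time in one instance with its average
    # and return the one that deviates most (assumed rule: largest ratio).
    ratios = {}
    for name, seconds in instance.items():
        if aggregate.count[name] and aggregate.average(name) > 0:
            ratios[name] = seconds / aggregate.average(name)
    return max(ratios, key=ratios.get) if ratios else None
```

For example, if the aggregate holds averages of 0.1 seconds for a "web" sub-transaction and 0.2 seconds for a "db" sub-transaction, an instance with times of 0.11 and 1.0 seconds respectively would point to "db" as the sub-transaction operating outside the norm.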
There are several drawbacks to current techniques for identifying the root cause of a transaction response time problem in a distributed computing system. For one, an aggregate view of the transaction path may not isolate the subsystem where the problem is occurring if the problem is sporadic in nature. In addition, collecting subsequent instances of a transaction run may or may not identify performance problems having the same root cause. In some cases, for example, a transaction may be initiated by a different user from a different location, or may contain different parameters, all of which can impact the outcome of the transaction. Additionally, today's web server environments are often clustered and load balanced, and, as a result, a transaction may not take the same path on subsequent runs as it did during the actual failure. If a specific transaction path has a problem, but the transaction takes a different path the next time it is executed, the monitoring product would incorrectly determine that the problem has corrected itself when, in fact, the problem will resurface once the transaction again takes the original path.
Another drawback to current techniques for identifying the root cause of a transaction response time problem in a distributed computing system is that current techniques rely on the user of the monitoring product to analyze the data of the aggregate transaction and the subsequent instances to identify the source of the problem. The event that is sent to the user does not itself give an indication of the cause of the problem because at the time of the event, it is not known which subsystem caused the overall transaction problem.
There is, accordingly, a need for an improved method and apparatus for identifying the cause of a transaction response time problem in a distributed computing system.