Monitoring systems capable of monitoring and tracing individual distributed transactions, and of providing data describing the internal execution details, performance and behavior of each monitored transaction, have gained wide popularity amongst application operators. The reason for this popularity is the fine-grained level of information provided by those systems, which allows application operators to pinpoint the root causes of detected behavior or performance problems, even if they affect only a small number of monitored transactions.
Typically, such monitoring systems deploy agents to processes involved in the execution of monitored transactions. Those agents identify portions of distributed transactions executed on the process and capture execution details of those portions, such as data describing individual method executions. To allow the identification and correlation of portions of distributed transactions performed by different threads on different processes or computer systems, the deployed agents also monitor incoming and outgoing communication performed by the processes they are deployed to, attach correlation data to outgoing communication data and read correlation data from incoming communication data. This correlation data, passed along with the communication data, allows a correlation process to identify and correlate corresponding trace data describing communicating parts of a distributed transaction and to reconstruct end-to-end transaction trace data describing the execution of the distributed transaction.
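The tagging of communication data described above can be illustrated with a minimal sketch. It is not taken from any particular monitoring product; the class names `Agent` and `TraceFragment`, the header key `x-correlation-parent` and the process names are hypothetical and chosen only for illustration. The sketch assumes that correlation data consists of the identifier of the sending trace fragment.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical header name used to carry correlation data with a message.
CORRELATION_KEY = "x-correlation-parent"


@dataclass
class TraceFragment:
    """Trace data for the portion of a transaction executed on one process."""
    fragment_id: str
    process: str
    parent_fragment_id: Optional[str]
    methods: List[str] = field(default_factory=list)


class Agent:
    """Illustrative agent deployed to a single monitored process."""

    def __init__(self, process: str) -> None:
        self.process = process
        self.fragments: List[TraceFragment] = []

    def start_fragment(self, incoming_headers: Dict[str, str]) -> TraceFragment:
        # Read correlation data from incoming communication data, if present.
        parent = incoming_headers.get(CORRELATION_KEY)
        fragment = TraceFragment(uuid.uuid4().hex, self.process, parent)
        self.fragments.append(fragment)
        return fragment

    def tag_outgoing(self, fragment: TraceFragment,
                     headers: Dict[str, str]) -> Dict[str, str]:
        # Attach correlation data to a copy of the outgoing communication data.
        tagged = dict(headers)
        tagged[CORRELATION_KEY] = fragment.fragment_id
        return tagged


# Usage: a request travels from a "frontend" process to a "backend" process.
frontend = Agent("frontend")
backend = Agent("backend")

root = frontend.start_fragment({})
root.methods.append("handleRequest")
request_headers = frontend.tag_outgoing(root, {"content-type": "text/plain"})

child = backend.start_fragment(request_headers)
child.methods.append("queryDatabase")
```

In this sketch, the fragment recorded on the backend carries the identifier of the frontend fragment as its parent, which is the information a correlation process would later use to link the two fragments.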
The agents create and send their transaction trace and monitoring data to a central correlation server, which operates a correlation process that analyses the transaction trace data fragments and combines them into individual end-to-end transaction traces.
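As a rough illustration of such a correlation process, the following sketch combines trace data fragments into end-to-end traces by following parent references. The dictionary layout (`id`, `parent`, `process`) and the function name `correlate` are assumptions made for this example only; a real correlation process would additionally have to deal with late, missing or duplicate fragments.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Fragment = Dict[str, Optional[str]]


def correlate(fragments: List[Fragment]) -> List[dict]:
    """Combine trace data fragments into nested end-to-end traces.

    Each fragment is a dict with the hypothetical keys 'id', 'parent'
    (id of the calling fragment, or None for an entry point) and 'process'.
    """
    children = defaultdict(list)
    roots = []
    for fragment in fragments:
        if fragment["parent"] is None:
            roots.append(fragment)
        else:
            children[fragment["parent"]].append(fragment)

    def build(fragment: Fragment) -> dict:
        # Recursively attach the fragments that were called by this fragment.
        return {
            "id": fragment["id"],
            "process": fragment["process"],
            "calls": [build(c) for c in children.get(fragment["id"], [])],
        }

    return [build(root) for root in roots]


# Fragments may arrive in any order and from any agent.
fragments = [
    {"id": "b", "parent": "a", "process": "backend"},
    {"id": "a", "parent": None, "process": "frontend"},
    {"id": "c", "parent": "b", "process": "database"},
]
traces = correlate(fragments)
```

The sketch only works because all three fragments are available in one place, which is exactly the property that a single central correlation server provides.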
As each agent runs separately and independently from every other agent, and uses only a portion of the processing resources of the process it is deployed to, there is no limiting factor for the number of monitored processes on the agent side. The central correlation server, however, which has to receive and process all tracing data from the connected agents, quickly becomes a bottleneck. For larger application monitoring setups with a high number of monitored processes and a high transaction load, the processing and memory requirements of such a centralized correlation process quickly become unrealizable, either in terms of the financial resources needed to provide adequate hardware, or even due to the technical impossibility of fulfilling those hardware requirements.
Distributing the correlation load to a set of correlation servers that process the received transaction trace data in parallel would remove this bottleneck and would allow such transaction monitoring systems to scale better with the number of monitored processes and transactions.
However, each transaction trace data fragment provided by an agent describes a portion of a transaction executed by one process, which needs to be correlated with corresponding portions executed by other processes and provided by other agents. This kind of data does not allow a static, agent-based segmentation of the correlation processing load without causing undesired cross communication between the correlation servers in the cluster. Theoretically, portions of individual distributed transactions may be executed on any monitored process. Consequently, trace data fragments describing those transaction portions may be provided by any agent. Therefore, transaction trace data fragments from all agents may potentially be required to create end-to-end transaction trace data. In a distributed correlation process executed by a set of correlation servers connected by a computer network and forming a correlation cluster, each correlation server receives only a subset of the transaction trace data fragments. As a consequence, correlation servers would need to communicate with other correlation servers to request missing trace data fragments, as transaction trace data fragments from one distributed transaction may be sent to different correlation servers. This causes undesired network communication between the correlation servers, which slows down the correlation process and also requires a high amount of network bandwidth. In the worst case, adding a correlation server to the cluster may exponentially increase the network bandwidth usage.
If, for example, each agent were assigned to a specific correlation server in the cluster, each agent would provide the transaction trace data fragments of transaction portions executed on the process to which it is deployed to the correlation server it is connected to. To complete the transaction trace data fragments received by one correlation server into end-to-end transaction traces, that correlation server would constantly need to request corresponding transaction trace data fragments from its peer correlation servers in the cluster. This would result in a large amount of network traffic between the correlation servers, which would quickly become another bottleneck for the correlation process.
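The effect of such a static, agent-based assignment can be made concrete with a small sketch. The agent names, the example transactions and the round-robin assignment are all hypothetical and serve only to illustrate the problem; the sketch counts how many transactions end up with fragments on more than one correlation server and would therefore require communication between servers.

```python
from typing import Dict, List, Set


def static_assignment(agents: List[str], server_count: int) -> Dict[str, int]:
    """Assign each agent to one correlation server in round-robin order."""
    return {agent: index % server_count for index, agent in enumerate(agents)}


def servers_touched(transaction: List[str], assignment: Dict[str, int]) -> Set[int]:
    """Servers that receive at least one fragment of the given transaction."""
    return {assignment[agent] for agent in transaction}


def count_split(transactions: List[List[str]], assignment: Dict[str, int]) -> int:
    """Number of transactions whose fragments land on more than one server."""
    return sum(1 for t in transactions if len(servers_touched(t, assignment)) > 1)


# Hypothetical agents, named after the process each one is deployed to.
agents = ["web-1", "web-2", "app-1", "app-2", "db-1"]

# Each transaction is listed as the agents that report a fragment for it.
transactions = [
    ["web-1", "app-1", "db-1"],
    ["web-2", "app-2", "db-1"],
    ["web-1", "app-2"],
    ["web-2"],
]

single_server = static_assignment(agents, 1)
two_servers = static_assignment(agents, 2)

split_on_one = count_split(transactions, single_server)  # 0
split_on_two = count_split(transactions, two_servers)    # 2
```

With a single server no transaction is split, whereas with two statically assigned servers two of the four example transactions already have fragments on both servers and could not be completed without requesting data from the peer server.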
Consequently, there is a need in the art for a system and method that allow clustered correlation of transaction trace data received from independently operating agents and that require no or only a minimum of communication between the correlation servers forming the cluster.
This section provides background information related to the present disclosure which is not necessarily prior art.