Historically, customer service relied on a well-trained staff who would listen, answer questions, and help customers make decisions. Today, companies expose their data, i.e., their products and services, to customers in innovative ways through websites that let customers serve themselves. Enterprise IT infrastructure manages the data and delivers the desired products and services to the customers.
In order to monitor the performance of such IT infrastructure, individual “endpoints” (servers, storage and network devices, and applications) are monitored. The “health” of these endpoints is tracked in terms of metrics such as abnormal resource utilization and response time. Monitoring individual endpoints usually catches systemic problems, but situations arise in which all endpoints appear to be running well and yet customers complain about an end-user application. These complaints may show no obvious statistical pattern pointing to a particular application or device. In such cases, one must comb through logs and other monitoring data in detail and follow the performance and other characteristics of each unsatisfactory customer transaction as it makes its way through the various applications in the system, hoping that helpful trends will emerge.
If each transaction had a unique identifier, and if that identifier were recorded in all system and application logs and other monitoring data, then tracking a transaction instance through the various applications in the system would be easy. This, however, is rarely the case. Instead, each portion of an application uses its own identifier for the transactions it serves. The key, then, is to locate the footprints of a given transaction instance (i.e., the log records attributable to that transaction instance) as it travels through the system without relying on a unique identifier.
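The footprint-matching problem can be illustrated with a minimal sketch. The log records, field names, and timestamps below are hypothetical; the point is only that each component assigns its own local identifier, so records from adjacent components must be paired by other evidence, here timing alone:

```python
from datetime import datetime, timedelta

# Hypothetical log records from two adjacent components; each component
# uses its own local_id scheme, so ids alone cannot link the records.
web_log = [
    {"local_id": "w1", "ts": datetime(2024, 1, 1, 10, 0, 0)},
    {"local_id": "w2", "ts": datetime(2024, 1, 1, 10, 0, 5)},
]
app_log = [
    {"local_id": "a9", "ts": datetime(2024, 1, 1, 10, 0, 1)},
    {"local_id": "a7", "ts": datetime(2024, 1, 1, 10, 0, 6)},
]

def match_footprints(upstream, downstream, max_gap=timedelta(seconds=3)):
    """Pair each upstream record with the earliest unmatched downstream
    record that follows it within max_gap."""
    pairs, used = [], set()
    for u in upstream:
        for d in downstream:
            if d["local_id"] in used:
                continue
            gap = d["ts"] - u["ts"]
            if timedelta(0) <= gap <= max_gap:
                pairs.append((u["local_id"], d["local_id"]))
                used.add(d["local_id"])
                break
    return pairs

print(match_footprints(web_log, app_log))  # [('w1', 'a9'), ('w2', 'a7')]
```

With sparse traffic, timing alone often suffices; under load, many downstream records fall inside the same window, which is exactly the ambiguity the rest of this discussion addresses.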
An alternative to a system-wide unique transaction identifier is link instrumentation, in which additional instrumentation is retrofitted between successive applications along the path of a transaction so that transaction footprints in one application lead to the footprints in the next application on the path. If every link were so instrumented, then, starting from the transaction's entry point into the system, its footprints could be located in every application it traverses. In reality, however, one must budget and decide which links to instrument and which to leave to manual matching of footprints.
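A minimal sketch of link instrumentation, under the assumption that a shim can be retrofitted at the hand-off point between two components (all names here are illustrative): the shim records which upstream footprint handed off to which downstream footprint, so the two components' local identifiers become linkable without changing either component's own logging.

```python
# Mapping produced by the retrofitted link instrumentation.
link_log = []

def instrumented_handoff(upstream_id, downstream_handler, payload):
    """Shim placed on the link between two components: invoke the
    downstream component, then record the (upstream, downstream)
    local-id pair so footprints on either side can be joined later."""
    downstream_id = downstream_handler(payload)
    link_log.append((upstream_id, downstream_id))
    return downstream_id

# A stand-in downstream component that assigns its own transaction ids.
_counter = 0
def app_server(payload):
    global _counter
    _counter += 1
    return f"app-{_counter}"

instrumented_handoff("web-42", app_server, {"action": "checkout"})
print(link_log)  # [('web-42', 'app-1')]
```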
When a component (e.g., a web server, application server, Lightweight Directory Access Protocol (LDAP) server, or authentication server) of a distributed application (e.g., shopping sessions in an e-business, or new identification (ID) creation processes in an enterprise) processes a transaction, that component typically generates at least one log record indicating the status of the transaction. For example, an authentication server generates a single record per transaction, indicating access denied or access permitted.
Unfortunately, transaction monitoring in a distributed application often cannot be implemented directly, because the application does not maintain a unique transaction ID through all of its components and include it in all log records. In the absence of such an ID, monitoring a given transaction involves probabilistic calculations based on information such as overall traffic-flow patterns and aggregate statistics of the time spent in each state.
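One form such a probabilistic calculation can take is sketched below, under the assumption that the time a transaction spends in a component follows a known aggregate distribution (a normal distribution here; the mean, deviation, and candidate records are made up for illustration). Each candidate downstream record is scored by how probable the observed inter-component delay is:

```python
import math

MEAN_DELAY = 2.0   # average seconds spent in the upstream component
STD_DELAY = 0.5    # spread of that service time (illustrative figures)

def delay_likelihood(delay, mean=MEAN_DELAY, std=STD_DELAY):
    """Normal pdf of the observed inter-component delay."""
    z = (delay - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def best_candidate(upstream_ts, candidates):
    """candidates: list of (local_id, timestamp) downstream records.
    Pick the record whose delay is most probable under the aggregate
    time-in-state statistics."""
    return max(candidates, key=lambda c: delay_likelihood(c[1] - upstream_ts))

# Delay 1.9 s is near the 2.0 s mean; delay 3.5 s is three deviations out.
print(best_candidate(10.0, [("a1", 11.9), ("a2", 13.5)]))
```

Because the match is only the most probable one rather than a certain one, the resulting transaction trace carries exactly the looseness described next.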
As a result, the estimate of a transaction's progress through a distributed application can be quite loose, and the usefulness of that estimate for purposes such as debugging a transaction is limited. To improve such estimates, components of the application are often retrofitted with instrumentation that makes a unique transaction ID appear in the logs of the instrumented components. However, retrofitting existing applications with instrumentation is both costly and error-prone, since it requires modifying existing systems. Given these issues, it is desirable to have methods that guide the selective instrumentation procedure so as to balance the costs and risks of instrumentation against monitoring performance.
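One simple way such a selection procedure could work is a greedy heuristic: given a cost for instrumenting each link and a score for how much footprint-matching ambiguity that link causes, instrument the links with the highest ambiguity removed per unit cost until the budget is exhausted. The link names, costs, ambiguity scores, and the greedy rule itself are all hypothetical, offered only to make the cost/benefit trade-off concrete:

```python
# Hypothetical links in a three-tier application, each with an
# instrumentation cost and an ambiguity score (how hard its footprints
# are to match manually).
links = [
    {"name": "web->app",  "cost": 3.0, "ambiguity": 9.0},
    {"name": "app->ldap", "cost": 1.0, "ambiguity": 2.0},
    {"name": "app->db",   "cost": 2.0, "ambiguity": 8.0},
]

def select_links(links, budget):
    """Greedily instrument links in order of ambiguity reduced per unit
    cost, skipping any link that would exceed the remaining budget."""
    chosen, spent = [], 0.0
    for link in sorted(links, key=lambda l: l["ambiguity"] / l["cost"],
                       reverse=True):
        if spent + link["cost"] <= budget:
            chosen.append(link["name"])
            spent += link["cost"]
    return chosen

print(select_links(links, budget=4.0))  # ['app->db', 'app->ldap']
```

With a budget of 4.0, the heuristic takes app->db (ratio 4.0) and app->ldap (ratio 2.0), skipping web->app because its cost of 3.0 no longer fits; the uninstrumented link is left to manual matching.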