With respect to problem diagnosis, it is generally a goal in distributed systems to avoid disastrous scenarios through prompt execution of remedial actions. For example, in IP (Internet Protocol) network management, one would like to quickly identify which router or link has a problem when a failure or performance degradation occurs in the network. In the e-commerce (electronic commerce) context, an objective may be to trace the root cause of unsuccessful or slow user transactions (e.g., purchase requests sent through a web server) in order to identify whether the problem is a network problem, a web or back-end database server problem, etc. Another example is monitoring, diagnosis and prediction of the health of a large cluster system containing hundreds or thousands of workstations performing distributed computations (e.g., Linux clusters or GRID-computing systems).
One approach to problem diagnosis in distributed computing systems and networks utilizes “probes.” It is known that a probe is an end-to-end transaction (e.g., a ping or traceroute command, an e-mail message, a web-page access request, or an e-business transaction) sent from a probing workstation to another component of a distributed system in order to test a particular service (e.g., IP connectivity, database access or web access). A probe returns a set of measurements, such as response times and status codes (OK/not OK), and is often used to test compliance with service-level agreements (SLAs).
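By way of illustration only, a probe of the kind just described can be sketched in Python as a timed end-to-end transaction that yields a status (OK/not OK) and a response time. The names `ProbeResult`, `run_probe`, `meets_sla` and `tcp_connect`, as well as the SLA threshold parameter, are hypothetical and are not taken from any of the tools cited herein; the sketch assumes that a transaction signals failure by raising an exception.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    """Measurements returned by one probe (hypothetical structure)."""
    name: str
    ok: bool                 # status code: OK / not OK
    response_time_s: float   # elapsed wall-clock time in seconds


def run_probe(name: str, transaction: Callable[[], None]) -> ProbeResult:
    """Execute one end-to-end transaction and record status and response time.

    Assumption: the transaction raises an exception when the tested
    service is unavailable or misbehaves.
    """
    start = time.perf_counter()
    try:
        transaction()
        ok = True
    except Exception:
        ok = False
    elapsed = time.perf_counter() - start
    return ProbeResult(name=name, ok=ok, response_time_s=elapsed)


def meets_sla(result: ProbeResult, max_response_time_s: float) -> bool:
    """Check a probe result against an assumed response-time threshold."""
    return result.ok and result.response_time_s <= max_response_time_s


def tcp_connect(host: str, port: int, timeout_s: float = 2.0) -> Callable[[], None]:
    """Build a simple connectivity transaction: open and close a TCP connection."""
    def transaction() -> None:
        with socket.create_connection((host, port), timeout=timeout_s):
            pass
    return transaction
```

For example, `run_probe("web-access", tcp_connect("www.example.com", 80))` would test web connectivity from the probing workstation, and `meets_sla` could then compare the result against an agreed response-time bound.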
Probing technology has been used mainly for measuring compliance with an SLA (e.g., IBM Corporation's EPP tool as described in A. Frenkiel et al., “EPP: A Framework for Measuring the End-to-End Performance of Distributed Applications,” Proceedings of Performance Engineering ‘Best Practices’ Conference, IBM Academy of Technology, 1999; and the Keynote product as described in “Using Keynote Measurements to Evaluate Content Delivery Networks” available on the World Wide Web at keynote.com/services/html/product_lib.html), rather than for the purpose of problem diagnosis or problem determination (PD).
Recent work by M. Brodie et al. (e.g., “Optimizing probe selection for fault localization,” Distributed Systems Operation and Management, 2001; “Intelligent Probing: A Cost-Efficient Approach to Fault Diagnosis in Computer Networks,” IBM Systems Journal 41(3): 372-385; and U.S. patent application identified as Ser. No. 10/676,244, now U.S. Pat. No. 7,167,998, filed on Sep. 30, 2003 and entitled “Problem Determination Using Probing”) proposed using probing for diagnosis. However, the work focused mainly on pre-planned, fixed probe sets, which are scheduled to run periodically. Because the probe set is computed off-line, it needs to be able to diagnose all possible problems which might occur. In practice, however, many of these problems may never happen, and running the complete set of pre-planned probes might be quite wasteful.
Another disadvantage of pre-planned probe sets is that because the probes run periodically at regularly scheduled intervals, there may be a considerable delay in obtaining information when a problem occurs. It is clearly desirable to detect the occurrence of a problem as quickly as possible. Furthermore, once the occurrence of a problem has been detected, additional information may be needed to diagnose the problem precisely. This information may not be obtainable from the results of the pre-planned probes.
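To make the notion of a pre-planned probe set more concrete, the following Python sketch models each probe as the set of components it traverses and checks off-line whether a fixed probe set can distinguish every possible fault. It is a simplified illustration and not the method of the cited work: it assumes that at most one component fails at a time and that a probe fails exactly when it traverses the failed component. The component names, probe names and function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical system: each probe is described by the components it traverses.
COMPONENTS: List[str] = ["web", "app", "db"]
PROBES: Dict[str, FrozenSet[str]] = {
    "p1": frozenset({"web", "app"}),
    "p2": frozenset({"app", "db"}),
    "p3": frozenset({"web"}),
}


def signature(component: str, probes: Dict[str, FrozenSet[str]]) -> Tuple[bool, ...]:
    """Probe outcomes (True = probe fails) expected if `component` is faulty."""
    return tuple(component in path for path in probes.values())


def can_diagnose_all(components: List[str], probes: Dict[str, FrozenSet[str]]) -> bool:
    """Off-line check: every single fault is detected and uniquely identified."""
    signatures = [signature(c, probes) for c in components]
    no_fault = tuple(False for _ in probes)
    all_unique = len(set(signatures)) == len(signatures)
    return all_unique and no_fault not in signatures


def diagnose(failed_probes: Set[str], components: List[str],
             probes: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return the components whose expected signature matches the observation."""
    observed = tuple(name in failed_probes for name in probes)
    return [c for c in components if signature(c, probes) == observed]
```

In this sketch the whole probe set must be run in every period so that `diagnose` has a complete observation, even when no component has failed, which is the overhead noted above.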
Another commonly used approach involves performing event correlation (see, e.g., S. Kliger et al., “A Coding Approach to Event Correlation,” IM 1997; and B. Gruschke et al., “Integrated Event Management: Event Correlation Using Dependency Graphs,” DSOM 1998) for identifying root causes of problems. Problem determination is performed by analyzing alarms emitted by devices when a problematic situation occurs.
However, unlike probes, events in event correlation are “reactive” to a situation and require intensive instrumentation, which is possible only in a tightly managed environment. Moreover, event correlation is a “passive” approach that requires handling potentially huge volumes of events, many of them unrelated to the problem of interest. The probing scheme, in contrast, uses test transactions that can be configured and executed without additional instrumentation of the existing system.
There is also related work on performance measurement based on probing described in V. Paxson, “End-to-end Internet packet dynamics,” Proceedings of SIGCOMM, pp. 139-152, 1997.
Thus, a need exists for improved problem diagnosis techniques for use in accordance with distributed systems.