Timely detection of performance degradation or unavailability of service providers is crucial to providing high quality of service (QoS) in distributed systems, particularly in very large-scale ones, such as computational grids and data grids. This becomes especially important when service providers are unreliable peers in peer-to-peer or grid systems, where the peers can join and leave the system at arbitrary points in time. Directly measuring the performance or availability of each peer on a regular basis can be quite costly, or even impossible, in very large-scale and highly dynamic systems. Such a proactive approach, in which every peer is measured regularly, does not scale with the size of the system.
Nonetheless, many distributed applications, including peer-to-peer and grid computing systems, would function more effectively if they could detect the performance, availability, and quality of service of their service providers. The term “service provider” as used herein refers to, for example, a server providing a service over a network, and not to a general IP carrier network. The purpose of detection is to allow adjustments in the use of infrastructure, both to assure the performance of service providers and to achieve better scalability. Both peer-to-peer and grid computing systems typically operate over unreliable or variable-performance distributed environments. It is well known that such dynamic behavior results from shared use of computation and communication resources, such as bandwidth, communication time, CPU time, or disk space.
Two modes can be adopted to determine the service status of a service provider accessed over a distributed or networked system: the proactive mode mentioned above, or a reactive mode. In the proactive mode, status information is updated periodically or whenever there is a change. In the reactive mode, status is gathered only when it is needed. Active discovery of status incurs overhead, both in the discovery itself and in keeping the status information current (awareness of the system). However, accurate and timely status information is needed to provide better services for clients (or consumers) and to maintain a scalable system. Therefore, a decision has to be made about how often and when to probe or detect the status of service providers, and how to categorize service quality.
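By way of illustration only, the difference in overhead between the two modes can be sketched as follows. This is a minimal sketch and not a description of any particular system: the class names `ProactiveMonitor` and `ReactiveMonitor`, the function `compare_overhead`, and the always-successful probe are all hypothetical and introduced solely for this example.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical probe signature: returns True if the named provider is available.
Probe = Callable[[str], bool]


class ProactiveMonitor:
    """Refreshes the status of every provider on each tick, whether or not anyone asks."""

    def __init__(self, providers: Iterable[str], probe: Probe) -> None:
        self.providers = list(providers)
        self.probe = probe
        self.status: Dict[str, bool] = {}
        self.probes_sent = 0

    def tick(self) -> None:
        # One periodic update: every provider is probed.
        for name in self.providers:
            self.status[name] = self.probe(name)
            self.probes_sent += 1

    def is_up(self, name: str) -> Optional[bool]:
        # Answers from stored status; the answer may be stale (None if never probed).
        return self.status.get(name)


class ReactiveMonitor:
    """Probes a provider only when its status is actually requested."""

    def __init__(self, probe: Probe) -> None:
        self.probe = probe
        self.probes_sent = 0

    def is_up(self, name: str) -> bool:
        self.probes_sent += 1
        return self.probe(name)


def compare_overhead(n_providers: int, n_ticks: int, n_queries: int) -> Tuple[int, int]:
    """Return (proactive probe count, reactive probe count) for the same workload."""
    providers = [f"peer-{i}" for i in range(n_providers)]
    probe: Probe = lambda name: True  # assumption: every probe succeeds

    proactive = ProactiveMonitor(providers, probe)
    for _ in range(n_ticks):
        proactive.tick()

    reactive = ReactiveMonitor(probe)
    for name in providers[:n_queries]:
        proactive.is_up(name)  # served from stored status, no new probe
        reactive.is_up(name)   # one probe per query

    return proactive.probes_sent, reactive.probes_sent
```

Under these assumptions, `compare_overhead(1000, 10, 5)` returns `(10000, 5)`: the proactive monitor's cost grows with the number of providers and update periods, while the reactive monitor's cost grows only with the number of status requests, at the price of a probe delay on each request.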
Event correlation is a commonly used approach for problem determination in distributed systems. Event correlation seeks to match event combinations with potential failures in a system. However, this approach assumes the availability of a “codebook” which identifies each problem that may be diagnosed and the corresponding event combinations that will accompany an occurrence of the problem. Probing techniques constitute a similar approach for problem diagnosis, where it is assumed that there is a set of possible end-to-end test transactions (probes); a set of system components; and a “dependency matrix” specifying which components each probe examines. Recent work on active probing reports a considerably more efficient approach than codebook and “passive” probing (in some cases by 70% or more), achieved by actively selecting the next most-informative probe at each step.
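By way of illustration only, probe-based diagnosis with a dependency matrix can be sketched as follows. This is a simplified sketch under stated assumptions, not the method of any cited work: it assumes exactly one faulty component, assumes a probe fails if and only if it examines that component, and uses an even-split heuristic as a stand-in for "most-informative". The names `diagnose`, `make_probe_runner`, `DEPENDENCY`, and `COMPONENTS` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical dependency matrix: probe name -> components that the probe examines.
DEPENDENCY: Dict[str, Set[str]] = {
    "p1": {"A", "B"},
    "p2": {"A", "C"},
    "p3": {"D"},
}
COMPONENTS: Set[str] = {"A", "B", "C", "D"}


def make_probe_runner(dependency: Dict[str, Set[str]], faulty: str) -> Callable[[str], bool]:
    """Simulate probes: a probe succeeds unless it examines the faulty component."""
    def run_probe(name: str) -> bool:
        return faulty not in dependency[name]
    return run_probe


def diagnose(
    dependency: Dict[str, Set[str]],
    components: Set[str],
    run_probe: Callable[[str], bool],
) -> Tuple[Set[str], List[str]]:
    """Narrow down the faulty component, choosing one probe at a time.

    Returns the remaining candidate components and the probes actually run.
    """
    candidates = set(components)
    remaining = dict(dependency)
    used: List[str] = []

    while len(candidates) > 1 and remaining:
        # Prefer the probe that splits the candidates most evenly; under the
        # single-fault assumption this rules out about half of them either way.
        def imbalance(name: str) -> int:
            covered_count = len(remaining[name] & candidates)
            return abs(2 * covered_count - len(candidates))

        best = min(sorted(remaining), key=imbalance)
        covered = remaining.pop(best) & candidates
        if not covered or covered == candidates:
            break  # no remaining probe can distinguish the candidates

        used.append(best)
        if run_probe(best):
            candidates -= covered  # probe succeeded: examined components are healthy
        else:
            candidates = covered   # probe failed: the fault is among those examined

    return candidates, used
```

For example, with component `C` faulty, `diagnose(DEPENDENCY, COMPONENTS, make_probe_runner(DEPENDENCY, "C"))` returns `({'C'}, ['p1', 'p2'])`: two of the three probes suffice, whereas a passive scheme would run all of them. The sketch also makes the limitation plain, since it cannot run at all without the dependency matrix.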
However, in many real systems, no dependency information (i.e., no dependency matrix or codebook) is readily available. Accordingly, those skilled in the art seek an alternative for determining availability and performance of service providers in a distributed system. In particular, those skilled in the art seek methods and apparatus that minimize the need for developing a priori a comprehensive understanding or codebook that documents relationships between problems and associated event occurrences; that generally minimize the need for active probing of service provider status; and that use information, where available, to determine availability and performance of service providers in a distributed system.