1. Technical Field
This invention generally relates to distributed systems and computer networks. Specifically, this invention relates to diagnosing performance bottlenecks and discovering dependencies in distributed systems and computer networks.
2. Description of Background
Monitoring and diagnosis of distributed computer systems and networks is an important issue in systems management that becomes increasingly challenging as the size and complexity of such systems increase. Given the heterogeneous, decentralized, and often non-cooperative nature of typical large-scale networks, it is impractical to assume that all statistics related to an individual system's components (e.g., links, routers, and application-layer components) can be collected for monitoring purposes. However, other types of measurements, including end-to-end transactions (or probes), may be relatively easy to obtain.
Consider diagnosing performance problems in a distributed system given end-to-end performance measurements provided by test transactions, or probes. For example, a ping is a probe whose performance metric is the end-to-end delay. This delay is a combination of the delays at all internal components of the system that the probe traveled through. Similarly, the response time of an application-layer probe (e.g., HTTP access, database access, etc.) is a combination of the delays at all components, both software and hardware, that the probe traveled through. Conventional techniques for problem diagnosis, such as codebook and network tomography, typically assume a known dependency matrix (or, in the case of network diagnosis, a routing matrix) that describes how each probe depends on the system's components.
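As an illustration only, the relationship between a dependency matrix and end-to-end probe delays can be sketched as follows. The sketch assumes that the "combination" of component delays is a simple sum, which the text above does not state; the component names, the matrix entries, and the delay values are all hypothetical.

```python
# Hypothetical system with four components (names are illustrative only).
COMPONENTS = ["router_a", "router_b", "web_server", "database"]

# Dependency matrix: entry [i][j] is 1 if probe i travels through
# component j, and 0 otherwise. The rows below are made-up examples.
DEPENDENCY_MATRIX = [
    [1, 0, 1, 0],  # probe 0: router_a, web_server
    [1, 1, 0, 1],  # probe 1: router_a, router_b, database
    [0, 1, 1, 1],  # probe 2: router_b, web_server, database
]


def end_to_end_delays(dependency_matrix, component_delays):
    """Compute each probe's delay from per-component delays.

    Assumption: delays combine additively, so a probe's end-to-end delay
    is the sum of the delays of the components it travels through.
    """
    for row in dependency_matrix:
        if len(row) != len(component_delays):
            raise ValueError("each matrix row needs one entry per component")
    return [
        sum(uses * delay for uses, delay in zip(row, component_delays))
        for row in dependency_matrix
    ]


# Made-up per-component delays in milliseconds, in COMPONENTS order.
component_delays_ms = [2, 3, 10, 25]
print(end_to_end_delays(DEPENDENCY_MATRIX, component_delays_ms))  # [12, 30, 38]
```

In this sketch, diagnosis corresponds to the reverse direction: given the observed probe delays and the matrix, estimate which component delays are abnormally large. The conventional techniques mentioned above take the matrix as a known input.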
However, obtaining dependency information may be too costly or infeasible in many situations. For example, network topology and routing information may be unavailable because non-cooperative administrative domains block access to topology discovery tools; components that affect a probe's performance may be hard to discover (e.g., low-level network elements or high-level application components such as a particular set of database tables crucial for transaction performance); maintaining up-to-date information about dynamically changing routing (e.g., in wireless and mobile networks) may be costly; and constructing dependency matrices for application-level transactions is typically a laborious process requiring expert knowledge of the system components that may affect a probe's performance.