By design, the Internet is opaque to its applications, providing best-effort packet delivery with little or no information about the likely performance or reliability characteristics of different paths. While this approach is reasonable for simple client-server applications, many emerging large-scale distributed services depend on richer information about the state of the network. For example, content distribution networks like Akamai™, Coral™, and CoDeeN™ redirect each client to the replica providing the best performance for that client. Likewise, voice-over-IP systems such as Skype™ use relay nodes to bridge hosts behind network address translation (NAT) implementations and firewalls, and the choice of relay node can dramatically affect call quality. Peer-to-peer file distribution, overlay multicast, distributed hash tables, and many other overlay services can benefit from peer selection based on different metrics of network performance, such as latency, available bandwidth, and loss rate. Finally, the operation of the Internet itself can benefit from richer information about its own state, e.g., ISPs can monitor the global state of the Internet to analyze reachability and root causes of failures, and to detect routing instability and the onset of distributed denial of service (DDoS) attacks.
If Internet performance were easily predictable, its opaqueness might be an acceptable state of affairs. However, Internet behavior is well known to be fickle, with local hot spots, transient (and partial) disconnectivity, and triangle inequality violations all being quite common. Many large-scale services adapt to this state of affairs by building their own proprietary and application-specific information plane. Not only is this approach redundant, but it also prevents new applications from leveraging information already gathered by other applications. The result is often suboptimal. For example, most implementations of the file distribution tool BitTorrent™ choose peers at random (or at best, using round-trip latency estimates); since download speed depends on bandwidth rather than latency, this approach can yield suboptimal download times. By some estimates, BitTorrent accounts for roughly a third of backbone traffic, so inefficiency at this scale is a serious concern. Moreover, implementing an information plane correctly is often quite subtle. For example, large-scale probing of end-hosts can raise intrusion alarms in edge networks because the traffic can resemble a DDoS attack. Such probing is the most common source of complaints on PlanetLab.
To address this concern, several research efforts, such as IDMaps™, GNP™, Vivaldi™, Meridian™, and PlanetSeer™, have investigated providing a common measurement infrastructure for distributed applications. These systems provide only a limited subset of the performance metrics of interest (most commonly, latency between a pair of nodes), whereas most applications desire richer information, such as loss rate and bandwidth capacity. By treating the Internet as a black box, most of these services abstract away network characteristics and atypical behavior, which is exactly the information of value for troubleshooting as well as improving performance. For example, the most common latency prediction methods use metric embeddings, which are fundamentally incapable of predicting detour paths, since such paths violate the triangle inequality. More importantly, being agnostic to network structure, they cannot pinpoint failures, identify the causes of poor performance, predict the effect of network topology changes, or assist applications with new functionality, such as multipath routing.
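The limitation of metric embeddings can be illustrated with a minimal sketch. The node names, latency values, and coordinates below are hypothetical and chosen only for illustration; they are not measurements, and the code does not reproduce any of the named systems. The sketch searches for relay nodes that beat the direct path, first using an assumed table of measured latencies and then using distances in a Euclidean embedding. Because Euclidean distances always satisfy the triangle inequality, the embedding can never report a detour, whatever coordinates are assigned.

```python
import math
from itertools import permutations

# Hypothetical round-trip latencies in milliseconds (illustrative values only).
# The direct A-C path is slower than relaying through B, which violates
# the triangle inequality.
MEASURED = {
    ("A", "B"): 20.0,
    ("B", "C"): 25.0,
    ("A", "C"): 80.0,
}

# Hypothetical 2-D coordinates, as a coordinate-based predictor might assign.
COORDS = {"A": (0.0, 0.0), "B": (18.0, 10.0), "C": (45.0, 0.0)}


def measured_latency(x, y):
    """Look up a symmetric latency from the measurement table."""
    return MEASURED[(x, y)] if (x, y) in MEASURED else MEASURED[(y, x)]


def embedded_latency(x, y):
    """Predict latency as the Euclidean distance between coordinates."""
    return math.dist(COORDS[x], COORDS[y])


def find_detours(nodes, latency):
    """Return (src, relay, dst, direct, via_relay) for each faster one-hop detour."""
    detours = []
    for src, relay, dst in permutations(nodes, 3):
        if src < dst:  # report each unordered endpoint pair once
            direct = latency(src, dst)
            via_relay = latency(src, relay) + latency(relay, dst)
            if via_relay < direct:
                detours.append((src, relay, dst, direct, via_relay))
    return detours


nodes = ["A", "B", "C"]
measured_detours = find_detours(nodes, measured_latency)
embedded_detours = find_detours(nodes, embedded_latency)
print("detours in measured latencies:", measured_detours)
print("detours in embedded latencies:", embedded_detours)
```

In this sketch the measured table yields one detour (A to C through B, 45 ms instead of 80 ms), while the embedding yields none. Changing the coordinates does not help: any assignment of points in a metric space produces distances in which the relayed path is at least as long as the direct one.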
Accordingly, it would be desirable to move beyond mere latency prediction and develop a service to automatically infer sophisticated network behavior. Such a system should be able to measure or predict a plurality of different performance metrics affecting communication over the Internet (or some other form of wide area network) between two arbitrarily selected end-hosts, without requiring that the measurements of any of the performance metrics be initiated or actually carried out by either of the end-hosts. The data used to determine such performance metrics should be automatically collected and updated without normally requiring any user interaction. Further, the system that collects the data necessary for providing the performance metrics should be relatively efficient and should not impose an undue burden with regard to the traffic required to collect the data.