Recently, distributed hash table (DHT) overlay networks have been used to solve the problem of data placement and retrieval in large-scale, Internet-sized storage systems. These systems generally include distributed network systems implemented, for example, using peer-to-peer (P2P) networks for storing vast amounts of data. The overlay networks are logical representations of the underlying physical networks and provide, among other functionality, data placement, information retrieval, and routing. Some examples of DHT overlay networks include content-addressable network (CAN), PASTRY, and CHORD.
Data is represented in an overlay network as a (key, value) pair, such as (K1,V1). K1 is deterministically mapped to a point P in the overlay network using a hash function, e.g., P=h(K1). The key-value pair (K1,V1) is then stored at the point P in the overlay network, i.e., at the node owning the zone where point P lies. The same hash function is used to retrieve data: the point P is calculated from K1, and the data is then retrieved from the node owning the zone where point P lies. This is further illustrated with respect to the 2-dimensional CAN overlay network 900 shown in FIG. 9.
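The placement and retrieval steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the use of SHA-256 as the hash function h, the four equal quadrant zones, and the names `key_to_point`, `Zone`, `put`, and `get` are hypothetical and do not come from FIG. 9 or from any particular CAN implementation.

```python
import hashlib


def key_to_point(key, dims=2):
    """Deterministically map a key to a point P in the unit cube [0,1)^dims."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Assumption: 4 bytes of the digest per coordinate (so dims <= 8).
    return tuple(
        int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2**32
        for i in range(dims)
    )


class Zone:
    """A rectangular zone [lo, hi) owned by one node, with its local store."""

    def __init__(self, owner, lo, hi):
        self.owner, self.lo, self.hi = owner, lo, hi
        self.store = {}

    def contains(self, point):
        return all(l <= x < h for l, x, h in zip(self.lo, point, self.hi))


def owner_zone(zones, point):
    """Return the zone in which the point lies."""
    return next(z for z in zones if z.contains(point))


def put(zones, key, value):
    """Store (key, value) at the node owning the zone where P = h(key) lies."""
    owner_zone(zones, key_to_point(key)).store[key] = value


def get(zones, key):
    """Recompute P with the same hash function and read from its owner."""
    return owner_zone(zones, key_to_point(key)).store.get(key)


# Hypothetical layout: four equal quadrants of the [0,1) x [0,1) space.
ZONES = [
    Zone("A", (0.0, 0.0), (0.5, 0.5)),
    Zone("B", (0.0, 0.5), (0.5, 1.0)),
    Zone("C", (0.5, 0.0), (1.0, 0.5)),
    Zone("D", (0.5, 0.5), (1.0, 1.0)),
]

put(ZONES, "K1", "V1")
print(get(ZONES, "K1"))
```

Because the mapping from key to point is deterministic, any node can locate the owner of K1 without a central directory. In a real overlay the lookup of the owning zone would be performed by routing rather than by scanning a global zone list.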
A CAN overlay network logically represents the underlying physical network using a d-dimensional Cartesian coordinate space on a d-torus. FIG. 9 illustrates a 2-dimensional [0,1]×[0,1] Cartesian coordinate space in the overlay network 900. The Cartesian space is partitioned into CAN zones 910-914 owned by nodes A-E, respectively. The nodes A-E each maintain a coordinate routing table that holds the IP address and virtual coordinate zone of each of its immediate neighbors. Two nodes are neighbors if their zones overlap along d-1 dimensions and abut along one dimension. For example, nodes B and D are neighbors, but nodes B and C are not neighbors because their zones 911 and 914 do not abut along one dimension. Each node in the overlay network 900 owns a zone. The coordinates for the zones 910-914 are shown.
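The neighbor rule stated above (overlap along d-1 dimensions, abutment along one) can be checked with a short routine. The sketch below is an assumption-laden illustration: zones are given as hypothetical (lower corner, upper corner) pairs, the zone coordinates are invented rather than taken from FIG. 9, and wraparound on the torus is ignored for simplicity.

```python
def are_neighbors(zone_a, zone_b):
    """Return True if two zones abut along exactly one dimension and
    overlap along all remaining dimensions (torus wraparound ignored)."""
    (a_lo, a_hi), (b_lo, b_hi) = zone_a, zone_b
    dims = len(a_lo)
    abutting = 0
    overlapping = 0
    for d in range(dims):
        if a_hi[d] == b_lo[d] or b_hi[d] == a_lo[d]:
            # The two intervals touch end to end in this dimension.
            abutting += 1
        elif max(a_lo[d], b_lo[d]) < min(a_hi[d], b_hi[d]):
            # The two intervals share a span of positive length.
            overlapping += 1
    return abutting == 1 and overlapping == dims - 1


# Hypothetical zones in a 2-dimensional [0,1] x [0,1] space.
left_half = ((0.0, 0.0), (0.5, 1.0))
upper_right = ((0.5, 0.5), (1.0, 1.0))
lower_left = ((0.0, 0.0), (0.5, 0.5))

print(are_neighbors(left_half, upper_right))   # share an edge
print(are_neighbors(lower_left, upper_right))  # touch only at a corner
```

Zones that touch only at a corner abut along two dimensions and overlap along none, so they are not neighbors under this rule.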
Routing in the overlay network 900 is performed by routing to a destination node through neighboring nodes. Assume the node B is retrieving data from a point P in the zone 914 owned by the node C. Because the point P is not in the zone 911 or any of the neighboring zones of the node B, the request for data is routed through the neighboring zone 913 owned by the node D to the node C owning the zone 914 where point P lies to retrieve the data. Thus, a CAN message includes destination coordinates, such as the coordinates for the point P, determined using the hash function. Using the source node's neighbor coordinate set, the source node routes the request by simple greedy forwarding to the neighbor with coordinates closest to the destination coordinates, such as shown in the path B-D-C.
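Greedy forwarding of this kind can be sketched as follows. The layout is hypothetical (four quadrant zones rather than the five zones of FIG. 9), each neighbor's "coordinates" are assumed to be the center of its zone, distance is plain Euclidean distance without torus wraparound, and the names `ZONES`, `NEIGHBORS`, and `route` are illustrative only.

```python
import math

# Hypothetical zones: node -> (lower corner, upper corner).
ZONES = {
    "A": ((0.0, 0.0), (0.5, 0.5)),
    "B": ((0.0, 0.5), (0.5, 1.0)),
    "C": ((0.5, 0.0), (1.0, 0.5)),
    "D": ((0.5, 0.5), (1.0, 1.0)),
}

# Coordinate routing tables: each node knows only its immediate neighbors.
NEIGHBORS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}


def contains(zone, point):
    lo, hi = zone
    return all(l <= x < h for l, x, h in zip(lo, point, hi))


def center(zone):
    lo, hi = zone
    return tuple((l + h) / 2 for l, h in zip(lo, hi))


def route(source, dest_point, max_hops=16):
    """Forward greedily to the neighbor whose zone center is closest to the
    destination coordinates until the owning node is reached."""
    path = [source]
    node = source
    while not contains(ZONES[node], dest_point):
        if len(path) > max_hops:
            raise RuntimeError("routing did not converge")
        node = min(
            NEIGHBORS[node],
            key=lambda n: math.dist(center(ZONES[n]), dest_point),
        )
        path.append(node)
    return path


# Destination coordinates would normally come from hashing the key.
print("-".join(route("B", (0.9, 0.4))))
```

In this layout the node B has two neighbors, A and D, and forwards to D because the center of D's zone is closer to the destination point than the center of A's zone. Each node needs only its own neighbor table to make the forwarding decision.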
Without considering proximity information about nodes, CAN and other types of overlay networks operate far less efficiently than what is optimally possible. For example, referring to the CAN overlay network 900, the node B may select the node D when routing to the point P, because node D's coordinates may be closer to the destination than node A's coordinates. However, the number of logical hops in the overlay network 900 may be much smaller than the number of network hops in the physical network when routing to the destination node. For example, there may be 100 network hops in the path B-D-C and 50 network hops in the path B-A-C. Thus, when the underlying network topology is not considered and the path with more network hops is selected, more network traffic is generated and latencies are increased.
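The mismatch between logical and physical hop counts in this example can be made concrete with a few lines of code. The sketch below simply records the two path totals given above; the table `PHYSICAL_HOPS` and the helper names are hypothetical, and no per-link hop counts are assumed.

```python
# Physical (network) hop totals for the two overlay paths in the example.
PHYSICAL_HOPS = {
    ("B", "D", "C"): 100,
    ("B", "A", "C"): 50,
}


def logical_hops(path):
    """Number of overlay hops: one per forwarding step between nodes."""
    return len(path) - 1


def fewest_physical_hops(paths):
    """Pick the path with the smallest physical hop total."""
    return min(paths, key=lambda p: PHYSICAL_HOPS[p])


candidates = list(PHYSICAL_HOPS)
for path in candidates:
    print("-".join(path), logical_hops(path), PHYSICAL_HOPS[path])
print("fewest physical hops:", "-".join(fewest_physical_hops(candidates)))
```

Both paths have the same number of logical hops, so a routing decision based only on overlay coordinates cannot distinguish them, even though one path crosses twice as many physical links as the other.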