1. Field
This disclosure relates generally to computer networks and, more particularly, to a scalable, high-throughput data storage and coordination mechanism for use in distributed applications.
2. Background
The high degree of networking among computer systems and the need to support distributed applications have led to the use of distributed data storage networks. Distributed data storage networks include a plurality of storage nodes and provide a plurality of clients with storage areas. The storage nodes may be connected to each other through a network. When a client stores data in the distributed storage system, the system distributes a predetermined number of replicas of the data to other storage nodes. Such data replication may enable faster retrieval of the data because the data can be retrieved from the node that is closest or fastest. Data replication may also increase available network bandwidth by reducing the need to forward data requests and data transfers throughout the network. Data replication may also increase the fault tolerance of an application because, if one node fails, the necessary data can still be obtained from another node that remains operational.
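By way of non-limiting illustration, the replication behavior described above may be sketched as follows. The sketch is not drawn from any particular system; the names StorageNode, store, and retrieve, the replica count of three, and the per-node latency figure are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field

REPLICA_COUNT = 3  # assumed value for the "predetermined number" of copies


@dataclass
class StorageNode:
    name: str
    latency_ms: float  # hypothetical measured round-trip time to this node
    alive: bool = True
    data: dict = field(default_factory=dict)


def store(nodes, key, value, replicas=REPLICA_COUNT):
    """Write the value to the first `replicas` live nodes and return them."""
    targets = [n for n in nodes if n.alive][:replicas]
    if len(targets) < replicas:
        raise RuntimeError("not enough live nodes for the requested replica count")
    for node in targets:
        node.data[key] = value
    return targets


def retrieve(nodes, key):
    """Read from the fastest live node that holds a replica of the key."""
    holders = [n for n in nodes if n.alive and key in n.data]
    if not holders:
        raise KeyError(key)
    fastest = min(holders, key=lambda n: n.latency_ms)
    return fastest.data[key], fastest.name
```

In this sketch a read is served by whichever live replica reports the lowest latency, and the failure of one replica holder leaves the data retrievable from the remaining holders.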
Some known data storage applications have employed distributed hash tables (DHTs) for storing data. Examples of DHTs include Chord, Content-Addressable Network (CAN), Pastry, Tapestry, Symphony, Kademlia, and Dynamo. In conventional DHTs such as Chord, a hash key generated from the data to be stored must be passed sequentially from node to node around a ring of computing nodes until a “matching” node is identified where the data is to be stored. The data is then stored on the matching node, and metadata is created to record the location of the stored data. To retrieve the stored data, a hash key is generated from the request, the metadata is accessed using the hash key to identify the node on which the data is stored, and the data is read from the identified node.
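By way of non-limiting illustration, the sequential ring traversal described above may be sketched as follows. The sketch models only successor-to-successor forwarding on a simplified ring and is not an implementation of Chord or of any other named system. The 16-bit identifier space, the explicit node identifiers, the single in-process metadata dictionary, and the names Node, build_ring, lookup, put, and get are assumptions introduced for illustration only.

```python
import hashlib

RING_BITS = 16  # assumed small identifier space, for illustration only
metadata = {}   # assumed location record: hash key -> node holding the data


def ring_id(key: str) -> int:
    """Hash a key onto the identifier ring [0, 2**RING_BITS)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


class Node:
    def __init__(self, node_id: int):
        self.id = node_id
        self.successor = self    # linked up by build_ring
        self.predecessor = self
        self.store = {}

    def owns(self, key_id: int) -> bool:
        """True if key_id falls in the arc (predecessor.id, self.id]."""
        lo, hi = self.predecessor.id, self.id
        if lo < hi:
            return lo < key_id <= hi
        return key_id > lo or key_id <= hi  # arc wraps past zero


def build_ring(node_ids):
    """Link nodes with distinct identifiers into a ring ordered by identifier."""
    nodes = sorted((Node(i) for i in node_ids), key=lambda n: n.id)
    for i, node in enumerate(nodes):
        node.successor = nodes[(i + 1) % len(nodes)]
        node.predecessor = nodes[i - 1]
    return nodes


def lookup(start: Node, key: str):
    """Pass the hash key from node to node until the matching node is found."""
    key_id = ring_id(key)
    node, hops = start, 0
    while not node.owns(key_id):
        node = node.successor
        hops += 1
    return node, hops


def put(start: Node, key: str, value):
    """Store the value on the matching node and record its location."""
    node, hops = lookup(start, key)
    node.store[key] = value
    metadata[ring_id(key)] = node
    return node, hops


def get(key: str):
    """Use the recorded location to read the value from the identified node."""
    node = metadata.get(ring_id(key))
    return None if node is None else node.store.get(key)
```

In this sketch a store operation on a ring of n nodes may require up to n - 1 forwarding hops before the matching node is reached, so the cost of sequential forwarding grows with the number of nodes.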
As distributed data storage networks have become larger and more complex, however, storage management has become a significant challenge. There is a need for distributed storage and coordination of vast amounts of data on a network that can support tens of thousands of simultaneous clients and thousands of storage servers without sacrificing performance as the system scales.