Field
The disclosed embodiments generally relate to data storage systems. More specifically, the disclosed embodiments relate to the design of a data storage system that uses colocation hints provided by a client application to facilitate accessing data blocks in a distributed data storage system.
Related Art
Organizations are beginning to use cloud-based storage systems to store large volumes of data. These cloud-based storage systems are typically operated by hosting companies that maintain a sizable storage infrastructure, often comprising thousands of servers that are sited in geographically distributed data centers. Customers typically buy or lease storage capacity from these hosting companies. In turn, the hosting companies provision storage resources according to the customers' requirements and enable the customers to access these storage resources.
To provide fault tolerance, data items are often replicated across different storage devices. In this way, if a specific storage device fails, the data items on the failed storage device can be accessed from other storage devices. To provide an even higher level of fault tolerance, data items can be replicated across geographically distributed data centers. In this way, if a data center fails (or becomes inaccessible), copies of the data items stored at that data center can be accessed from another data center.
For efficiency reasons, it is desirable to “colocate” copies of a related set of data items at the same data center. In this way, an application can access the set of related data items from a single data center, without having to perform a large number of slow accesses to remote data centers. For example, colocating a set of associated files at the same data center enables a client application to efficiently perform a keyword search through the set of files. In contrast, if the set of files needs to be accessed from multiple data centers, the same keyword search requires many slow remote accesses and is therefore far more time-consuming.
Also, if a set of data items is replicated across multiple data centers, it is desirable for each data center holding such data items to have a complete copy of the set of data items. In this way, if a data center containing the set of data items fails, the set of data items can be accessed from another data center that contains a complete copy of the set of data items. This is more efficient than accessing the set of data items from multiple data centers.
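The distinction drawn in the two preceding paragraphs can be illustrated with a minimal sketch. It is not part of the disclosed embodiments. The data center names, file names, placement maps, and the function surviving_complete_copies are all hypothetical and chosen only for illustration. Both placements store each file exactly twice, yet only the colocated placement leaves a data center holding a complete copy of the related set after a failure.

```python
from typing import Dict, Set

# Hypothetical placement map: data center name -> names of the items it holds.
Placement = Dict[str, Set[str]]


def surviving_complete_copies(
    placement: Placement, items: Set[str], failed: Set[str]
) -> Set[str]:
    """Return the non-failed data centers that hold every item in `items`."""
    return {
        dc
        for dc, held in placement.items()
        if dc not in failed and items <= held
    }


# A related set of files (illustrative names only).
related = {"a.txt", "b.txt", "c.txt"}

# Each file is replicated twice, but no data center holds the whole set.
scattered: Placement = {
    "dc1": {"a.txt", "b.txt"},
    "dc2": {"b.txt", "c.txt"},
    "dc3": {"a.txt", "c.txt"},
}

# Each file is replicated twice, and the replicas are colocated.
colocated: Placement = {
    "dc1": {"a.txt", "b.txt", "c.txt"},
    "dc2": {"a.txt", "b.txt", "c.txt"},
    "dc3": set(),
}

# With scattered placement, no single data center can serve the whole set,
# even when nothing has failed.
print(surviving_complete_copies(scattered, related, failed=set()))

# With colocated placement, one complete copy remains after dc1 fails.
print(surviving_complete_copies(colocated, related, failed={"dc1"}))
```

In the scattered case, a keyword search over the related set must contact at least two data centers. In the colocated case, a single data center suffices both before and after the failure of dc1.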
Hence, what is needed is a system that facilitates colocating related data items within a distributed data storage system.