The exponential growth of Internet connectivity and data storage needs has led to an increased demand for scalable, fault-tolerant distributed file-systems for processing and storing large-scale data sets. Large data sets may be tens of terabytes to petabytes in size. Such data sets are far too large to store on a single computer.
Distributed file-systems are designed to solve this issue by partitioning and replicating a file-system across a cluster of multiple servers. By partitioning large-scale data sets across tens to thousands of servers, distributed file-systems are able to accommodate large-scale file-system workloads.
Many existing petabyte-scale distributed file-systems rely on a single-master design, as described, e.g., by Sanjay Ghemawat, H. G.-T., “The Google File-system”, 19th ACM Symposium on Operating System Principles, Lake George, N.Y. 2003. In that case, one master machine stores and processes all file-system metadata operations, while a large number of slave machines store and process all data operations. File metadata consists of all of the data describing the file itself. Metadata thus typically includes information such as the file owner, contents, last modified time, unique file number or other identifiers, data storage locations, and so forth.
The single-master design has fundamental scalability, performance and fault tolerance limitations. The master must store all file metadata. This limits the storage capacity of the file-system, as all metadata must fit on a single machine. Furthermore, the master must process all file-system metadata operations, such as file creation, deletion, and rename. As a consequence, these operations are not scalable, because they must be processed by a single server. Data operations, by contrast, are scalable, since they can be spread across the tens to thousands of slave servers that process and store data. It should also be noted that metadata for a file-system with billions of files can easily reach terabytes in size, and such workloads cannot be efficiently addressed with a single-master distributed file-system.
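The scale of the metadata problem can be illustrated with a rough back-of-the-envelope estimate. The following sketch is illustrative only: the file count and the per-file metadata size are assumed figures chosen for the example, not measurements of any particular file-system, and the function name is hypothetical.

```python
def metadata_footprint_bytes(num_files: int, bytes_per_file: int) -> int:
    """Estimate the total metadata size as file count times per-file metadata size."""
    return num_files * bytes_per_file


# Assumed figures for illustration only.
NUM_FILES = 5_000_000_000        # five billion files (assumption)
BYTES_PER_FILE = 1_000           # about 1 KB of metadata per file (assumption)

total_bytes = metadata_footprint_bytes(NUM_FILES, BYTES_PER_FILE)
total_terabytes = total_bytes / 1e12  # decimal terabytes

print(f"Estimated metadata footprint: {total_terabytes:.1f} TB")
```

Under these assumed figures the metadata alone amounts to several terabytes, all of which a single master would have to hold and serve.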
The trend of increasingly large data sets and an emphasis on real-time, low-latency responses and continuous availability has also reshaped the high-scalability database field. Distributed key-value store databases have been developed to provide fast, scalable database operations over a large cluster of servers. In a key-value store, each row has a unique key, which is mapped to one or more values. Clients create, update, or delete rows identified by their respective key. Single-row operations are atomic.
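The key-value data model described above can be sketched, purely for illustration, as a minimal single-process store. The class name `KeyValueStore` and its methods are hypothetical and do not correspond to the interface of any real distributed key-value store. A lock stands in for the single-row atomicity guarantee, and no distribution or replication is modeled.

```python
import threading
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory store: each unique row key maps to one or more named values."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}
        # The lock makes each single-row operation appear atomic to callers.
        self._lock = threading.Lock()

    def put(self, key: str, values: Dict[str, str]) -> None:
        """Create the row, or replace it if the key already exists."""
        with self._lock:
            self._rows[key] = dict(values)

    def get(self, key: str) -> Optional[Dict[str, str]]:
        """Return a copy of the row, or None if the key is absent."""
        with self._lock:
            row = self._rows.get(key)
            return dict(row) if row is not None else None

    def delete(self, key: str) -> bool:
        """Delete the row; return True if it existed."""
        with self._lock:
            return self._rows.pop(key, None) is not None


store = KeyValueStore()
store.put("row-1", {"owner": "alice", "size": "42"})
print(store.get("row-1"))
store.delete("row-1")
print(store.get("row-1"))
```

Note that atomicity here covers one row at a time only; nothing in this sketch makes an update spanning two different keys atomic.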
Highly scalable distributed key-value stores such as Amazon Dynamo described, e.g., by DeCandia, G. H., “Dynamo: Amazon's Highly-Available Key-Value Store”, 2007, SIGOPS Operating Systems Review, and Google BigTable described, e.g., by Chang, F. D., “Bigtable: A Distributed Storage System for Structured Data”, 2008, ACM Transactions on Computer Systems, have been used to store and analyze petabyte-scale datasets. These distributed key-value stores provide a number of highly desirable qualities, such as automatically partitioning key ranges across multiple servers, automatically replicating keys for fault tolerance, and providing fast key lookups. The distributed key-value stores support billions of rows and petabytes of data.
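As a rough illustration of what partitioning key ranges across multiple servers means, the sketch below maps each key to a server using sorted split points. It is a simplified assumption for exposition and is not meant to represent the actual partitioning scheme of either system cited above. The class name, server names, and split points are hypothetical, and replication is not modeled.

```python
import bisect
from typing import List, Sequence


class RangePartitioner:
    """Assign keys to servers by contiguous, lexicographically ordered key ranges."""

    def __init__(self, split_points: Sequence[str], servers: Sequence[str]) -> None:
        # N split points divide the key space into N + 1 ranges, one per server.
        if len(servers) != len(split_points) + 1:
            raise ValueError("need exactly one more server than split points")
        if list(split_points) != sorted(split_points):
            raise ValueError("split points must be sorted")
        self._split_points: List[str] = list(split_points)
        self._servers: List[str] = list(servers)

    def server_for(self, key: str) -> str:
        """Return the server whose key range contains the given key."""
        # A key equal to a split point belongs to the range that starts there.
        index = bisect.bisect_right(self._split_points, key)
        return self._servers[index]


# Hypothetical three-server layout: keys < "g", "g" <= keys < "n", keys >= "n".
partitioner = RangePartitioner(["g", "n"], ["server-0", "server-1", "server-2"])
for key in ("apple", "mango", "zebra"):
    print(key, "->", partitioner.server_for(key))
```

Because each lookup is a binary search over the split points, locating the responsible server stays cheap even when the key space is divided into many ranges.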
What is needed is a system and method for storing distributed file-system metadata on a distributed key-value store, allowing for far more scalable, fault-tolerant, and high-performance distributed file-systems with distributed metadata. The challenge is to provide traditional file-system guarantees of atomicity and consistency even when metadata may be distributed across multiple servers, using only the operations exposed by real-world distributed key-value stores.