Field
The disclosed embodiments generally relate to data storage systems. More specifically, the disclosed embodiments relate to the design of an append-only data storage system that stores data blocks in extents and facilitates efficient erasure-coding of extents to provide fault tolerance.
Related Art
Organizations are presently using cloud-based storage systems to store large volumes of data. These cloud-based storage systems are typically operated by hosting companies that maintain a sizable storage infrastructure, often comprising thousands of servers that are sited in geographically distributed data centers. Customers typically buy or lease storage capacity from these hosting companies. In turn, the hosting companies provision storage resources according to the customers' requirements and enable the customers to access these storage resources.
Cloud-based storage systems often store sets of data items in large data objects called “extents” that can be many megabytes or even gigabytes in size. These extents can be replicated across multiple disks and machines to provide fault tolerance. For example, an extent can be replicated so that four copies of the extent reside on four separate machines. The process of replicating an extent is both simple and fast. However, replicating an extent requires a lot of storage space. In the above example, storing four copies of an extent requires four times the storage space of a single extent, which is 300% additional storage space. To conserve storage space, it is desirable to use a more-efficient scheme to provide fault tolerance, such as erasure-coding (e.g., using Reed-Solomon codes). For example, an RS(6,3) Reed-Solomon code takes six data symbols and generates three additional parity symbols to produce a nine-symbol codeword from which the original data can be recovered even if up to three of the nine symbols are lost. Note that the RS(6,3) Reed-Solomon code can be used to encode extents in a fault-tolerant manner and only requires 50% additional storage space (three parity symbols for every six data symbols).
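The storage comparison above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only; the function names (`replication_overhead`, `rs_overhead`, `stored_bytes`) and the 1 GiB extent size are assumptions introduced here and are not part of the disclosed embodiments.

```python
from fractions import Fraction


def replication_overhead(copies: int) -> Fraction:
    """Additional storage, relative to one copy, when `copies` copies are kept."""
    if copies < 1:
        raise ValueError("at least one copy is required")
    return Fraction(copies - 1)


def rs_overhead(data_symbols: int, parity_symbols: int) -> Fraction:
    """Additional storage for an RS(data, parity) code: parity / data."""
    if data_symbols < 1 or parity_symbols < 0:
        raise ValueError("invalid code parameters")
    return Fraction(parity_symbols, data_symbols)


def stored_bytes(extent_bytes: int, overhead: Fraction) -> Fraction:
    """Total bytes on disk for an extent under the given overhead."""
    return extent_bytes * (1 + overhead)


# Hypothetical 1 GiB extent, used only for illustration.
EXTENT = 1 << 30

four_copies = stored_bytes(EXTENT, replication_overhead(4))  # 4 GiB
rs_6_3 = stored_bytes(EXTENT, rs_overhead(6, 3))             # 1.5 GiB
```

Under these assumptions, four-way replication stores 4 GiB for a 1 GiB extent, while RS(6,3) stores 1.5 GiB for the same extent.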
However, using erasure codes to provide fault tolerance for extents can create complications. If an extent is partitioned into a number of subsections that are erasure-coded and then stored on different machines, the process of retrieving the extent becomes complicated and time-consuming because the extent must be retrieved in subsections from multiple machines. Also, the process of indexing a data item within an extent becomes more complicated if the extent is partitioned across different machines.
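To make the indexing complication concrete, the following Python sketch assumes one hypothetical layout in which an extent is split into six equal-sized contiguous subsections, each stored on a different machine. The function name `locate`, the layout, and the sizes are assumptions for illustration and do not describe any particular existing system. The sketch maps a data item's offset and length within the extent to the per-subsection reads needed to fetch it.

```python
def locate(offset: int, length: int, extent_size: int, k: int = 6):
    """Map a byte range in an extent to reads against k contiguous subsections.

    Returns a list of (subsection_index, offset_in_subsection, nbytes) tuples.
    Assumes (hypothetically) that subsection i holds bytes
    [i * frag_size, (i + 1) * frag_size) of the extent.
    """
    if offset < 0 or length < 0 or offset + length > extent_size:
        raise ValueError("range lies outside the extent")
    frag_size = -(-extent_size // k)  # ceiling division
    reads = []
    pos, end = offset, offset + length
    while pos < end:
        frag = pos // frag_size
        frag_off = pos % frag_size
        # Read up to the end of this subsection or the end of the item.
        nbytes = min(end - pos, frag_size - frag_off)
        reads.append((frag, frag_off, nbytes))
        pos += nbytes
    return reads


# A 100-byte item at offset 250 of a 600-byte extent straddles two subsections.
reads = locate(250, 100, 600)
```

In this layout a single data item can straddle a subsection boundary, so one logical read becomes multiple reads against different machines, and every lookup needs the extra offset translation.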
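Hence, what is needed is a technique for providing fault tolerance for extents without the storage overhead of replication and without the retrieval and indexing complications that arise when an extent is partitioned across machines.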