The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
Generally, erasure coding is a technique to encode N equally sized data elements into N+M same-sized data elements for transmission or storage so that the original N data elements can be recovered in the presence of up to M failures in the encoded N+M data elements. Such an encoding scheme is called an N+M erasure coding scheme. The space overhead of an N+M scheme is M/N. In most erasure coding schemes, the N input blocks are transmitted (or stored) unmodified to the output, augmented with the M calculated ‘parity’ blocks. For example, a simple 2+1 coding scheme takes two equally sized items, A and B, and produces an additional item C, which is the bitwise XOR of A and B, and transmits/stores A, B, and C with a space overhead of 50%. This scheme clearly tolerates a failure in any one of A, B, or C, as any one of them can be recovered from the other two:
A = B XOR C
B = A XOR C
C = A XOR B
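The 2+1 XOR scheme described above can be illustrated with a minimal sketch. The function names (`xor_blocks`, `encode_2_plus_1`, `recover_2_plus_1`) and the use of `None` to mark a lost block are assumptions made for illustration only and do not come from the description.

```python
from typing import Optional, Tuple

Block = Optional[bytes]  # None marks a lost block (illustrative convention)


def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Return the bitwise XOR of two equally sized blocks."""
    if len(x) != len(y):
        raise ValueError("blocks must be the same size")
    return bytes(p ^ q for p, q in zip(x, y))


def encode_2_plus_1(a: bytes, b: bytes) -> Tuple[bytes, bytes, bytes]:
    """Pass A and B through unmodified and append the parity block C = A XOR B."""
    return a, b, xor_blocks(a, b)


def recover_2_plus_1(a: Block, b: Block, c: Block) -> Tuple[bytes, bytes, bytes]:
    """Rebuild the single missing block, if any, from the other two."""
    if [a, b, c].count(None) > 1:
        raise ValueError("a 2+1 scheme tolerates at most one failure")
    if a is None:
        a = xor_blocks(b, c)  # A = B XOR C
    elif b is None:
        b = xor_blocks(a, c)  # B = A XOR C
    elif c is None:
        c = xor_blocks(a, b)  # C = A XOR B
    return a, b, c
```

Calling `recover_2_plus_1` with any one argument set to `None` returns the full triple; with two or more missing blocks it raises an error, which reflects the one-failure limit of the scheme.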
Erasure coding schemes can be quite complex and computationally expensive. Reed-Solomon encoding is the technique generally used. The above XOR/parity scheme is a simple instance of a Reed-Solomon code, but such codes become far more complex when the intent is to tolerate more than one failure. N+2 (with N often 8) Reed-Solomon codes are often used in Redundant Array of Inexpensive Disks (RAID) storage. Of course, such a scheme can be used with any storage medium (e.g. flash/SSD, persistent memory, etc.).
The traditional use of erasure coding for local storage is to take N+M identically sized disks and make them appear as one N times larger logical disk with internally redundantly encoded data that tolerates up to M disk failures (partial or total). The larger disk can then be software partitioned into logical partitions (logical disks) of any size. For example, 10×1 TB disk drives can be made to appear as a 1×8 TB logical disk drive with two-way redundancy by using an 8+2 erasure coding scheme such as Reed-Solomon. This allows the storage system to tolerate up to two disk failures with a space overhead of 25% (2/8). The logical 1×8 TB disk can then be software partitioned back into 8×1 TB logical disks, or any other combination (e.g. 1×5 TB logical partition plus 3×1 TB logical partitions). Although this form of erasure coding is space efficient, it comes at a performance cost, especially for writes.
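The arithmetic of the 10×1 TB example can be checked with a short sketch. The helper names `logical_layout` and `partitions_fit` are hypothetical and introduced only for illustration; disk sizes are taken as whole terabytes.

```python
from fractions import Fraction
from typing import List, Tuple


def logical_layout(disk_count: int, disk_tb: int, m: int) -> Tuple[int, Fraction]:
    """Return (logical capacity in TB, space overhead M/N) for an N+M layout
    built from disk_count identically sized disks, where N = disk_count - M."""
    n = disk_count - m
    if n < 1 or m < 0:
        raise ValueError("need at least one data disk and a non-negative M")
    return n * disk_tb, Fraction(m, n)


def partitions_fit(partitions_tb: List[int], capacity_tb: int) -> bool:
    """Check that a set of logical partitions fits in the logical disk."""
    return sum(partitions_tb) <= capacity_tb


capacity, overhead = logical_layout(disk_count=10, disk_tb=1, m=2)
# capacity is 8 (TB) and overhead is Fraction(1, 4), i.e. 25%
print(capacity, overhead, partitions_fit([5, 1, 1, 1], capacity))
```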
Further, when reading a block of data, and assuming no disk failures, the block of data can be read directly from one of the unmodified copies (the right one of the N). The ‘parity’ blocks are only used to recover the original data in case one of the original N is unavailable either due to a partial or total disk failure.
However, when a block of data is written to the logical 8 TB disk, this block must be combined with the contents of the other N−1 (7 in this example) data items in order to recalculate the M parity blocks. This requires (even absent failures) reading N−1 blocks and writing N+M blocks. In other words, absent failures, reads do not require any additional operations (just the read), but writes are amplified tremendously, as a write of size S requires N−1 reads of size S and N+M writes of size S.
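The write amplification described above can be restated as simple arithmetic. The sketch below takes the counts from the preceding paragraph at face value (N−1 reads and N+M writes per logical write of size S); the names `WriteCost` and `write_cost` are assumptions introduced for illustration.

```python
from typing import NamedTuple


class WriteCost(NamedTuple):
    block_reads: int   # physical reads of size S
    block_writes: int  # physical writes of size S
    bytes_moved: int   # total bytes read plus written


def write_cost(n: int, m: int, size: int) -> WriteCost:
    """I/O for one logical write of `size` bytes in an N+M scheme, absent
    failures, using the accounting stated in the text above."""
    if n < 1 or m < 0 or size < 0:
        raise ValueError("expected n >= 1, m >= 0, size >= 0")
    reads = n - 1   # the other N-1 data blocks needed to recalculate parity
    writes = n + m  # write count as given in the text
    return WriteCost(reads, writes, (reads + writes) * size)
```

Under this accounting, an 8+2 scheme turns one 4096-byte write into 7 reads and 10 writes, that is, 17 block transfers in total.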
Although writes are generally more latency tolerant than reads (they can be buffered), the extra bandwidth requirements (both reads and writes) can become a bottleneck for the system. This is unfortunate, as many storage system applications require far more write bandwidth than read bandwidth, because the software stack above the storage system often uses a large DRAM cache that significantly reduces the number of reads required. For example, databases have large in-memory caches so that any locality in the data accesses results in fewer reads from the underlying storage system. Similarly, file systems have file/buffer caches that exploit locality to avoid reads. It is not uncommon for real applications to have a write-to-read ratio (in bytes, not IOPS) of 80/20 or more. Note that this amplification is particularly problematic for the newer storage media (e.g. flash/SSD, persistent memory), as in these media reads are much cheaper than writes, so any amplification of writes into more writes is undesirable. Additionally, for hard disk drives, reads are more expensive than writes, as writes can often be buffered to exploit track locality, whereas reads often lack such locality because any locality in the reads has already been consumed by the in-memory caches in the applications or file system.
Although the aforementioned description is in terms of a local storage system (a single computer with multiple disk devices), the same applies to a distributed storage system where the data is spread across different computers (each possibly with multiple disk devices), with the storage system handling the distribution of the data to provide the appearance of one single storage device. In such a distributed system, the reads require network round trips, which add further latency to the writes and can reduce the effective write bandwidth as well. The situation is conceptually very similar to the single computer system; in practice, however, distributed storage systems use mirroring rather than erasure coding for redundancy because of these performance issues, at a larger space overhead: to tolerate 1 failure, mirroring requires 100% space overhead; to tolerate 2 failures, mirroring requires 200% space overhead; and so on, while with erasure coding the space overhead can be made very small (just make N larger for a small M).
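The space-overhead comparison between mirroring and erasure coding can be sketched as follows. The function names are hypothetical, and the values of N in the loop are arbitrary choices for illustration rather than values taken from the description.

```python
from fractions import Fraction


def mirroring_overhead(failures: int) -> Fraction:
    """Mirroring keeps one extra full copy per tolerated failure,
    so the overhead is failures x 100%."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return Fraction(failures, 1)


def erasure_overhead(n: int, m: int) -> Fraction:
    """An N+M erasure code tolerates M failures with overhead M/N."""
    if n < 1 or m < 0:
        raise ValueError("expected n >= 1 and m >= 0")
    return Fraction(m, n)


# Tolerating two failures: mirroring versus N+2 codes of growing N.
print("mirroring:", f"{float(mirroring_overhead(2)):.0%}")
for n in (4, 8, 16, 32):
    print(f"{n}+2:", f"{float(erasure_overhead(n, 2)):.2%}")
```

For a fixed M, the erasure coding overhead shrinks as N grows, whereas the mirroring overhead depends only on the number of failures tolerated.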
Thus, there is a dire need to provide a system and method that efficiently and economically utilizes larger block sizes for a logical disk and decomposes them into smaller physical block sizes for redundant encoding, using erasure coding logic that avoids a read-modify-write operation on a plurality of write operations. Further, there is also a need for an erasure coding scheme in which, absent failures, the writes result in no amplification even if the reads result in some amplification.
All publications herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
In some embodiments, the numbers expressing quantities or dimensions of items, and so forth, used to describe and claim certain embodiments of the invention are to be understood as being modified in some instances by the term “about.” Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.
As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.
The use of any and all examples, or exemplary language (e.g. “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the invention.
Groupings of alternative elements or embodiments of the invention disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability.