Information that is used to access a stored digital item is referred to herein as the “access key” of the stored item. In typical file systems, stored items are retrieved based on (a) the location at which the items are stored, and (b) a name or identifier of the items. For example, if a file named “foo.txt” is located in a directory named “c:\myfiles\text”, then applications may use the pathname “c:\myfiles\text\foo.txt” as the access key to retrieve the file from the file system. Because conventional access keys are based on the location of the items being retrieved, the access keys change when the items are moved. In addition, each copy of an item has a different access key, because each copy is stored at a different location.
In contrast to conventional file systems, Content Addressable Storage (CAS) systems allow applications to retrieve items from storage based on a hash value that is generated from the content of the items. Because CAS systems perform storage-related operations on items based on the hash values generated for the items, and the hash values are based on the content of the items rather than where the items are stored, the applications that request the operations may do so without knowing the number or location of the stored copies of the items. For example, a CAS system may store multiple copies of an item X at locations A, B and C. An application that desires to retrieve item X would do so by sending to the CAS system a hash value that is based on the contents of item X. Based on that hash value, the CAS system would provide to the application a copy of item X retrieved from one of the locations A, B, and C. Thus, the application would obtain item X without knowing where item X was actually stored, how many copies of item X existed, or the specific location from which the retrieved copy was actually obtained.
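The retrieval behavior described above may be sketched as follows. This is a minimal illustration only: the class name TinyCAS, the use of SHA-256 as the hash function, and the in-memory dictionaries standing in for locations A, B, and C are all assumptions made for the example and are not taken from any particular CAS system.

```python
import hashlib


class TinyCAS:
    """Toy content-addressable store (hypothetical, for illustration only)."""

    def __init__(self, locations=("A", "B", "C")):
        # One dictionary per storage location; each maps access key -> content.
        self.locations = {name: {} for name in locations}

    def put(self, content: bytes) -> str:
        # The access key is derived from the content, not from where it is stored.
        key = hashlib.sha256(content).hexdigest()
        for store in self.locations.values():
            store[key] = content
        return key

    def get(self, key: str) -> bytes:
        # The caller never learns which location served the request.
        for store in self.locations.values():
            if key in store:
                return store[key]
        raise KeyError(key)


cas = TinyCAS()
key = cas.put(b"item X")
item = cas.get(key)
```

Because the key depends only on the content, the application in this sketch can still retrieve item X after a copy is removed from one location, so long as at least one copy remains.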
Data stored using CAS is often replicated across two or more data centers. When a set of chunks is replicated across multiple data centers, the many replicas of the chunk set are supposed to remain identical. However, in practice, chunk set replicas have small differences. These differences may result from a variety of causes, including data corruption and replication latency.
FIG. 1 is a block diagram that illustrates a scenario in which replicas of the same chunk set are contained in two chunk stores 100 and 102, which may be maintained by two data centers. The replica that is stored in chunk store 100 is shown as replica A, while the replica that is stored in chunk store 102 is shown as replica B. In both chunk stores 100 and 102, the replicas are partitioned across two storage devices. Specifically, replica A is partitioned across storage devices 110 and 120 using horizontal partitioning, while replica B is partitioned across storage devices 112 and 122 using vertical partitioning.
Horizontal partitioning involves selecting where to store chunks based on the range into which their access keys fall. Thus, in chunk store 100, chunks with the access keys in the range MIN to N are stored on storage device 110, while chunks with access keys in the range N+1 to MAX are stored on storage device 120.
In contrast, vertical partitioning involves storing all chunks on a particular device up to a particular point in time (e.g. when the disk becomes full), and then after that point in time storing all new chunks on a different device. Thus, in chunk store 102, chunks for the entire access key range MIN to MAX are stored on storage device 112 until time T1, and after time T1 chunks for the entire access key range MIN to MAX are stored on storage device 122.
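The two placement policies may be contrasted with a short sketch. The concrete values chosen for MIN, MAX, N, and T1 are hypothetical, as is the choice to treat a chunk stored exactly at time T1 as belonging to storage device 112.

```python
MIN, MAX = 0, 2**32 - 1  # hypothetical access-key range
N = MAX // 2             # hypothetical split point for chunk store 100
T1 = 1_000               # hypothetical cut-over time for chunk store 102


def horizontal_device(access_key: int) -> int:
    """Chunk store 100: the device is chosen by access-key range."""
    return 110 if access_key <= N else 120


def vertical_device(store_time: int) -> int:
    """Chunk store 102: the device is chosen by when the chunk was stored."""
    return 112 if store_time <= T1 else 122
```

Under horizontal partitioning a given access key always maps to the same device, whereas under vertical partitioning the same access key may reside on either device depending on when it was stored.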
Chunk stores 100 and 102 are merely simple examples of how data centers may internally organize the chunks that belong to the replicas they maintain. The organization of the chunk data may become arbitrarily complex, involving a combination of horizontal and vertical partitioning, as well as within-system replication and caching. The techniques described herein are not limited to any particular internal chunk store organization.
As mentioned above, replicas A and B may deviate from each other due to data corruption or latency issues. With respect to data corruption, data-corruption-produced deviation between replicas may occur, for example, when disks fail, when individual sectors of a disk fail, or when stored data becomes scrambled. In addition, NAND flash chips (as used in SSDs) have progressive decay that may result in corruption of the data stored therein.
Even in the absence of any failure, replicas A and B may differ because of latency-produced deviation. Specifically, replication takes some time, and the replicas continue to evolve (i.e. new PUT operations are being done) while the replication proceeds. Thus, even if it were possible to perform an instantaneous comparison of the state of replicas A and B, the replicas would differ because some chunks that chunk store 100 has finished storing into replica A have not finished being stored into replica B by chunk store 102, and vice versa.
It is possible to adopt protocols that attempt to pro-actively avoid corruption-produced deviations. For example, in some systems, PUT operations are sent to all replicas (e.g. all 3 replicas) but are assumed to succeed if a majority of replicas acknowledge the PUT (e.g. 2 out of 3 replicas). In the case where a replica has not acknowledged a PUT, the replication system typically exerts best efforts to make sure the replica that did not acknowledge the PUT ends up having the chunk being PUT.
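The majority-acknowledgment rule may be sketched as follows. The function name quorum_put, the dictionaries standing in for replicas, and the reachable argument (which simulates which replicas acknowledge) are hypothetical and serve only to illustrate the rule.

```python
def quorum_put(replicas, key, chunk, reachable):
    """Send a PUT to every replica; succeed if a majority acknowledges.

    replicas:  dict mapping replica name -> dict of stored chunks
    reachable: set of replica names that acknowledge (simulated)
    Returns (succeeded, pending), where pending lists the replicas that
    did not acknowledge and still need the chunk on a best-effort basis.
    """
    acks = 0
    pending = []
    for name, store in replicas.items():
        if name in reachable:
            store[key] = chunk
            acks += 1
        else:
            pending.append(name)
    succeeded = acks > len(replicas) // 2
    return succeeded, pending


replicas = {"r1": {}, "r2": {}, "r3": {}}
ok, pending = quorum_put(replicas, "k1", b"chunk", reachable={"r1", "r2"})
```

In this sketch the PUT is reported as successful with 2 of 3 acknowledgments, and the returned list identifies the replica to which the chunk must still be delivered.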
As another example, when a request to retrieve a chunk is made based on the access key of the chunk, the system may check whether the requested chunk was found at all replicas. If any replica failed to find the requested chunk, a copy of the chunk (obtained from a replica that succeeded in finding the chunk) may be PUT in each replica in which the retrieval operation failed.
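The repair-on-retrieval behavior may be sketched as follows, again using dictionaries as stand-ins for replicas. The function name get_with_repair is hypothetical.

```python
def get_with_repair(replicas, key):
    """Look up a chunk at every replica and repair those that lack it.

    replicas: list of dicts mapping access key -> chunk
    Returns the chunk, or None if no replica has it.
    """
    found = None
    missing = []
    for store in replicas:
        if key in store:
            found = store[key]
        else:
            missing.append(store)
    if found is not None:
        # PUT a good copy into each replica where the retrieval failed.
        for store in missing:
            store[key] = found
    return found


replica_a = {"k1": b"chunk"}
replica_b = {}
result = get_with_repair([replica_a, replica_b], "k1")
```

Note that the repair in this sketch is triggered only by a retrieval of the chunk in question; a chunk that is never requested is never checked.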
Unfortunately, such pro-active efforts to prevent or recover from corruption-produced deviation cannot guarantee that replicas will not remain in a corrupt state indefinitely. For example, if a particular chunk becomes scrambled in replica A, then the corruption of the chunk may go undetected as long as the particular chunk is not the subject of a subsequent GET or PUT operation. Consequently, approaches have been developed for periodically checking the consistency between the replicas of a chunk set.
One approach for checking the consistency between replicas of a chunk set is referred to herein as the “ALGO1” approach. According to the ALGO1 approach, differences across two replicas are detected by comparing (a) the set of access keys of all chunks in one replica with (b) the set of access keys of all chunks in another replica, and computing the differences between the two sets. This algorithm would require transmission of O(P) access keys (where P is the number of chunks in the chunk store). With large values of P (e.g. >10**12), this is not practical.
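The ALGO1 approach amounts to a set difference over the complete key sets, as the following minimal sketch shows. The function name algo1_diff is hypothetical, and the transmission of keys between chunk stores is not modeled; both key sets are simply assumed to be available in one place.

```python
def algo1_diff(keys_a, keys_b):
    """Compare the full access-key sets of two replicas.

    Returns (only_in_a, only_in_b). Every key of at least one replica
    must be available to the comparing side, hence O(P) keys transmitted.
    """
    a, b = set(keys_a), set(keys_b)
    return a - b, b - a


only_a, only_b = algo1_diff({1, 2, 3, 4}, {2, 3, 4, 5})
```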
The access keys of chunks in a replicated chunk set are often referred to as “hashes”, because they are typically generated by applying a hash function to the content of chunks. The set of all hashes of a chunk store can be represented by a ring, denoting the hashes in lexicographic order. The full set of hashes (that is, the entire range of access keys) is represented by the whole ring, while a range of hashes can be represented by a slice of the ring, a “hash slice”.
Referring to FIG. 3A, it is a block diagram that depicts rings 300 and 302 that respectively represent the range of access keys used by chunk stores 100 and 102. Ring 300 has been subdivided into four slices A1, A2, A3, and A4. Similarly, ring 302 has been subdivided into four slices B1, B2, B3, and B4. Each slice represents a hash value range. In the illustrated example, rings 300 and 302 have been divided such that each slice in ring 300 has the same hash range as a corresponding slice in ring 302. Thus, slice A1 corresponds to the same hash range as slice B1, slice A2 corresponds to the same hash range as slice B2, etc. Two slices that correspond to the same hash range are referred to herein as a slice pair. Thus, slices A1 and B1 form a slice pair, slices A2 and B2 form a slice pair, etc. In FIG. 3A, the slice pair formed by slices A2 and B2 is identified as slice pair 304, while the slice pair formed by slices A3 and B3 is identified as slice pair 306.
Various techniques can be used to divide the ring into slices. For example, the whole ring may be subdivided in two, resulting in one slice for access keys MIN to ½MAX and one slice for access keys ½MAX+1 to MAX. If the resulting slices are not sufficiently small, then each slice may be further divided into two. This process of subdividing slices may be repeated any number of times until the slices are sufficiently small. Thus, the whole ring can be divided into 2**N hash slices, for N=0, 1, 2, etc. In FIG. 3A, the rings have been divided into four slices (i.e. N=2).
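The repeated halving described above may be sketched as follows, assuming, for illustration only, that access keys are unsigned integers of a fixed width KEY_BITS. Under that assumption, the slice to which a key belongs is given by the top N bits of the key.

```python
KEY_BITS = 32  # hypothetical access-key width, in bits


def slice_ranges(n: int):
    """Return the (low, high) key range of each of the 2**n hash slices."""
    width = 2**KEY_BITS >> n
    return [(i * width, (i + 1) * width - 1) for i in range(2**n)]


def slice_index(access_key: int, n: int) -> int:
    """Return the index of the slice containing access_key (its top n bits)."""
    return access_key >> (KEY_BITS - n)


four_slices = slice_ranges(2)  # corresponds to N=2, as in FIG. 3A
```

With N=0 the single slice covers the whole ring, and each increment of N halves the width of every slice.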
By dividing the rings 300 and 302 of chunk stores 100 and 102 into hash slices, “ALGO1” may be refined to be more efficient. The refinement “ALGO2” involves dividing the rings into 2**N hash slices, and then computing the per-slice-pair differences. For example, the access keys in slice A1 can be sent from chunk store 100 to chunk store 102 to be compared to the access keys in slice B1. Then the access keys in slice A2 can be sent from chunk store 100 to chunk store 102 to be compared to the access keys in slice B2, etc. This cuts the between-store transmission of hashes into smaller packets, which is more practical, and lends itself to parallelism.
Unfortunately, ALGO2 still ends up exchanging O(P) hashes, and does not focus on where the differences are. For example, in the extreme case where only 1 hash differs between replicas A and B, the amount of work done using ALGO2 is still O(P), which is expensive. In normal operation, P is very large and there are relatively few differences.
To avoid exchanging O(P) hashes, another technique “ALGO3” uses Merkle trees (or hash trees) to compute and maintain a tree of checksums. The checksum generation technique is chosen such that two sets have a very low probability of being different if their checksums are the same. An example of a checksum is the XOR of the hashes in a set. ALGO3 works per slice, first producing a checksum. If the checksum is the same for the corresponding slice in the other replica, then the algorithm considers the slices to be the same. If the checksums differ, the hashes are enumerated, as in ALGO2. In the worst case, O(P) hashes are enumerated, as in ALGO2, but in the case of just 1 hash differing, only O(P/(2**N)) hashes are enumerated. For example, when the rings are split into four slices as shown in FIG. 3A, if slices A2 and B2 are the only slice pair whose checksums do not match, then only the hashes that fall into slices A2 and B2 (approximately ¼ of all hashes in the replicas) need to be compared with each other. The comparison of the checksums requires O(2**N) checksums to be transmitted.
Unfortunately, ALGO3 still expends a lot of energy unnecessarily. Specifically, latency-produced deviation is likely to produce many “false positives”, since the checksum is designed to capture any difference with a high probability. In practice, checksums may differ for every slice-to-slice comparison unless the slices are very thin (high N). Since the enumeration of slices is O(2**N), a high N is undesirable and increases latency of the whole replication.
Another problem with ALGO3 is that, for some areas of the hash ring, a low N would be sufficient, while other areas require a high N. These differences along the ring may arise, for example, because disks that store a certain hash range depopulate the slice that corresponds to that hash range when they fail, and because replication itself tends to make the density of the slices uneven. Because the need for depth is not known ahead of time, N is rarely ideally chosen.
Further, Merkle trees are fairly expensive to maintain. A Merkle tree is updated in O(LOG(P)) time and occupies O(P) memory, and the constant factor for memory is quite high since hashes must be kept (e.g. when the checksum is an XOR of the hashes). Merkle trees are also hard to maintain incrementally across chunk stores that are unions. For example, chunk store 102 represents an overlapping union situation, where chunks for the same slice may be on both storage device 112 and storage device 122. The best way to compute the checksum for a slice under these circumstances is to enumerate the hashes of that slice for each storage device in the union.
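The per-slice-pair comparison may be sketched as follows. The 16-bit key width, the function names, and the counter of keys sent are assumptions made for the example; the sketch processes slice pairs sequentially, although each pair could be handled independently.

```python
from collections import defaultdict

KEY_BITS = 16  # hypothetical access-key width, in bits


def by_slice(keys, n):
    """Group access keys by the index of the hash slice they fall into."""
    slices = defaultdict(set)
    for k in keys:
        slices[k >> (KEY_BITS - n)].add(k)
    return slices


def algo2_diff(keys_a, keys_b, n):
    """Compare two replicas slice pair by slice pair.

    Returns (diffs, sent): diffs maps slice index -> (only_in_a, only_in_b)
    for slice pairs that differ; sent counts the keys of replica A that
    would be transmitted to the other chunk store.
    """
    sa, sb = by_slice(keys_a, n), by_slice(keys_b, n)
    diffs, sent = {}, 0
    for i in range(2**n):
        a, b = sa.get(i, set()), sb.get(i, set())
        sent += len(a)  # every slice of A is sent, whether or not it differs
        if a != b:
            diffs[i] = (a - b, b - a)
    return diffs, sent


keys_a = set(range(0, 0x10000, 0x1000))  # 16 keys, 4 per slice when n=2
keys_b = keys_a - {0x5000}               # replica B lacks one chunk
diffs, sent = algo2_diff(keys_a, keys_b, n=2)
```

In this sketch a single missing key is correctly localized to one slice pair, yet all of the keys of replica A are still counted as sent.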
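The checksum-first comparison may be sketched as follows, using the XOR checksum mentioned above. For brevity the sketch computes a flat list of per-slice checksums rather than maintaining a full Merkle tree, which is a simplification; the 16-bit key width and the function names are likewise assumptions.

```python
KEY_BITS = 16  # hypothetical access-key width, in bits


def slice_checksums(keys, n):
    """XOR together the access keys that fall into each of the 2**n slices."""
    sums = [0] * (2**n)
    for k in keys:
        sums[k >> (KEY_BITS - n)] ^= k
    return sums


def algo3_diff(keys_a, keys_b, n):
    """Exchange checksums first; enumerate keys only where checksums differ.

    Returns (diffs, enumerated): diffs maps slice index -> (only_in_a,
    only_in_b); enumerated counts the keys of replica A that had to be listed.
    """
    ca, cb = slice_checksums(keys_a, n), slice_checksums(keys_b, n)
    diffs, enumerated = {}, 0
    for i in range(2**n):
        if ca[i] == cb[i]:
            continue  # slices are assumed identical
        a = {k for k in keys_a if k >> (KEY_BITS - n) == i}
        b = {k for k in keys_b if k >> (KEY_BITS - n) == i}
        enumerated += len(a)
        diffs[i] = (a - b, b - a)
    return diffs, enumerated


keys_a = set(range(0, 0x10000, 0x1000))  # 16 keys, 4 per slice when n=2
keys_b = keys_a - {0x5000}               # replica B lacks one chunk
diffs, enumerated = algo3_diff(keys_a, keys_b, n=2)
```

With a single differing key, only the keys of the one mismatching slice are enumerated in this sketch, which is roughly ¼ of the keys when the ring is divided into four slices.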
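The difficulty with overlapping unions can be seen in a small sketch that uses the XOR checksum. The device contents and function names are hypothetical. Because XORing a value with itself yields zero, a hash that is present on both devices cancels out if the per-device checksums are simply combined, so the sketch instead enumerates and de-duplicates the hashes before computing the checksum.

```python
from functools import reduce
from operator import xor


def xor_checksum(keys):
    """XOR of all access keys in a collection (0 for an empty collection)."""
    return reduce(xor, keys, 0)


def union_checksum(devices):
    """Checksum of a slice spread across devices that may hold duplicates."""
    return xor_checksum(set().union(*devices))


dev_112 = {0x11, 0x22}  # hypothetical hashes of one slice on device 112
dev_122 = {0x22, 0x44}  # same slice on device 122; 0x22 is on both devices

correct = union_checksum([dev_112, dev_122])
naive = xor_checksum(dev_112) ^ xor_checksum(dev_122)  # 0x22 cancels out
```

In this sketch the naively combined value differs from the checksum of the de-duplicated union, which illustrates why per-device checksums cannot simply be merged when the devices overlap.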
Consequently, it is desirable to provide techniques for efficiently detecting when and in which slices of the hash ring replicas of the same chunk set cease to match, so that the discrepancies between the replicas can be quickly corrected.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.