The present invention relates to distributed data storage systems, and more specifically to a method, system and computer program product for performing data deduplication for an eventually consistent distributed data storage system, where clients can read, write and delete data without any coordination between them, and the implementation is lock-free. Without loss of generality we explain our method for a distributed object storage system; it also applies to other distributed systems, e.g., for block and file storage.
Methods and systems exist for data deduplication in a distributed data storage system. Such a distributed data storage system typically comprises a plurality of data storage devices such as, e.g., servers with direct attached storage (e.g., disks), connected together by some type of network and possibly located in a cloud. Such a system also commonly maintains multiple copies (replicas) of its data on a plurality of the servers (i.e., redundant data) so as to make the data more durable and less likely to be lost in the event of failure. Without loss of generality, these copies could be erasure coded.
When a new version of an object is written and stored in a distributed storage system, it needs to be propagated to all of its replicas. Furthermore, there may also be storage metadata that needs to be propagated and/or updated. This propagation takes time, however. Thus, there may be a period of time (albeit usually relatively small) in which one or more replicas have the new data while other replicas have not yet been created or still hold a previous version of the data. As a result, two clients that read the object at the same time may not see the same value. Eventually, the data will propagate to all of the replicas within the distributed data storage system such that the replicas will be consistent (hence, the term “eventually consistent”). The motivation for building such eventually consistent storage systems is the CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency (all nodes see the same data at the same time), availability (a guarantee that every request receives a response about whether it succeeded or failed), and partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures).
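The propagation window described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `Replica`, `write` and `propagate` are hypothetical, the replicas live in a single process, and conflicts are resolved by keeping the highest version number, which is only one of several possible policies.

```python
class Replica:
    """One copy of the data; keeps the highest version seen per key."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        current = self.data.get(key)
        # Assumed policy: a newer version replaces an older one.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


replicas = [Replica(), Replica(), Replica()]
pending = []  # updates not yet delivered: (replica_index, key, version, value)


def write(key, version, value):
    """Apply the write to one replica and queue it for the others."""
    replicas[0].apply(key, version, value)
    for i in range(1, len(replicas)):
        pending.append((i, key, version, value))


def propagate():
    """Deliver every queued update (stands in for background replication)."""
    while pending:
        i, key, version, value = pending.pop(0)
        replicas[i].apply(key, version, value)


write("obj", 1, "old")
propagate()                 # all replicas now hold version 1
write("obj", 2, "new")      # only replica 0 holds version 2 so far

before = [r.read("obj") for r in replicas]  # clients may see different values
propagate()
after = [r.read("obj") for r in replicas]   # replicas have converged
```

Between the second `write` and the second `propagate`, a client reading from replica 0 and a client reading from replica 1 obtain different values for the same object; once propagation completes, all replicas agree.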
Data deduplication generally refers to a method that reduces the amount of data storage space needed to store data. Various methods of data deduplication exist. For example, different storage objects may contain identical content. Storing this duplicate data separately for each object is inefficient as it results in an excess amount of data storage space being utilized to store the same content.
Instead, data deduplication stores a piece of content once. Typically data deduplication employs a cryptographic hash function to identify duplicate content (with extremely high probability two pieces of content have the same hash only if they are identical) and maintains a dictionary of the content that has already been stored. When new data is written, the hash of its content is checked against the dictionary to see if the content is new. If new, a new content entity is created and a new entry is made for it in the dictionary. If a duplicate, an indication is made (e.g., a reference count is increased), and a pointer or some other identifier is used to reference that content. The data deduplication method may take place at the object level, at the file level or at a finer-grained data block level. The data pointer or other identifier usually takes up far less storage space than the piece of data itself. As a result, use of a data deduplication method can result in the saving of a relatively large amount of data storage space in a distributed data storage system. For example, consider a storage system for email attachments. In a deduplicated system, the content of a particular attachment might be stored once (with appropriate redundancy for that content object), rather than once for each time it was sent in an email. The calculation of the hash and/or the detection of duplication may occur on the client side or in the storage system itself.
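The dictionary-and-reference-count scheme described above can be sketched as follows. This is a minimal single-process illustration, not the storage system itself: the class `DedupStore` and its method names are hypothetical, SHA-256 is assumed as the cryptographic hash, and deduplication is performed at the object level.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps each distinct content only once."""

    def __init__(self):
        self.content = {}   # hash -> stored bytes (the dictionary)
        self.refcount = {}  # hash -> number of objects referencing it
        self.objects = {}   # object name -> hash (the pointer)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        old = self.objects.get(name)
        if old == digest:
            return digest                 # object already holds this content
        if digest not in self.content:    # new content: create an entry
            self.content[digest] = data
            self.refcount[digest] = 0
        self.refcount[digest] += 1        # duplicate or new: add a reference
        self.objects[name] = digest
        if old is not None:               # overwrite: drop the old reference
            self._release(old)
        return digest

    def get(self, name):
        return self.content[self.objects[name]]

    def delete(self, name):
        self._release(self.objects.pop(name))

    def _release(self, digest):
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:    # no object references the content
            del self.content[digest]
            del self.refcount[digest]
```

In the email example, calling `put("mail1/att", b"report")` and then `put("mail2/att", b"report")` leaves a single entry in `content` with a reference count of two. Deleting one of the two objects leaves the content readable through the other.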
Sometimes it may be desired to delete a piece of data that has been previously deduplicated within a distributed data storage system, for example when that piece of data is no longer being referenced by any client. However, a potential issue with the deletion of deduplicated data within an eventually consistent distributed data storage system is a race condition: while the system is in the process of deleting a particular piece of deduplicated data, it may appear that no client is attempting to reference that data when, in reality, a client is simultaneously in the process of referencing it. That is, two conflicting operations (the deletion of the data and access to it) are attempted on the same piece of data at the same time. As a result, that particular piece of deduplicated data cannot be safely deleted from the distributed data storage system.
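The race can be made concrete by replaying one problematic interleaving step by step. The sketch below is deterministic and single-threaded on purpose. The variable names, the hash value `"h1"` and the object name are all hypothetical, and the stale snapshot stands in for a replica that has not yet received the writer's update.

```python
# Shared state of a toy deduplicated store (all names are hypothetical).
content = {"h1": b"attachment bytes"}  # hash -> stored content
references = {}                         # object name -> hash


def count_refs(digest, view):
    """Count the objects in the given view that point at the content."""
    return sum(1 for h in view.values() if h == digest)


# Step 1: the deleter inspects a (soon to be stale) view of the references
# and concludes that nothing points at h1.
deleter_view = dict(references)
safe_to_delete = count_refs("h1", deleter_view) == 0

# Step 2: a writer finds that h1 is already stored, so it records a
# reference to the existing content instead of uploading the bytes again.
if "h1" in content:
    references["mail2/att"] = "h1"

# Step 3: the deleter acts on the decision it made in step 1.
if safe_to_delete:
    del content["h1"]

# The writer's object now points at content that no longer exists.
dangling = [name for name, h in references.items() if h not in content]
```

After step 3, `dangling` contains `"mail2/att"`: the reference was recorded successfully, yet the content behind it has been removed. Neither operation observed the other, which is why an uncoordinated deletion of this kind is unsafe.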
What is needed is an eventually consistent distributed data storage system that utilizes a data deduplication method and allows for the safe deletion of deduplicated data. It is also desirable to avoid sending data content over the network (“over-the-wire”) into the system when that data content already exists in the distributed data storage system.