Scalability is an important requirement in many data storage systems, particularly in network-oriented storage systems such as network attached storage (NAS) systems and storage area network (SAN) systems. Different types of storage systems provide diverse methods of seamless scalability through storage capacity expansion. In some data storage systems, it is possible to add capacity by “virtualization.” In this type of system, multiple storage servers are utilized to field input/output (I/O) operations (i.e., reads and writes) independently, but are exposed to the initiator of the I/O operation as a single device, called a “storage cluster.” Each storage server in a cluster is called a “storage node,” a “data node” or just a “node.” In a data storage cluster, the multiple data nodes can provide distributed storage of data. When available data storage capacity becomes low, a new server may be added as a new node in the data storage system. In addition to contributing increased storage capacity, the new storage node contributes other computing resources to the system, leading to true scalability. This methodology is known as “horizontal capacity expansion.”
A process used in many storage systems that can affect scalability is data deduplication. Data deduplication is an important feature for data storage systems, particularly for distributed data storage systems. It is a technique to improve data storage utilization by reducing data redundancy. A data deduplication process identifies duplicate data and replaces the duplicate data with references that point to data stored elsewhere in the data storage system. A deduplication technique that works well in a data storage system that contains a large number of nodes and permits new nodes to be added is desirable. However, existing deduplication techniques for distributed storage systems suffer from certain deficiencies, as discussed below.
One technique for deduplicating data in a distributed storage system is inline deduplication. Inline deduplication deduplicates data before it is stored to long-term storage (e.g., disks), also called primary storage. This technique works well for certain workloads such as backup streams, but tends to be far less effective when used with typical primary storage workloads. Also, the node that performs inline deduplication tends to become a central bottleneck. Inline deduplication also normally requires a complete copy of a “chunk map” on each node that performs inline deduplication. A chunk map (also referred to as a chunk data structure) in this context is a data structure that contains associations between identifiers (IDs) of data chunks stored in the system and “fingerprints” of the data chunks. A “data chunk” is a contiguous portion of a data object. Fingerprints are unique values generated by a hashing algorithm, which can be used by a compare operation to detect possible duplicate data chunks quickly. The copies of the chunk map on separate nodes need to be synchronized frequently to avoid accidental data loss and inconsistent deduplication. Thus, for a system having a large number of nodes, the frequent need for high-volume synchronization negates the potential benefits of having multiple copies of the chunk map on separate nodes. Therefore, it is difficult to scale inline deduplication to a system or cluster containing a large number of nodes.
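The chunk map and fingerprint comparison described above can be illustrated with a minimal single-node sketch. It is an illustration only and does not represent any particular product or the synchronization between nodes. The names `ChunkStore`, `fingerprint`, and `split_into_chunks`, the fixed chunk size, and the choice of SHA-256 as the hashing algorithm are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow (assumption)


def fingerprint(chunk: bytes) -> str:
    """Return a fingerprint for a chunk (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split a data object into contiguous fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical single-node store that deduplicates before writing."""

    def __init__(self):
        self.chunk_map = {}  # fingerprint -> chunk ID
        self.chunks = {}     # chunk ID -> stored bytes
        self._next_id = 0

    def write(self, data: bytes):
        """Store a data object and return the list of chunk IDs (references)."""
        refs = []
        for chunk in split_into_chunks(data):
            fp = fingerprint(chunk)
            existing = self.chunk_map.get(fp)
            # A fingerprint match only flags a *possible* duplicate;
            # confirm with a byte-for-byte comparison before reusing it.
            if existing is not None and self.chunks[existing] == chunk:
                refs.append(existing)
                continue
            chunk_id = self._next_id
            self._next_id += 1
            self.chunks[chunk_id] = chunk
            # Keep the first mapping if a fingerprint is already present.
            self.chunk_map.setdefault(fp, chunk_id)
            refs.append(chunk_id)
        return refs

    def read(self, refs):
        """Reassemble a data object from its chunk references."""
        return b"".join(self.chunks[r] for r in refs)


store = ChunkStore()
refs = store.write(b"abcdabcdefgh")  # the first two chunks are identical
```

In this sketch the repeated chunk is stored once and referenced twice. In a cluster where several nodes perform inline deduplication, each node would need its own up-to-date copy of `chunk_map`, which is the synchronization burden noted above.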
Content addressing is another technique used in some distributed storage systems to facilitate deduplication. Content addressing routes data blocks to specific nodes based on the hashes of the contents of the data blocks. In a distributed system using content addressing, the storage load is often balanced across the nodes in the system, and each node is assigned to store data for a specific range of hashes (addresses). When a new node is added to the system, the storage load needs to be re-balanced. This re-balancing results in high volumes of network traffic. Further, nodes in the system can be overloaded by operations including negotiating which range of addresses each node stores, moving relevant data to the new locations, deleting the data from the old locations, and updating metadata. This technique therefore leads to considerable inter-node data transfer traffic, and thus to poor scalability, when new nodes are added to the distributed storage system.
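The re-balancing cost of content addressing can be illustrated with a short sketch. It assumes, purely for illustration, that the hash space is divided into equal contiguous ranges with one range per node, and that SHA-256 truncated to 32 bits serves as the address. The names `address`, `owner`, and `moved_fraction` are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 32  # size of the address space (assumption)


def address(block: bytes) -> int:
    """Derive a content address from the block contents."""
    return int.from_bytes(hashlib.sha256(block).digest()[:4], "big")


def owner(addr: int, num_nodes: int) -> int:
    """Return the node that stores addr when the hash space is split
    into num_nodes equal contiguous ranges."""
    return addr * num_nodes // HASH_SPACE


def moved_fraction(blocks, old_nodes: int, new_nodes: int) -> float:
    """Fraction of blocks whose owning node changes after re-balancing."""
    moved = sum(
        1 for b in blocks
        if owner(address(b), old_nodes) != owner(address(b), new_nodes)
    )
    return moved / len(blocks)


# Synthetic blocks with distinct contents, used only for the demonstration.
blocks = [f"block-{i}".encode() for i in range(10000)]
fraction = moved_fraction(blocks, old_nodes=4, new_nodes=5)
```

Under these assumptions, growing a cluster from four nodes to five moves roughly half of the stored blocks to a different node, because every range boundary shifts. Other range-assignment schemes would move different amounts of data, but any scheme that keeps the load balanced must transfer some data to the new node.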