This disclosure relates to selecting a storage node in a storage system to store data and, more specifically, to techniques for improving deduplication efficiency in a storage system with multiple storage nodes.
In computing, data deduplication is a technique for eliminating duplicate copies of data. Data deduplication is used to reduce storage utilization and can also be applied to network data transfers to reduce the number of bytes transmitted. In the deduplication process, unique blocks of data (i.e., data chunks) are identified and a hash of their content is stored during a process of analysis. As the analysis continues, hashes of other chunks of data are compared to the stored hashes and, when a match occurs, the redundant chunk of data is replaced with a reference that points to the stored chunk of data. The content of the two chunks of data may also be compared to confirm that the hash comparison did not produce a false positive. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency may depend on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. For example, a typical email system may contain one-hundred (100) instances of the same one (1) megabyte (MB) file attachment. Without data deduplication, each time the email system is backed up, all one-hundred (100) instances of the attachment may be stored, requiring one-hundred (100) MB of storage space. When data deduplication is implemented, only one instance of the attachment is actually stored and each subsequent instance is replaced with a reference to the stored instance. In general, storage-based data deduplication reduces the amount of storage needed for a given data set.
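The hash-and-reference process described above may be illustrated by the following minimal sketch. The sketch is not taken from any particular storage system; the names `ChunkStore`, `backup`, and `restore`, the use of SHA-256, and the fixed-size chunking are assumptions made solely for illustration.

```python
import hashlib

MB = 1024 * 1024


class ChunkStore:
    """Hypothetical store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (each chunk stored once)

    def put(self, chunk: bytes) -> str:
        """Store the chunk if it is new; return a reference (its digest)."""
        digest = hashlib.sha256(chunk).hexdigest()
        existing = self.chunks.get(digest)
        if existing is None:
            self.chunks[digest] = chunk
        elif existing != chunk:
            # Content comparison guards against a false-positive hash match.
            raise ValueError("hash collision between different chunks")
        return digest

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


def backup(store, files, chunk_size):
    """Split each file into fixed-size chunks; return one reference list per file."""
    return [
        [store.put(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
        for data in files
    ]


def restore(store, refs) -> bytes:
    """Rebuild a file by following its references to the stored chunks."""
    return b"".join(store.chunks[r] for r in refs)


if __name__ == "__main__":
    # The email example: 100 instances of the same 1 MB attachment,
    # treated here as one chunk per attachment (an assumption).
    store = ChunkStore()
    attachment = bytes(MB)
    recipes = backup(store, [attachment] * 100, chunk_size=MB)
    print(100 * len(attachment) // MB, "MB logical,",
          store.stored_bytes() // MB, "MB stored")
```

In this sketch each backed-up file is represented only by its list of references, so the one-hundred (100) attachment instances share a single stored chunk.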
In-line data deduplication storage systems conventionally perform deduplication in real time by eliminating duplicate data copies as the data enters a storage system and before the data is written to a storage medium. When a storage system receives new data, the storage system determines if the new data corresponds to existing data that is already stored and, if so, the storage system references the existing data and does not store the new data. With background data deduplication, new data is first stored on the storage system and then a background process is initiated at a later point in time to search for duplicate data. A benefit of background data deduplication is that there is no need to wait for hash computation and lookup to be completed before storing incoming data, such that write performance of the storage system is not degraded by those operations. A drawback of background data deduplication is that duplicate data is temporarily stored, which may be an issue if a storage system is near full capacity. A benefit of in-line data deduplication over background data deduplication is that in-line data deduplication requires less storage, as data is not duplicated in the storage system. However, given that hash computations and lookups may take a relatively long time to perform, data ingestion for in-line data deduplication can be slower than for background data deduplication, thereby reducing write throughput of a storage system. Storage systems supporting deduplication typically implement one of these two techniques or a combination thereof and may perform data deduplication at a global level or at a storage node level.
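The trade-off between the two techniques may be illustrated by the following sketch, which tracks the storage consumed under each write path. The class names `InlineStore` and `BackgroundStore`, the use of SHA-256, and the byte accounting are hypothetical and are assumed only for purposes of illustration.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class InlineStore:
    """Hypothetical in-line path: hash and look up before writing."""

    def __init__(self):
        self.medium = {}  # digest -> chunk
        self.used_bytes = 0
        self.peak_bytes = 0

    def write(self, chunk: bytes) -> None:
        key = digest(chunk)  # hash computation and lookup delay every write
        if key not in self.medium:
            self.medium[key] = chunk
            self.used_bytes += len(chunk)
            self.peak_bytes = max(self.peak_bytes, self.used_bytes)


class BackgroundStore:
    """Hypothetical background path: write first, deduplicate later."""

    def __init__(self):
        self.pending = []  # chunks written as received, not yet analyzed
        self.medium = {}   # digest -> chunk, populated by deduplicate()
        self.used_bytes = 0
        self.peak_bytes = 0

    def write(self, chunk: bytes) -> None:
        self.pending.append(chunk)  # no hashing on the write path
        self.used_bytes += len(chunk)
        self.peak_bytes = max(self.peak_bytes, self.used_bytes)

    def deduplicate(self) -> None:
        """Later pass that reclaims the space held by duplicate chunks."""
        for chunk in self.pending:
            key = digest(chunk)
            if key in self.medium:
                self.used_bytes -= len(chunk)
            else:
                self.medium[key] = chunk
        self.pending.clear()
```

Writing the same chunk several times to each store shows the difference: the in-line store never holds more than the unique data, whereas the background store temporarily holds every written copy until `deduplicate()` runs.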
Storage systems are sometimes composed of several storage nodes in order to scale performance and storage capacity to match the requirements of many application classes. Multi-node storage systems facilitate incrementally adding storage nodes so that the cost of a storage system remains proportional to the needs of a user. In general, distributed storage systems can offer higher throughput than a single storage node, as they aggregate the performance of many storage nodes. Moreover, multi-node storage systems can accommodate more classes of applications with different performance requirements.
In a storage system with several storage nodes, deduplication can either be performed globally (all duplicate copies of a chunk are eliminated across the storage system) or locally (only the duplicate copies of a chunk that are local to a storage node are eliminated). Global deduplication, although resulting in a higher data reduction than local deduplication, is more difficult to implement as the storage system has to manage a very large amount of globally distributed metadata. Storage systems that perform data deduplication at a storage node level usually experience lower deduplication efficiency than storage systems that perform data deduplication at a global level, because copies of the same chunk that are stored on different storage nodes are not detected. However, local deduplication has a simpler and more efficient design, as much less metadata has to be maintained.
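The difference in deduplication efficiency may be illustrated by the following sketch, which counts the chunks that remain stored under each approach. The function names and the round-robin placement of chunks on storage nodes are assumptions made only for illustration and do not represent any particular node-selection technique.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def stored_chunks_local(chunks, num_nodes: int) -> int:
    """Chunks stored when each node deduplicates only against its own index.

    Placement is round-robin (an assumption for this sketch).
    """
    node_indexes = [set() for _ in range(num_nodes)]
    for i, chunk in enumerate(chunks):
        node_indexes[i % num_nodes].add(digest(chunk))
    return sum(len(index) for index in node_indexes)


def stored_chunks_global(chunks) -> int:
    """Chunks stored when a single system-wide index covers every node."""
    return len({digest(chunk) for chunk in chunks})


if __name__ == "__main__":
    # Two distinct chunks, each written twice, spread over two nodes.
    chunks = [b"A", b"B", b"B", b"A"]
    print("local:", stored_chunks_local(chunks, num_nodes=2))
    print("global:", stored_chunks_global(chunks))
```

With this placement each of the two storage nodes receives one copy of both chunks, so local deduplication stores four chunks where global deduplication stores two. The global approach achieves this only by maintaining one index that spans all storage nodes.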