This disclosure relates to data processing and storage, and more specifically, to management of a data storage system, such as a flash-based data storage system, to optimize data deduplication.
NAND flash memory is an electrically programmable and erasable non-volatile memory technology that stores one or more bits of data per memory cell as a charge on the floating gate of a transistor or a similar charge trap structure. In a typical implementation, a NAND flash memory array is organized in blocks (also referred to as “erase blocks”) of physical memory, each of which includes multiple physical pages each in turn containing a multiplicity of memory cells. By virtue of the arrangement of the word and bit lines utilized to access memory cells, flash memory arrays can generally be programmed on a page basis, but are erased on a block basis.
As is known in the art, blocks of NAND flash memory must be erased prior to being programmed with new data. A block of NAND flash memory cells is erased by applying a high positive erase voltage pulse to the p-well bulk area of the selected block and by biasing to ground all of the word lines of the memory cells to be erased. Application of the erase pulse promotes tunneling of electrons off of the floating gates of the memory cells biased to ground to give them a net positive charge and thus transition the voltage thresholds of the memory cells toward the erased state.
Over thousands of program/erase cycles, the voltage-induced stress on the NAND flash memory cells imparted by the program-erase process causes bit error rates for the data programmed into the NAND flash memory cells to increase over time and thus limits the useful life of NAND flash memory. Consequently, it is desirable to reduce the number of program/erase cycles for NAND flash memory by decreasing the volume of data written into the NAND flash memory through data deduplication (i.e., eliminating storage of duplicate copies of data). In addition, deduplication reduces the cost per effective capacity of flash-based storage systems and can lower the space utilization of a flash-based storage system which in turn reduces the internal data storage overhead such as write amplification.
In general, during the data deduplication process, unique chunks of data (e.g., data blocks or pages) are identified and stored within the NAND flash memory. Other chunks of data to be stored within the NAND flash memory are compared to stored chunks of data, and when a match occurs, a reference that points to the stored chunk of data is stored in the NAND flash memory in place of the redundant chunk of data. Given that the same data pattern may occur dozens, hundreds, or even thousands of times (the match frequency may depend on the chunk size), the amount of data that must be stored can be greatly reduced by data deduplication.
A data storage system can perform deduplication using either or both of an in-line deduplication process and a background deduplication process. With in-line data deduplication, the data storage system determines if incoming data to be stored duplicates existing data already stored on the storage media of the data storage system by computing a hash (also referred to in the art as a “fingerprint”) of the incoming data and performing a lookup of the hash in a metadata data structure. If a match is found in the metadata data structure, the data storage system stores a reference to the existing data instead of the incoming data. Some deduplication methods may additionally perform a one-to-one comparison of the old and new data. With background deduplication, the data storage system stores all incoming write data to the storage media, and a background process subsequently searches for and replaces duplicate data with a reference to another copy of the data. Background data deduplication can decrease store latency compared to in-line deduplication because a hash computation and lookup to determine duplication of data (and optionally a one-to-one data comparison) do not need to be performed before storing incoming write data. However, background data deduplication typically employs resource-intensive background scanning and, when the deduplication ratio of the data is greater than one, requires greater storage capacity and causes increased wear on the storage media as compared to data storage systems utilizing in-line deduplication. Conversely, in-line data deduplication requires less data storage capacity and may reduce wear of the storage media, but, if not properly managed, can result in an appreciably higher store latency and in a decreased write bandwidth.
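By way of illustration only, the in-line deduplication flow described above can be sketched as follows. The class name `InlineDedupStore`, its fields, and the choice of SHA-256 as the fingerprint function are assumptions made for this sketch and are not drawn from any particular data storage system; the in-memory list merely stands in for the storage media.

```python
import hashlib


class InlineDedupStore:
    """Hypothetical in-memory model of in-line deduplication."""

    def __init__(self, verify=True):
        self.fingerprints = {}  # fingerprint -> physical address (metadata structure)
        self.media = []         # stored chunks; list index serves as physical address
        self.logical_map = {}   # logical address -> physical address (references)
        self.verify = verify    # optionally perform a one-to-one data comparison

    def write(self, lba, chunk):
        """Store a chunk; return True if it was deduplicated."""
        fp = hashlib.sha256(chunk).digest()  # compute the fingerprint
        phys = self.fingerprints.get(fp)     # look it up in the metadata
        if phys is not None and (not self.verify or self.media[phys] == chunk):
            # Duplicate found: store only a reference to the existing data.
            self.logical_map[lba] = phys
            return True
        # No usable match: store the incoming data itself.
        self.media.append(chunk)
        phys = len(self.media) - 1
        self.fingerprints.setdefault(fp, phys)
        self.logical_map[lba] = phys
        return False

    def read(self, lba):
        return self.media[self.logical_map[lba]]
```

In this sketch, the hash computation and lookup occur on the write path before any data is stored, which is the source of the additional store latency noted above.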
Regardless of whether in-line or background deduplication is employed, the data storage system is required to persistently store (e.g., in NAND flash memory) a large volume of hashes (“fingerprints”) in the metadata data structure(s). In addition, in order to achieve reasonably good performance, data storage systems typically utilize a large amount of dynamic memory (e.g., dynamic random access memory (DRAM)) to enable quick access to the metadata data structures. However, because in real world systems the size of the dynamic memory is necessarily limited, it is typical that portions of the metadata data structures have to be paged in and out from non-volatile storage, reduced in size, or completely dropped, which ultimately negatively impacts overall I/O performance and/or deduplication ratio. Consequently, the appropriate management of fingerprints presents an issue that impacts deduplication performance and thus overall I/O performance.
U.S. Pat. No. 8,392,384B1 discloses one technique for managing fingerprints in which the overall storage volume of fingerprints is managed to fit those fingerprints likely to be accessed into a dynamic memory (i.e., cache). In this approach, fingerprints are classified, via binary sampling, into sampled and non-sampled types only when the cache becomes full, and only non-sampled fingerprints are allowed to be replaced in the cache. In particular, one or more bits of a fingerprint can be used to decide to which type the fingerprint belongs, thereby reclassifying sampled entries into non-sampled ones. In this approach, all fingerprints in the fingerprint index (including those that are cached) correspond to data blocks presently stored in the deduplication storage system, meaning that fingerprints of overwritten (and hence no longer valid) data are not retained in the fingerprint index.
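The general idea of classifying fingerprints by their bits once a cache fills can be loosely illustrated as follows. This is a simplified sketch and not the method of the cited patent: the names `is_sampled` and `SampledFingerprintCache`, the use of the low-order bits, and the policy of examining one additional bit each time no replaceable entry remains are all assumptions made for illustration.

```python
def is_sampled(fp, sample_bits):
    """Assumed rule: a fingerprint is 'sampled' if its low sample_bits bits are zero."""
    value = int.from_bytes(fp, "big")
    return value & ((1 << sample_bits) - 1) == 0


class SampledFingerprintCache:
    """Hypothetical fixed-capacity fingerprint cache using binary sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.sample_bits = 0  # zero bits examined: every fingerprint counts as sampled
        self.entries = {}     # fingerprint -> location of the stored data

    def _find_victim(self):
        # Only non-sampled fingerprints are eligible for replacement.
        for fp in self.entries:
            if not is_sampled(fp, self.sample_bits):
                return fp
        return None

    def insert(self, fp, location):
        """Cache a fingerprint; return False if nothing could be replaced."""
        if fp in self.entries or len(self.entries) < self.capacity:
            self.entries[fp] = location
            return True
        # The cache is full, so classification now matters.
        victim = self._find_victim()
        if victim is None:
            # All entries are sampled: examine one more bit, which
            # reclassifies some sampled entries as non-sampled.
            self.sample_bits += 1
            victim = self._find_victim()
        if victim is None:
            return False
        del self.entries[victim]
        self.entries[fp] = location
        return True
```

Under the assumed rule, each additional bit examined roughly halves the fraction of fingerprints treated as sampled, so the protected set shrinks as pressure on the cache grows.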
U.S. Pat. No. 9,069,786B2 discloses another technique for managing fingerprints that utilizes two or more fingerprint lookup tables to store fingerprints. In this approach, a first table stores fingerprints that are more likely to be encountered, and a second table (and any additional tables) stores fingerprints that are less likely to be encountered. Based on this categorization, inline deduplication is performed for those fingerprints likely to be encountered, and background deduplication is performed for those fingerprints less likely to be encountered. In order to determine which tables should be searched, attributes indicating how much effort to put into inline deduplication are associated with data chunks or groups of data chunks.
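A multi-table arrangement of this general kind can be sketched as follows. The sketch is illustrative only and does not reproduce the method of the cited patent: the class `TwoTableDedup`, the integer `effort` attribute (the number of tables searched in-line), the `promote` helper, and the policy of registering new fingerprints in the second table are assumptions. Space reclamation and overwrites of deferred addresses are omitted for brevity.

```python
import hashlib
from collections import deque


class TwoTableDedup:
    """Hypothetical two-table deduplication model."""

    def __init__(self):
        # tables[0]: fingerprints more likely to be encountered
        # tables[1]: fingerprints less likely to be encountered
        self.tables = [{}, {}]
        self.media = []          # stored chunks; list index is the physical address
        self.refs = {}           # logical address -> physical address
        self.deferred = deque()  # logical addresses awaiting background deduplication

    def _lookup(self, fp, effort):
        # Search only the first `effort` tables.
        for table in self.tables[:effort]:
            if fp in table:
                return table[fp]
        return None

    def write(self, lba, chunk, effort=1):
        """In-line path: search as many tables as the effort attribute allows."""
        fp = hashlib.sha256(chunk).digest()
        phys = self._lookup(fp, effort)
        if phys is None:
            self.media.append(chunk)
            phys = len(self.media) - 1
            self.tables[1].setdefault(fp, phys)  # keep the first copy as canonical
            self.deferred.append(lba)
        self.refs[lba] = phys
        return phys

    def promote(self, fp):
        """Move a fingerprint into the first table (assumed policy hook)."""
        if fp in self.tables[1]:
            self.tables[0][fp] = self.tables[1].pop(fp)

    def background_pass(self):
        """Background path: search all tables and remap duplicates."""
        remapped = 0
        while self.deferred:
            lba = self.deferred.popleft()
            current = self.refs[lba]
            fp = hashlib.sha256(self.media[current]).digest()
            phys = self._lookup(fp, len(self.tables))
            if phys is not None and phys != current:
                self.refs[lba] = phys
                remapped += 1
        return remapped
```

In this sketch, a write issued with `effort=1` consults only the first table, so a duplicate whose fingerprint resides in the second table is stored in full and is replaced by a reference only when `background_pass` later runs.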