This disclosure relates to data processing and storage, and more specifically, to management of a data storage system, such as a flash memory system, to avoid unnecessary hash computations during data deduplication.
NAND flash memory is an electrically programmable and erasable non-volatile memory technology that stores one or more bits of data per memory cell as a charge on the floating gate of a transistor or a similar charge trap structure. In a typical implementation, a NAND flash memory array is organized in blocks (also referred to as “erase blocks”) of physical memory, each of which includes multiple physical pages each in turn containing a multiplicity of memory cells. By virtue of the arrangement of the word and bit lines utilized to access memory cells, flash memory arrays can generally be programmed on a page basis, but are erased on a block basis.
As is known in the art, blocks of NAND flash memory must be erased prior to being programmed with new data. A block of NAND flash memory cells is erased by applying a high positive erase voltage pulse to the p-well bulk area of the selected block and by biasing to ground all of the word lines of the memory cells to be erased. Application of the erase pulse promotes tunneling of electrons off the floating gates of the memory cells biased to ground, giving them a net positive charge and thus transitioning the voltage thresholds of the memory cells toward the erased state. Each erase pulse is generally followed by an erase verify operation that reads the erase block to determine whether the erase operation was successful, for example, by verifying that fewer than a threshold number of memory cells in the erase block have been unsuccessfully erased. In general, erase pulses continue to be applied to the erase block until the erase verify operation succeeds or until a predetermined number of erase pulses have been used (i.e., the erase pulse budget is exhausted).
A NAND flash memory cell can be programmed by applying a positive high program voltage to the word line of the memory cell to be programmed and by applying an intermediate pass voltage to the memory cells in the same string in which programming is to be inhibited. Application of the program voltage causes tunneling of electrons onto the floating gate to change its state from an initial erased state to a programmed state having a net negative charge. Following programming, the programmed page is typically read in a read verify operation to ensure that the program operation was successful, for example, by verifying that fewer than a threshold number of memory cells in the programmed page contain bit errors. In general, program and read verify operations are applied to the page until the read verify operation succeeds or until a predetermined number of programming pulses have been used (i.e., the program pulse budget is exhausted).
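By way of non-limiting illustration, the erase and program procedures described above share a common control pattern: a pulse is applied, a verify operation counts the cells that have not yet reached the target state, and the sequence repeats until verification succeeds or the pulse budget is exhausted. The following Python sketch models only that control flow. The names `pulse_until_verified` and `SimulatedBlock`, and the assumption that each pulse corrects a fixed number of cells, are hypothetical and are not taken from any actual flash controller.

```python
from typing import Callable


def pulse_until_verified(apply_pulse: Callable[[], None],
                         count_failed_cells: Callable[[], int],
                         fail_threshold: int,
                         pulse_budget: int) -> bool:
    """Apply pulses until verify passes or the pulse budget is exhausted.

    Verification passes when fewer than `fail_threshold` cells remain
    outside the target (erased or programmed) state.
    """
    for _ in range(pulse_budget):
        apply_pulse()
        if count_failed_cells() < fail_threshold:
            return True   # verify operation succeeded
    return False          # budget exhausted without a successful verify


class SimulatedBlock:
    """Toy model (assumption): every pulse fixes a fixed number of cells."""

    def __init__(self, failing_cells: int, fixed_per_pulse: int) -> None:
        self.failing_cells = failing_cells
        self.fixed_per_pulse = fixed_per_pulse
        self.pulses_applied = 0

    def pulse(self) -> None:
        self.pulses_applied += 1
        self.failing_cells = max(0, self.failing_cells - self.fixed_per_pulse)

    def failed(self) -> int:
        return self.failing_cells


if __name__ == "__main__":
    block = SimulatedBlock(failing_cells=100, fixed_per_pulse=30)
    ok = pulse_until_verified(block.pulse, block.failed,
                              fail_threshold=20, pulse_budget=8)
    print(ok, block.pulses_applied)  # prints: True 3
```

In this sketch the same function serves both operations; only the pulse and verify callbacks differ between an erase of a block and a program of a page.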
A Cyclic Redundancy Check (CRC) is an error-detecting code commonly used in storage devices to detect accidental changes in data. In a typical implementation, a CRC value is calculated for a data set to be stored, based on the remainder of a polynomial division of the contents of the data set, and is attached to the data set. On retrieval of the data set from a storage device, the CRC value is recalculated, and corrective action can then be taken against presumed data corruption if the stored and recalculated CRC values do not match.
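As a minimal sketch of the attach-and-recheck procedure described above, the following Python example uses the CRC-32 function of the standard `zlib` module. The choice of CRC-32, the four-byte big-endian trailer, and the helper names `attach_crc` and `check_crc` are assumptions made for illustration only; an actual storage device may use a different polynomial and layout.

```python
import zlib

CRC_BYTES = 4  # CRC-32 yields a 32-bit value


def attach_crc(payload: bytes) -> bytes:
    """Return the payload followed by its CRC-32 (big-endian trailer)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(CRC_BYTES, "big")


def check_crc(stored: bytes) -> bytes:
    """Recompute the CRC on retrieval and return the payload if it matches."""
    payload = stored[:-CRC_BYTES]
    stored_crc = int.from_bytes(stored[-CRC_BYTES:], "big")
    if zlib.crc32(payload) != stored_crc:
        # The caller is expected to take corrective action.
        raise ValueError("CRC mismatch: data presumed corrupted")
    return payload


if __name__ == "__main__":
    record = attach_crc(b"data page contents")
    assert check_crc(record) == b"data page contents"

    damaged = bytearray(record)
    damaged[0] ^= 0x01  # flip a single bit of the stored payload
    try:
        check_crc(bytes(damaged))
    except ValueError as err:
        print(err)
```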
In computing, data deduplication is a technique for eliminating duplicate copies of data. Data deduplication is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes transmitted. In the deduplication process, unique chunks of data (e.g., data pages) are identified and stored during a process of analysis. As the analysis continues, other chunks of data are compared to the stored chunks of data and, when a match occurs, the redundant chunk of data is replaced with a reference that points to the stored chunk of data. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency may depend on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. For example, a typical email system may contain one-hundred (100) instances of the same one (1) MB file attachment. Each time the email system is backed up, all one-hundred (100) instances of the attachment may be stored, requiring one-hundred (100) MB of storage space. When data deduplication is implemented, only one instance of the attachment is actually stored and subsequent instances are replaced with references to the stored instance. In general, storage-based data deduplication reduces the amount of storage needed for a given data set.
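The email attachment example above can be illustrated with a small Python sketch in which each chunk is stored once and later occurrences are recorded only as references. The class name `DedupStore`, the use of a SHA-256 digest as the chunk fingerprint, and the treatment of one (1) MB as 1,000,000 bytes are assumptions made for this illustration and do not describe any particular storage system.

```python
import hashlib


class DedupStore:
    """Stores each unique chunk once; duplicates become references."""

    def __init__(self) -> None:
        self.chunks = {}      # fingerprint -> chunk bytes (stored once)
        self.references = []  # one fingerprint per logical write

    def write(self, chunk: bytes) -> bool:
        """Record a chunk; return True if it duplicated a stored chunk."""
        fingerprint = hashlib.sha256(chunk).digest()
        duplicate = fingerprint in self.chunks
        if not duplicate:
            self.chunks[fingerprint] = chunk
        self.references.append(fingerprint)
        return duplicate

    def read(self, index: int) -> bytes:
        """Follow the reference for the index-th logical write."""
        return self.chunks[self.references[index]]

    def physical_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


if __name__ == "__main__":
    attachment = bytes(1_000_000)  # stand-in for a 1 MB attachment
    store = DedupStore()
    for _ in range(100):           # one hundred instances of the attachment
        store.write(attachment)
    print(len(store.references), store.physical_bytes())
    # prints: 100 1000000
```

Under these assumptions, one hundred logical instances occupy the physical space of a single instance plus one small reference per instance.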
In-line data deduplication has conventionally performed hash computations in real time as data enters a storage system. When a storage system receives new data, the storage system determines if the new data corresponds to existing data that is already stored and, if so, the storage system references the existing data and does not store the new data. With background data deduplication, new data is first stored on the storage system and then a background process is initiated at a later point in time to search for duplicate data. A benefit of background data deduplication is that there is no need to wait for hash computation and lookup to be completed before storing incoming data, so that write performance of the storage system is not degraded. A drawback of background data deduplication is that duplicate data is stored, at least temporarily, which may be an issue if a storage system is near full capacity. A benefit of in-line data deduplication over background data deduplication is that in-line data deduplication requires less storage, as data is not duplicated in the storage system. However, given that hash computations and lookups may take a relatively long time to perform, data ingestion with in-line data deduplication can be slower than with background data deduplication, thereby reducing write throughput of a storage system. Storage systems supporting deduplication typically implement one of these two techniques or a combination thereof.
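The difference between the two techniques lies in where the hash computation and lookup occur relative to the write path. The following Python sketch contrasts them under simplifying assumptions: chunks are held in memory, a SHA-256 digest serves as the hash, and the names `InlineStore`, `BackgroundStore`, and `deduplicate` are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()


class InlineStore:
    """Hashes every chunk on the write path, before anything is stored."""

    def __init__(self) -> None:
        self.index = {}     # fingerprint -> physical slot
        self.physical = []  # physical slots, one per unique chunk
        self.logical = []   # logical write order -> physical slot

    def write(self, chunk: bytes) -> None:
        fp = fingerprint(chunk)            # cost paid during ingestion
        slot = self.index.get(fp)
        if slot is None:                   # new data: store it
            slot = len(self.physical)
            self.physical.append(chunk)
            self.index[fp] = slot
        self.logical.append(slot)          # duplicate: reference only


class BackgroundStore:
    """Stores every chunk immediately; duplicates are removed later."""

    def __init__(self) -> None:
        self.physical = {}  # physical slot -> chunk
        self.logical = []   # logical write order -> physical slot

    def write(self, chunk: bytes) -> None:
        slot = len(self.logical)           # no hashing on the write path
        self.physical[slot] = chunk
        self.logical.append(slot)

    def deduplicate(self) -> None:
        """Background pass: hash stored chunks and reclaim duplicates."""
        first_seen = {}                    # fingerprint -> slot to keep
        for i, slot in enumerate(self.logical):
            fp = fingerprint(self.physical[slot])
            keep = first_seen.setdefault(fp, slot)
            if keep != slot:
                self.logical[i] = keep
                self.physical.pop(slot, None)


if __name__ == "__main__":
    inline, background = InlineStore(), BackgroundStore()
    for chunk in (b"a", b"b", b"a"):
        inline.write(chunk)
        background.write(chunk)
    print(len(inline.physical), len(background.physical))  # prints: 2 3
    background.deduplicate()
    print(len(background.physical))                        # prints: 1 less: 2
```

In this sketch the background store temporarily holds three physical chunks where the in-line store holds two, which corresponds to the capacity drawback noted above, while the in-line store performs a hash computation on every write.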
Conventional storage systems have usually performed hash computations during data deduplication that are unnecessary, thereby degrading their performance.