A storage system may include one or more storage devices, such as disks, into which information may be entered, and from which information may be obtained, as desired. A storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize the information stored on the disks as a hierarchical structure of storage containers, such as directories, files and/or aggregates having one or more volumes that hold files and/or logical units (LUNs). For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as actual data for the file. These data blocks may be organized within a physical volume block number (PVBN) space of the aggregate that is maintained by the file system. Each file system block in the PVBN space may have a one-to-one mapping with an on-disk block within a disk block number (DBN) space.
The storage system may typically retain a plurality of copies of the same data (i.e., duplicate data). Duplication of data may occur when, for example, two or more files store common data or when the same data is stored at multiple locations within a file. The storage of such duplicate data increases the total storage space consumed by the storage system and may cause administrators to expand the physical storage space available for use by the system, thereby increasing the cost of maintaining the storage system. As such, data de-duplication techniques may be implemented to save storage space and reduce costs.
A prior approach for data de-duplication may utilize a fingerprint database that is implemented as a flat file storing a list of fingerprints as an array, wherein each element of the array is a fingerprint entry. A fingerprint may be, for example, a hash or checksum value of a fixed-size block of data (e.g., 4 kilobytes). The array may then be utilized to perform data de-duplication operations. Specifically, the fingerprint database may be traversed in its entirety, from beginning to end, and the existing fingerprints stored in the database may be compared with a batch of new fingerprints associated with new blocks of data. A merge-sort operation may then be performed to identify duplicate fingerprints and remove duplicate data.
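By way of illustration only, the flat-file approach described above may be sketched as follows. The sketch is not a definitive implementation: the use of SHA-256 as the fingerprint function, the in-memory list standing in for the flat file, the integer block identifiers, and the function names `fingerprint`, `split_blocks` and `merge_and_find_duplicates` are all assumptions introduced for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size 4 KB blocks, matching the example in the text


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-256 stands in for "a hash or checksum value".
    return hashlib.sha256(block).hexdigest()


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    # Cut the data into consecutive fixed-size blocks.
    return [data[i:i + size] for i in range(0, len(data), size)]


def merge_and_find_duplicates(existing, new):
    """Merge a batch of new fingerprint entries into the existing array.

    existing: list of (fingerprint, block_id), sorted by fingerprint,
              with at most one entry per fingerprint.
    new:      unsorted list of (fingerprint, block_id) for new blocks.
    Returns (merged, duplicates), where duplicates pairs each redundant
    block_id with the block_id of the copy that is kept.
    """
    new_sorted = sorted(new, key=lambda entry: entry[0])  # stable sort
    merged, duplicates = [], []
    i = j = 0
    # Single pass over BOTH arrays from beginning to end.
    while i < len(existing) or j < len(new_sorted):
        take_existing = j >= len(new_sorted) or (
            i < len(existing) and existing[i][0] <= new_sorted[j][0]
        )
        if take_existing:
            entry = existing[i]
            i += 1
        else:
            entry = new_sorted[j]
            j += 1
        if merged and merged[-1][0] == entry[0]:
            # Same fingerprint as the entry just kept: a duplicate block.
            duplicates.append((entry[1], merged[-1][1]))
        else:
            merged.append(entry)
    return merged, duplicates


# Three blocks, the first and third of which hold identical data.
data = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
batch = [(fingerprint(b), n) for n, b in enumerate(split_blocks(data))]
database, dups = merge_and_find_duplicates([], batch)
print(dups)  # [(2, 0)]: block 2 duplicates block 0
```

In this sketch, ties between an existing entry and a new entry are resolved in favor of the existing entry, so that the block already recorded in the database is the copy that is retained. A matching fingerprint would typically be treated as a candidate duplicate only; whether a byte-wise comparison follows is not addressed here.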
A disadvantage associated with the prior approach is that there may be substantial overhead (e.g., reading and writing) associated with performing the de-duplication operations. That is, for each de-duplication operation, the entire existing fingerprint database may be read from beginning to end, and at the completion of the de-duplication operation, the entire fingerprint database (e.g., the flat file) may be overwritten. Additionally, since the database is embodied as a flat file, there may be no means to facilitate lookup operations within the file.
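The overhead described above may be made concrete with a small accounting sketch. It is illustrative only: the class `FlatFingerprintStore`, its counters, and the use of integers in place of real fingerprints are hypothetical, and the linear scan in `contains` is merely one reading of what the absence of a lookup facility may imply.

```python
class FlatFingerprintStore:
    """Hypothetical in-memory stand-in for a flat-file fingerprint database."""

    def __init__(self):
        self.entries = []          # sorted array of fingerprints
        self.entries_read = 0      # total entries read so far
        self.entries_written = 0   # total entries written so far

    def apply_batch(self, batch):
        # Every de-duplication operation reads the whole file...
        old = list(self.entries)
        self.entries_read += len(old)
        merged = sorted(set(old) | set(batch))
        # ...and overwrites the whole file, however small the batch.
        self.entries = merged
        self.entries_written += len(merged)

    def contains(self, fp):
        # No index is available, so a lookup scans from the beginning.
        for position, entry in enumerate(self.entries, start=1):
            if entry == fp:
                self.entries_read += position
                return True
        self.entries_read += len(self.entries)
        return False


store = FlatFingerprintStore()
store.apply_batch(range(1000))  # initial load: 0 read, 1000 written
store.apply_batch([1000])       # one new entry: 1000 read, 1001 written
print(store.entries_read, store.entries_written)  # 1000 2001
```

Under these assumptions, adding a single fingerprint to a database of 1000 entries costs a read of all 1000 entries and a write of all 1001, so the cost of each operation grows with the size of the database rather than with the size of the batch.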