Field of the Invention
The present invention relates to storage systems and, more specifically, to a technique for efficiently reducing duplicate data in a storage system.
Background Information
A storage system typically includes one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system may logically organize the information stored on the devices as storage containers, such as logical units or files. These storage containers may be accessed by a host system using a protocol over a network connecting the storage system to the host.
The storage system may retain a plurality of copies of the same data (e.g., duplicate data). Duplication of data may occur when, for example, two or more files store common data or where data is stored at multiple locations within a file. The storage of such duplicate data increases the total storage space consumed by the storage system and may cause administrators to expand the physical storage space available for use by the system, thereby increasing the cost of maintaining the storage system. As such, data deduplication techniques may be implemented to save storage space and reduce costs.
A prior technique for data deduplication utilizes a fingerprint database that is implemented as a flat file storing a list of fingerprints as an array, wherein each element of the array is a fingerprint entry. A fingerprint may be, for example, a hash or checksum value of a fixed-size block of data (e.g., 4 kilobytes). The array is then utilized to perform data deduplication operations. Specifically, the entire fingerprint database is traversed, from beginning to end, and the existing fingerprints stored in the database are compared with a batch of new fingerprints associated with new blocks of data. A merge-sort operation may then be performed to identify duplicate fingerprints and remove duplicate data.
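By way of illustration only, the prior technique described above may be sketched as follows. The sketch assumes SHA-256 as the fingerprint function, represents each fingerprint entry as a (fingerprint, block identifier) pair, and treats a fingerprint match as a duplicate without a byte-by-byte comparison; the names `fingerprint` and `merge_dedupe` are hypothetical and do not appear in any particular implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size block of 4 kilobytes, per the example above


def fingerprint(block: bytes) -> bytes:
    """Return a fingerprint for one block (SHA-256 is an assumption)."""
    return hashlib.sha256(block).digest()


def merge_dedupe(existing, new):
    """Merge a batch of new fingerprint entries into the existing array.

    existing: list of (fingerprint, block_id), sorted, fingerprints unique.
    new:      list of (fingerprint, block_id), in any order.
    Returns (merged, duplicates), where duplicates is a list of
    (duplicate_block_id, retained_block_id) pairs.
    """
    new_sorted = sorted(new)
    merged, duplicates = [], []
    i = j = 0
    # Every existing entry is visited, however small the new batch is.
    while i < len(existing) or j < len(new_sorted):
        take_existing = j == len(new_sorted) or (
            i < len(existing) and existing[i][0] <= new_sorted[j][0]
        )
        if take_existing:
            candidate = existing[i]
            i += 1
        else:
            candidate = new_sorted[j]
            j += 1
        if merged and merged[-1][0] == candidate[0]:
            # Same fingerprint as the entry just emitted: record a duplicate.
            duplicates.append((candidate[1], merged[-1][1]))
        else:
            merged.append(candidate)
    return merged, duplicates
```

On ties the existing entry is emitted first, so a block already recorded in the database is retained and the newly written block is reported as the duplicate.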
A disadvantage associated with the above technique is that there may be substantial overhead (e.g., reading and writing) associated with performing the deduplication operations. Specifically, for each deduplication operation, the entire existing fingerprint database is read from beginning to end and, at the completion of the deduplication operations, the entire fingerprint database (i.e., the flat file) is overwritten. Additionally, since the database is embodied as a flat file, there is typically no means (e.g., no data structure) to facilitate lookup operations within the file. Therefore, the deduplication operations are essentially performed without any means of indexing into the fingerprint database.
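The overhead may be illustrated with a small sketch that counts the bytes read and written by one deduplication pass over a flat file. The sketch assumes 32-byte fingerprints and models the flat file as an in-memory byte string; a sorted set union stands in for the merge-sort operation, and the names `dedupe_pass` and `fake_fp` are hypothetical.

```python
import hashlib

FP_SIZE = 32  # assumed fingerprint size in bytes (SHA-256 digest length)


def fake_fp(n: int) -> bytes:
    """Produce a distinct 32-byte stand-in fingerprint for illustration."""
    return hashlib.sha256(str(n).encode()).digest()


def dedupe_pass(db: bytes, batch: list[bytes]) -> tuple[bytes, int, int]:
    """Run one pass over the flat file; return (new_db, bytes_read, bytes_written)."""
    # With no index, the whole file is read regardless of the batch size.
    existing = [db[k:k + FP_SIZE] for k in range(0, len(db), FP_SIZE)]
    bytes_read = len(db)
    # A sorted set union stands in for the merge-sort step.
    merged = sorted(set(existing) | set(batch))
    # The whole file is then overwritten with the merged array.
    new_db = b"".join(merged)
    return new_db, bytes_read, len(new_db)


# A database of 1000 fingerprints and a batch containing one new fingerprint.
db = b"".join(sorted(fake_fp(n) for n in range(1000)))
new_db, bytes_read, bytes_written = dedupe_pass(db, [fake_fp(1000)])
print(bytes_read, bytes_written)  # 32000 32032
```

In this sketch, adding a single 32-byte fingerprint causes the full 32,000-byte file to be read and a 32,032-byte file to be written, so the cost of a pass tracks the size of the database rather than the size of the batch.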