Data deduplication is an intelligent compression technique that reduces storage costs by eliminating duplicate copies of data, and it may be used to improve storage utilization. During a deduplication process, unique segments of data are identified and stored on disk. A hashing function generates a checksum for each unique segment of data, and the checksums are stored in a table. The checksum table is referred to herein as a dictionary or dedup dictionary. Before data is written to disk, the dedup dictionary is consulted to determine whether a duplicate of the data to be written already exists.
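The write path described above can be sketched as follows. This is a minimal illustrative sketch, not an implementation from the source: the segment size, the use of SHA-256 as the hashing function, and the function and variable names are all assumptions made for the example.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed fixed segment size for illustration


def write_with_dedup(data: bytes, dictionary: dict, storage: list) -> int:
    """Write data in segments, skipping segments whose checksum is already
    in the dedup dictionary. Returns the number of new segments written."""
    written = 0
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        # Hashing function generates a checksum for the segment.
        checksum = hashlib.sha256(segment).hexdigest()
        # Consult the dedup dictionary before writing to "disk".
        if checksum not in dictionary:
            dictionary[checksum] = len(storage)  # record segment location
            storage.append(segment)              # store the unique segment
            written += 1
    return written
```

Writing the same segment twice results in only one stored copy, since the second write finds the checksum already present in the dictionary.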
Several techniques and optimizations may be used to maintain a dedup dictionary. One conventional technique preallocates the amount of memory used by the dictionary. A disadvantage of this technique is that lookups into the dictionary are limited to the amount of memory that has been reserved, and as a data set grows, the dictionary may exceed the reserved memory. Another conventional technique uses flash memory for data deduplication. This technique, however, may require the deduplication logic to perform several input/output (I/O) operations to the flash memory to determine whether the dedup dictionary contains a duplicate key, and a central processing unit (CPU) is needed to generate the hash. Both conventional techniques add latency to I/O operations in the form of multiple reads and writes even when there are no collisions. If the data has never been written to disk, then determining the hash and writing the data to disk involve additional I/O operations.
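The preallocation drawback can be illustrated with a hypothetical sketch of a dictionary capped at a reserved number of entries; the class and method names are assumptions made for the example, not from the source.

```python
class PreallocatedDedupDictionary:
    """Hypothetical sketch of a dedup dictionary with preallocated capacity.
    Once the reserved capacity is exhausted, new segments can no longer be
    tracked, so later duplicates of those segments go undetected."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries  # models the reserved memory
        self.table = {}

    def lookup(self, checksum: str):
        """Return the stored location for a checksum, or None if absent."""
        return self.table.get(checksum)

    def insert(self, checksum: str, location: int) -> bool:
        """Record a checksum; return False if the reservation is full."""
        if checksum in self.table:
            return True
        if len(self.table) >= self.max_entries:
            return False  # data set has outgrown the reserved memory
        self.table[checksum] = location
        return True
```

Once `insert` starts returning `False`, segments written from that point on are treated as unique even when they are not, which is the limitation the passage describes.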