Some companies rely on data deduplication techniques to reduce the storage and bandwidth costs associated with data. For example, in some instances a user may desire to store a copy of a file to a second location on their computer. Conventionally, this may cause a second copy of the file to be created and stored on the user's computer. Instead of storing the file twice, a reference may be stored pointing to the original file, thereby reducing the amount of data stored on the computer. Additionally, some computers may be able to reduce bandwidth usage using data deduplication techniques. If a file that a computer has been instructed to acquire from an external device already exists on the computer, the computer may determine that it does not need to download a new copy of the file. Although the examples above refer to whole files, conventional data deduplication systems typically operate on chunks of data that are smaller than a file.
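The idea of storing a reference instead of a second copy can be sketched as follows. This is a minimal illustration rather than any particular system's design: the `ChunkStore` class, the 4096-byte fixed chunk size, and the use of SHA-256 to identify chunks are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes (hypothetical choice)


class ChunkStore:
    """Hypothetical store that keeps each unique chunk only once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (one entry per unique chunk)
        self.files = {}   # file name -> ordered list of digests (references)

    def put(self, name, data):
        """Store data under name, adding only chunks not already present."""
        refs = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            # setdefault leaves an existing chunk untouched, so a duplicate
            # chunk costs only the reference appended below.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        """Rebuild a file by following its references in order."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        """Total bytes of unique chunk data actually held."""
        return sum(len(c) for c in self.chunks.values())
```

Storing the same 8192 bytes under two names in this sketch leaves `stored_bytes()` at 8192, because the second `put` adds only references.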
Conventional data deduplication logics sometimes rely on hashes of blocks of data to distinguish blocks of data instead of directly comparing entire blocks of data. Some conventional hashing algorithms ensure that it is very uncommon for two blocks of data to hash to the same value. For example, a 128 bit hash could map blocks of data to up to 2^128, or over 3.4×10^38, different values. This may allow a data deduplication logic to index a large number of blocks of data using hashes without having to worry about collisions. In the above 128 bit hash example, depending on the hashing algorithm, a collision where two blocks of data have hashed to the same value is about 50% likely to have occurred once 2.2×10^19 different blocks of data have been hashed. This means that when the hashes for two blocks of data match, it is very likely that the two blocks of data contain the same data.
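The 50% figure for a 128 bit hash follows from the standard birthday-bound approximation, which can be checked with a short calculation. The sketch below assumes an idealized hash whose outputs are uniformly distributed; the function names are hypothetical and introduced only for this example.

```python
import math


def blocks_for_collision_probability(hash_bits, probability):
    """Approximate number of distinct blocks hashed before a collision has
    occurred with the given probability, assuming uniform hash outputs."""
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n^2 / (2 * space)), solved for n.
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - probability)))


def collision_probability(hash_bits, n_blocks):
    """Approximate probability of at least one collision among n_blocks."""
    space = 2.0 ** hash_bits
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-(n_blocks * n_blocks) / (2.0 * space))


n_half = blocks_for_collision_probability(128, 0.5)
print(f"{2.0 ** 128:.2e} possible values, 50% collision near {n_half:.2e} blocks")
```

Under the same approximation, `collision_probability(128, 1e12)` is vanishingly small, which is why an index keyed on such hashes can usually treat matching hashes as matching data.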
To match hashes to actual blocks of data in memory, some conventional data deduplication techniques employ an index to point from hashes to locations in memory that contain the blocks of data from which the hashes were generated. Depending on the number of blocks of data that have been indexed, the index may be very large. In some cases the index may be stored on the same local storage device (e.g., a hard disk) as the indexed data. However, data stored in a computer's local storage device takes longer to retrieve than data stored in the computer's random access memory (RAM), which makes accessing the index a relatively slow operation. At the same time, because RAM typically holds much less data than a local storage device, it is sometimes difficult to store the entire index in RAM. This may result in scalability, processing, and retention limitations when some data deduplication techniques are employed using a limited amount of RAM.
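A hash-to-location index of this kind, and the way its size grows with the amount of indexed data, can be sketched as follows. This is an illustration under stated assumptions, not a description of any specific system: the `BlockIndex` class and `index_size_bytes` helper are hypothetical, a `bytearray` stands in for the local storage device, and a 128 bit BLAKE2b digest is used only to match the 128 bit example above.

```python
from hashlib import blake2b


class BlockIndex:
    """Hypothetical index from 128 bit hashes to block locations."""

    def __init__(self):
        self.storage = bytearray()  # stands in for the local storage device
        self.index = {}             # digest -> (offset, length) in storage

    def add(self, block):
        """Record a block, writing its data only if the hash is new."""
        digest = blake2b(block, digest_size=16).digest()
        if digest not in self.index:
            # Assumption: a matching hash is treated as matching data,
            # so the block contents are not compared byte by byte.
            self.index[digest] = (len(self.storage), len(block))
            self.storage += block
        return digest

    def read(self, digest):
        """Follow the index entry to the stored block."""
        offset, length = self.index[digest]
        return bytes(self.storage[offset:offset + length])


def index_size_bytes(data_bytes, block_size, entry_bytes):
    """Estimate index size if every block is unique (worst case)."""
    n_blocks = -(-data_bytes // block_size)  # ceiling division
    return n_blocks * entry_bytes
```

With the assumed figures of 1 TiB of unique data, 4 KiB blocks, and 32 bytes per index entry, `index_size_bytes(2**40, 4096, 32)` gives 8 GiB, which illustrates how the index can outgrow the RAM available on a typical machine.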