An enterprise (such as a company, educational organization, or government agency) can maintain one or more storage servers that store various types of data objects, such as text files, image files, video files, and audio files. The storage server(s) of the enterprise can hold large amounts of duplicative data, which wastes storage capacity.
In one example, duplicative data can result when files are changed repeatedly and each changed file is maintained as a separate version in the one or more storage servers. Although the different versions of a file are not identical, they still have much of their data in common.
A technique that has been used to reduce storage of duplicative data is to divide data objects into chunks, with a mechanism provided to ensure that duplicative chunks are not stored more than once. In the above example, the chunks shared by the different versions of a file can be stored just once, instead of once for each version.
An index of keys associated with the data chunks can be maintained to track whether a particular data chunk has already been stored in the storage system. The keys of the index can be hashes computed based on the data chunks. If a particular key is present in the index, then the corresponding data chunk has, with high probability, already been stored in the storage system.
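The chunking and index lookup described above can be sketched as follows. This is a minimal illustration rather than any particular system's implementation: the fixed-size chunking, the 4-byte chunk size, the choice of SHA-256 as the hash, and the names `split_into_chunks` and `ChunkStore` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # assumed, unrealistically small so the example is easy to follow


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Divide a data object into fixed-size chunks (an assumed chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical store that keeps each distinct chunk only once."""

    def __init__(self):
        self.index = {}   # key (hash of chunk) -> position of the chunk in self.chunks
        self.chunks = []  # stored chunk data

    def put(self, data: bytes) -> list:
        """Store a data object; return the list of keys needed to rebuild it."""
        recipe = []
        for chunk in split_into_chunks(data):
            key = hashlib.sha256(chunk).hexdigest()
            # A key already in the index is taken to mean the chunk is already
            # stored. This holds only with high probability, because two
            # different chunks could in principle hash to the same key.
            if key not in self.index:
                self.index[key] = len(self.chunks)
                self.chunks.append(chunk)
            recipe.append(key)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild a data object from its list of keys."""
        return b"".join(self.chunks[self.index[key]] for key in recipe)


store = ChunkStore()
version1 = store.put(b"AAAABBBBCCCC")
version2 = store.put(b"AAAABBBBDDDD")  # differs from version1 only in its last chunk
print(len(store.chunks))  # number of distinct chunks actually stored
```

The two versions together contain six chunks, but the first two chunks of each are identical, so only four distinct chunks are stored, and both versions can still be rebuilt from their key lists.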
An issue associated with maintaining an index is that, as the index becomes very large, it may no longer fit in memory. As a result, part of the index has to be stored in slower secondary storage, which can result in thrashing, in which parts of the index are repeatedly swapped between the memory and the secondary storage. Thrashing can slow down performance of the storage server(s).
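A rough calculation shows how quickly a full chunk index can outgrow memory. All of the figures below (100 TiB of stored data, 8 KiB average chunk size, 40 bytes per index entry) are assumptions chosen only for illustration, and `index_size_bytes` is a hypothetical helper.

```python
TIB = 2 ** 40
GIB = 2 ** 30


def index_size_bytes(total_data_bytes: int, avg_chunk_bytes: int, entry_bytes: int) -> int:
    """Estimate index size as one fixed-size entry per stored chunk."""
    num_chunks = total_data_bytes // avg_chunk_bytes
    return num_chunks * entry_bytes


# Assumed figures: 100 TiB of unique data, 8 KiB chunks, 40-byte index entries
# (for example, a 32-byte hash plus an 8-byte chunk location).
size = index_size_bytes(100 * TIB, 8 * 1024, 40)
print(size / GIB)  # prints 500.0
```

Under these assumed figures the index alone would occupy about 500 GiB, more memory than many servers have, which is why part of it would spill to secondary storage.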