This description relates to managing and indexing deduplicated data.
Some data storage systems are configured to include a deduplication function that is used to reduce the amount of storage capacity that is needed to store received data (e.g., data to be stored in the data storage system). In some implementations, deduplication works by segmenting received data into segments (also called “chunks”) of data that are identified in an index by a value, such as a cryptographic hash value. A form of data compression can be achieved by preventing duplicate segments from being stored when the data is being stored in the data storage system. For example, a given file (made up of one or more segments) that has already been stored (e.g., an email attachment attached to multiple emails in an email storage system) can simply be replaced with a reference to the previously stored file if the previously stored file has the same segments. Alternatively, a given segment within a given file that is the same as another segment in the given file or another file (e.g., a portion of document within a ZIP archive that is also stored in another ZIP archive) can be replaced with a reference to the duplicate segment.