In a data storage system, a large portion of the data stored is often repetitive data. Repetitive data is data which is potentially unnecessarily duplicated within the data storage system. Consider an example in which an electronic message (“e-mail”) is sent to 100 recipients: it may be stored 100 times in a data storage system. All but the first instance of this e-mail constitute some amount of repetition. In another example, multiple copies of slightly different versions of a word processing document are stored in a data storage system. A large portion of each of the documents is likely to constitute repetition of data stored in conjunction with one or more of the other instances of the word processing document.
De-duplication is sometimes used to reduce the amount of repetitive data stored in a data storage system. De-duplication often involves hashing data segments to identify duplicate data segments, then replacing an identified duplicate data segment with a smaller reference such as a pointer, code, dictionary count, or the like, which references a data segment, pointer, or the like stored in or referenced by a de-duplication library or index. In this manner, typically one copy of a duplicated data segment is saved and indexed as a reference, thus allowing other instances of the data segment to be deleted and replaced with a reference or pointer to the indexed data segment. By removing duplicated data in this fashion, storage efficiency can be improved and considerable space can be freed up within a data storage system.
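By way of illustration only, the hash-and-reference approach described above can be sketched as follows. The sketch assumes fixed-size segments, SHA-256 as the hash, and an in-memory dictionary standing in for the de-duplication index; the names `DedupStore`, `write`, `read`, and `refcount` are hypothetical and are not drawn from any particular system.

```python
import hashlib


class DedupStore:
    """Toy in-memory de-duplicating store (illustrative assumption, not a real system)."""

    def __init__(self, segment_size=8):
        self.segment_size = segment_size
        self.index = {}     # digest -> the single stored copy of a segment
        self.refcount = {}  # digest -> number of references to that segment

    def write(self, data: bytes):
        """Split data into fixed-size segments and return a list of references (digests)."""
        refs = []
        for i in range(0, len(data), self.segment_size):
            segment = data[i:i + self.segment_size]
            digest = hashlib.sha256(segment).hexdigest()
            if digest not in self.index:
                # First time this segment is seen: keep exactly one copy.
                self.index[digest] = segment
            # Duplicate or not, the caller only holds a reference.
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            refs.append(digest)
        return refs

    def read(self, refs):
        """Rebuild the original data by following each reference into the index."""
        return b"".join(self.index[d] for d in refs)


store = DedupStore(segment_size=4)
refs = store.write(b"AAAABBBBAAAA")  # the segment b"AAAA" occurs twice
restored = store.read(refs)
```

In this sketch the repeated segment is stored once and referenced twice, so the index holds two segments for three segment-sized writes. A practical system would also need to address hash collisions, variable-size segmentation, and persistence, none of which are shown here.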
However, if an indexed data segment becomes corrupted, for example due to a media failure, the impact of the corruption is not typically limited to the single corrupt data segment. Instead, the scope of the problems caused by the corruption is multiplied by the number of times that the data segment has been referenced to de-duplicate data segments elsewhere in the data storage system. For example, it is possible for a heavily used or popular data segment to be present in, and thus de-duplicated from, thousands or millions of locations within a data storage system. In such a case, all of the thousands or millions of storage locations which were de-duplicated would become corrupt if the data segment which was referenced to de-duplicate those locations became corrupted.
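The multiplying effect of a single corrupted indexed segment can be illustrated with a small sketch. It assumes an index keyed by the SHA-256 digest of each segment and files represented as lists of such digests; the helper `corrupted_locations`, the file names, and the segment contents are all hypothetical and chosen only for illustration.

```python
import hashlib


def digest(segment: bytes) -> str:
    """Hash used as the reference to a stored segment (SHA-256 is an assumption)."""
    return hashlib.sha256(segment).hexdigest()


def corrupted_locations(index, files):
    """Return (file name, position) pairs whose referenced segment no longer matches its digest."""
    bad = {d for d, seg in index.items() if digest(seg) != d}
    return [
        (name, pos)
        for name, refs in files.items()
        for pos, d in enumerate(refs)
        if d in bad
    ]


popular = b"common header"
unique = b"unique body"
index = {digest(popular): popular, digest(unique): unique}

# 1000 hypothetical files all reference the same popular segment.
files = {f"file{i}": [digest(popular)] for i in range(1000)}
files["file0"].append(digest(unique))

# Simulate a media failure that alters the single stored copy.
index[digest(popular)] = b"common hexder"

affected = corrupted_locations(index, files)
```

Here one damaged segment in the index leaves every one of the 1000 referencing locations unreadable, while the location that references the undamaged segment is unaffected.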