In the field of data storage, if a single storage device fails or a portion of its data is damaged, the lost or damaged data is often unrecoverable. A common approach to increasing the robustness of data storage is to store data in a redundant data storage system, such as a redundant array of independent disks (RAID). Many standard and proprietary RAID architectures, typically hardware implementations, are known. Most RAID implementations distribute the storage of data across multiple disks and also store some type of redundancy data which assists in the recovery of lost or damaged data. Thus, a RAID implementation dedicates a certain amount of overhead storage space to redundancy data, even though that space could otherwise store additional data.
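As a minimal illustration of how redundancy data can assist recovery, the following sketch assumes bytewise XOR parity computed across equal-length blocks, one block per disk. The text above does not specify any particular redundancy scheme, so XOR parity is an assumption chosen for simplicity, and the names `xor_blocks` and `rebuild_missing` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Rebuild one lost block from the surviving blocks and the parity block."""
    return xor_blocks(surviving_blocks + [parity])


# Three data blocks, one per (hypothetical) disk, plus one parity block.
data_blocks = [b"disk0blk", b"disk1blk", b"disk2blk"]
parity = xor_blocks(data_blocks)

# Simulate the loss of the second disk and rebuild its block.
recovered = rebuild_missing([data_blocks[0], data_blocks[2]], parity)
```

In this sketch one block of overhead is stored for every three blocks of data, which illustrates the trade-off noted above: the parity block occupies space that could otherwise hold additional data.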
Often a large portion of data stored in a redundant data storage system, such as a RAID, is repetitive data. Repetitive data is different from redundant data stored for data protection purposes. Consider an example where an electronic message (“e-mail”) is sent to 100 recipients; the message may be stored 100 times in a data storage system. All but the first instance of this e-mail constitute some amount of repetition. In another example, multiple copies of slightly different versions of a word processing document are stored in a data storage system. A large portion of each of the documents is likely to constitute repetition of data stored in conjunction with one or more of the other instances of the word processing document.
Data de-duplication is sometimes used to reduce the amount of repetitive data stored in a data storage system. Presently, most data de-duplication is performed in software which executes on the processor of a computer which uses or is coupled with a data storage system. De-duplication often involves hashing data segments to identify duplicate data segments, then replacing an identified duplicate data segment with a smaller reference, such as a pointer, code, dictionary count, or the like, which references a data segment, pointer, or the like stored in a de-duplication library. Thus, performing data de-duplication requires overhead space dedicated to some sort of de-duplication library. When data de-duplication is performed on a large data set or on very short data segments, both a large library and a heavy processing burden can result. It is even possible, in some situations, for the space dedicated to the library to exceed the space saved by de-duplication.
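The hashing approach described above can be sketched as follows. This is an illustrative sketch only, not any particular product's method: it assumes fixed-size segments, SHA-256 digests used as the references, and an in-memory dictionary standing in for the de-duplication library. The names `deduplicate`, `restore`, and `stored_size` are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, segment_size: int):
    """Split data into fixed-size segments and keep one copy of each unique segment."""
    library = {}      # digest -> segment bytes (the de-duplication library)
    references = []   # one digest per segment, in original order
    for start in range(0, len(data), segment_size):
        segment = data[start:start + segment_size]
        digest = hashlib.sha256(segment).digest()
        library.setdefault(digest, segment)  # store only the first instance
        references.append(digest)
    return library, references


def restore(library, references) -> bytes:
    """Reassemble the original data by following each reference into the library."""
    return b"".join(library[digest] for digest in references)


def stored_size(library, references) -> int:
    """Bytes held after de-duplication: library entries plus all references."""
    library_bytes = sum(len(d) + len(s) for d, s in library.items())
    reference_bytes = sum(len(d) for d in references)
    return library_bytes + reference_bytes


data = b"ABCDEFGH" * 100  # 800 bytes of highly repetitive data

# Very short segments: every 8-byte segment is replaced by a 32-byte digest.
short_lib, short_refs = deduplicate(data, segment_size=8)

# Longer segments: few references are needed, so the overhead stays small.
long_lib, long_refs = deduplicate(data, segment_size=400)
```

Under these assumptions, comparing `stored_size(short_lib, short_refs)` and `stored_size(long_lib, long_refs)` against `len(data)` shows the effect described above: with very short segments the references and library take more space than the original data, while longer segments yield a net saving.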