As computers generate more and more data, the need for efficient and reliable data backup and storage systems is increasing. A variety of systems exist today, using both local and network storage for backup.
In some existing data storage systems, a data stream is segmented and then stored. Each data segment is identified by a content-derived identifier, that is, an identifier computed from the data in the segment. Examples of content-derived identifiers include fingerprints generated by hash algorithms and checksums calculated over the data segments. The identifiers are stored and used to retrieve the data segments when the original data stream is reconstructed. Since identical data segments yield the same identifier, a duplicate data segment can be detected and prevented from being stored again; instead, only the content-derived identifier is stored each time a duplicate data segment appears, which makes the system more space-efficient.
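The scheme described above can be illustrated with a minimal sketch. It is not taken from any particular system: the fixed-size segmentation, the choice of SHA-256 as the fingerprint, and the names `split_into_segments`, `store_stream`, and `reconstruct` are all assumptions made for illustration.

```python
import hashlib


def split_into_segments(stream: bytes, size: int) -> list[bytes]:
    """Cut the stream into fixed-size segments (an assumed, simplistic policy)."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


def store_stream(stream: bytes, segment_store: dict[bytes, bytes], size: int = 4) -> list[bytes]:
    """Store each unique segment once; return the ordered list of identifiers."""
    recipe = []
    for seg in split_into_segments(stream, size):
        # Content-derived identifier: a hash of the segment's own bytes.
        ident = hashlib.sha256(seg).digest()
        # A duplicate segment maps to an existing identifier and is not stored again.
        segment_store.setdefault(ident, seg)
        # The identifier itself is recorded for every occurrence, duplicate or not.
        recipe.append(ident)
    return recipe


def reconstruct(recipe: list[bytes], segment_store: dict[bytes, bytes]) -> bytes:
    """Rebuild the original stream by looking up each identifier in order."""
    return b"".join(segment_store[ident] for ident in recipe)


segment_store: dict[bytes, bytes] = {}
data = b"ABCDABCDEFGHABCD"
recipe = store_stream(data, segment_store, size=4)
restored = reconstruct(recipe, segment_store)
```

In this example the stream contains four segments but only two distinct ones, so two segments are stored while four identifiers are recorded. The identifier list is the part whose size grows with every duplicate.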
Because the content-derived identifiers usually have fewer bits than the corresponding data segments, it is possible for two non-identical data segments to have the same identifier, resulting in a collision that may corrupt the data stream. The probability of collision can be lowered by generating identifiers that span a wider range of values, that is, identifiers with more bits. Thus, it is quite common for systems to use identifiers that are 16–24 bytes or even longer. Since the same identifier is stored every time an identical data segment recurs in the data stream, and each stored copy takes up space, the system's storage efficiency is typically lower when longer identifiers are used. It would be desirable to have a data storage technique that utilizes the available storage space more efficiently, without compromising the accuracy or speed of the system.
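The trade-off between collision probability and identifier length can be made concrete with a rough estimate. The sketch below uses the standard birthday approximation and assumes identifiers are uniformly distributed over all possible values, which real hash functions only approximate. The function names and the example figure of 2**40 segments are illustrative assumptions, not values from any described system.

```python
import math


def collision_probability(num_segments: int, id_bits: int) -> float:
    """Birthday approximation: p ~ 1 - exp(-n(n-1) / 2^(b+1)).

    Assumes identifiers are uniformly distributed over 2^b values.
    expm1 keeps precision when the probability is extremely small.
    """
    exponent = num_segments * (num_segments - 1) / 2 ** (id_bits + 1)
    return -math.expm1(-exponent)


def identifier_bytes(num_references: int, id_bytes: int) -> int:
    """Space taken by identifiers when one is stored per segment occurrence."""
    return num_references * id_bytes


n = 2 ** 40  # hypothetical number of distinct segments
p_16_bytes = collision_probability(n, 16 * 8)
p_24_bytes = collision_probability(n, 24 * 8)

# Cost side of the trade-off: identifier space for the same number of references.
space_16 = identifier_bytes(n, 16)
space_24 = identifier_bytes(n, 24)
```

Under these assumptions, moving from 16-byte to 24-byte identifiers lowers the estimated collision probability by many orders of magnitude, while the space consumed by stored identifiers grows by half.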