Data is often stored in storage systems that are accessed via a network. Network-accessible storage systems allow potentially many different client devices to share the same set of storage resources. A network-accessible storage system can perform various operations that render storage more convenient, efficient, and secure. For instance, a network-accessible storage system can receive and retain potentially many versions of backup data for files stored at a client device. As well, a network-accessible storage system can serve as a shared file repository for making a file or files available to more than one client device.
Some data storage systems may perform operations related to data deduplication. In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Deduplication techniques may be used to improve storage utilization or network data transfers by effectively reducing the number of bytes that must be sent or stored. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and a redundant chunk may be replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times, the amount of data that must be stored or transferred can be greatly reduced. The match frequency may depend at least in part on the chunk size. Different storage systems may employ different chunk sizes or may support variable chunk sizes.
Deduplication differs from standard file compression techniques. While standard file compression techniques typically identify short repeated substrings inside individual files, storage-based data deduplication involves inspecting potentially large volumes of data and identify potentially large sections—such as entire files or large sections of files—that are identical, in order to store only one copy of a duplicate section. In some instances, this copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain many instances of the same one megabyte (MB) file attachment. In conventional backup systems, each time the system is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, the storage space required may be limited to only one instance of the attachment. Subsequent instances may be referenced back to the saved copy for deduplication ratio of roughly 100 to 1.