A storage server can include one or more storage devices that enable users to store and retrieve information. A storage server can also include a storage operating system that functionally organizes the system by invoking storage operations in support of a storage service implemented by the system. Storage servers may be implemented using various storage architectures, such as network-attached storage (NAS), a storage area network (SAN), or a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, but can be other types of devices, such as solid-state drives or flash memory.
In many industries, such as banking, government contracting, and securities, selected data must be stored in an immutable manner for long periods of time. Typically, storage servers use data backup (e.g., to electronic tape media) to ensure that the data is protected in the event of a hardware failure. Tape backup has several disadvantages, including slow data access and, often, the requirement that the backup administrator manage a large number of physical tapes. As an alternative, several storage server vendors provide virtual tape library (VTL) systems that emulate tape storage devices using multiple disk drives. In a typical VTL environment, the primary storage server performs a complete backup of its file system (or other data store) to the VTL system. Often, the data being backed up changes very little between backups, so successive complete backups store largely the same data. This duplication can waste significant amounts of storage space. Some VTL systems also perform replication, in which the data being backed up is mirrored to a remote storage server rather than stored on a local storage device. For these systems, the duplicated data is unnecessarily mirrored to the remote system, wasting network resources.
Existing techniques for reducing data duplication (“de-duplication”) have significant disadvantages. In general, de-duplication is performed by detecting blocks of data that are repeated within a single backup or across multiple backups stored by the data system. For any specific sequence of data, the VTL system can replace other instances of the same data with a reference to a single copy of that data. The single copy may be located within the backup or stored separately in a database. This technique may be used to reduce the size of the backup before it is stored on disk or replicated to a separate mirror server.
A key challenge for de-duplication is detecting duplicated blocks of data. Systems cannot simply compare every possible pair of data blocks byte by byte, because the number of comparisons would be extremely large. To reduce the complexity to a manageable level, some backup systems use data “fingerprints,” or hashes, to reduce the amount of data to be compared. A data fingerprint is a compact value (e.g., a bit string) generated from an arbitrarily large data set using a fingerprinting algorithm. The fingerprinting algorithm can be, for example, a hashing algorithm such as SHA-1, SHA-256, or SHA-512. If two data sets are different, the fingerprinting algorithm will produce different fingerprints, except in the vanishingly unlikely event of a hash collision; matching fingerprints therefore indicate duplicated data with very high confidence.
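The fingerprint-based de-duplication described above can be sketched as follows. This is an illustrative sketch, not the method of any particular VTL product: the function names and the use of a Python dictionary as the block store are assumptions made for the example, with SHA-256 (one of the algorithms mentioned above) as the fingerprinting algorithm.

```python
import hashlib

def deduplicate(blocks):
    """Store one copy of each unique block, keyed by its fingerprint.

    Returns the block store and a 'recipe': the ordered list of
    fingerprint references that reconstructs the original sequence.
    """
    store = {}    # fingerprint -> single stored copy of the data
    recipe = []   # ordered references used to rebuild the backup
    for block in blocks:
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # keep only the first copy seen
        recipe.append(fp)
    return store, recipe

def reconstruct(store, recipe):
    """Rebuild the backup by following each reference."""
    return [store[fp] for fp in recipe]

# Four blocks arrive, but only two unique copies need to be stored.
blocks = [b"header", b"payload", b"payload", b"header"]
store, recipe = deduplicate(blocks)
```

Each repeated block costs only the size of a fingerprint reference rather than the size of the block itself, which is the source of the storage and network savings discussed above.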
Some techniques use fixed-size data blocks to generate the data fingerprints. As a data set is received, the backup system generates a data fingerprint for each fixed-size block received (e.g., for each 16 KB block of data). The system then compares each data fingerprint to a database of stored fingerprints to detect duplicate blocks. An advantage of this technique is its simplicity—the system performs only one fingerprint operation per data block. However, this method does not work well if data is added to or deleted from the data set between backups of a storage device. For example, if a single section of data is inserted in the middle of a data set that has previously been backed up, all data after the insertion is shifted relative to the block boundaries of the previous backup. Even though the data after the insertion is unchanged, the duplication will not be detected, because the data is divided into blocks differently in the second data set.
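The shift problem can be demonstrated concretely. The sketch below (a toy example with 16-byte rather than 16 KB blocks) fingerprints fixed-size blocks, then inserts a single byte at the front of the data set; every block boundary shifts, so no fingerprint from the first pass matches the second even though almost all of the data is unchanged.

```python
import hashlib

def fixed_block_fingerprints(data: bytes, block_size: int = 16):
    """Fingerprint each fixed-size block of a data set."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

original = b"A" * 16 + b"B" * 16 + b"C" * 16
modified = b"X" + original   # one inserted byte shifts everything after it

old = fixed_block_fingerprints(original)
new = fixed_block_fingerprints(modified)
shared = set(old) & set(new)
# Although 48 of the 49 bytes are unchanged, no block fingerprints
# match: the insertion moved every subsequent block boundary.
```

This is exactly the failure mode described above: the duplication is real, but fixed-size blocking cannot see it.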
Some de-duplication techniques attempt to solve this problem by using variable-sized data blocks or rolling hashes to generate data fingerprints. For example, some systems evaluate multiple window sizes based on a single starting point and select a window size based on a comparison function. However, these techniques tend to be computationally intensive and difficult to execute with reasonable efficiency. In particular, they require the system to calculate a large number of hashes and/or perform a large number of comparison operations.
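A generic rolling-hash, content-defined chunking scheme of this kind can be sketched as follows. The window size, divisor, and polynomial-hash parameters are arbitrary demo values, not taken from any particular system: a chunk boundary is declared wherever a rolling hash of the last few bytes satisfies a predicate, so boundaries track content rather than absolute offsets, and an insertion disturbs only the chunks that overlap it.

```python
import hashlib
import random

WINDOW = 8            # bytes in the rolling window (demo value)
DIVISOR = 16          # average chunk size ~ DIVISOR bytes (demo value)
BASE, MOD = 257, 1_000_000_007

def chunk_fingerprints(data: bytes):
    """Split data at content-defined boundaries; fingerprint each chunk.

    A polynomial rolling hash of the last WINDOW bytes is updated in
    O(1) per byte; a boundary is cut wherever the hash is divisible by
    DIVISOR, so boundary positions depend only on local content.
    """
    fingerprints = []
    start, h = 0, 0
    power_w = pow(BASE, WINDOW, MOD)      # weight of the oldest byte
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * power_w) % MOD
        if i >= WINDOW - 1 and h % DIVISOR == 0:
            fingerprints.append(hashlib.sha256(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        fingerprints.append(hashlib.sha256(data[start:]).hexdigest())
    return fingerprints

rng = random.Random(42)                   # deterministic demo data
original = bytes(rng.randrange(256) for _ in range(400))
modified = original[:200] + b"#inserted#" + original[200:]

shared = set(chunk_fingerprints(original)) & set(chunk_fingerprints(modified))
# Unlike the fixed-size scheme, chunks away from the insertion point
# still match, because boundaries resynchronize with the content.
```

The per-byte hash update in the inner loop is exactly the extra computational work the passage above identifies: the boundary resynchronization that makes insertions tolerable is bought with many more hash and comparison operations than one fingerprint per fixed-size block.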