In the past decade, many backup environments have evolved from tape-centric to disk-centric. Disk-based backup grew out of the desire to have backup data online and immediately accessible, rather than spread across numerous tapes that had to be located, mounted, and sequentially accessed in case of data loss. Deduplication was used to reduce the capacity requirements of the backup system, in order to permit disk-based backup to compete financially with tape. The most common type of deduplication breaks a data stream into “chunks,” using features of the data to ensure that most small changes to the data do not affect the chunk boundaries. This way, inserting a few bytes early in a file might change the chunk where the insertion occurs, but the rest of the file will be deduplicated.
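The boundary-selection idea can be illustrated with a minimal sketch. It assumes a fixed 16-byte window and declares a boundary wherever the low 8 bits of a hash of that window are zero. The names `split_chunks`, `WINDOW`, and `MASK` are hypothetical, and a plain SHA-1 of each window stands in for the rolling hash that a production system would more likely use for speed.

```python
import hashlib
import random

WINDOW = 16   # bytes examined when deciding on a boundary (assumed value)
MASK = 0xFF   # boundary when the low 8 hash bits are zero: ~256-byte average chunks


def split_chunks(data: bytes) -> list[bytes]:
    """Cut `data` after every position whose trailing window hashes to a marker value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        window_hash = hashlib.sha1(data[end - WINDOW:end]).digest()
        if int.from_bytes(window_hash[:4], "big") & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


# Demo: insert a few bytes near the start of a pseudo-random "file".
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
inserted = b"HELLO"
modified = original[:100] + inserted + original[100:]

stored = set(split_chunks(original))
modified_chunks = split_chunks(modified)
new_chunks = [c for c in modified_chunks if c not in stored]
print(f"{len(new_chunks)} of {len(modified_chunks)} chunks are new")
```

Because each boundary decision depends only on the bytes in its window, boundaries downstream of the insertion shift with the data. In this sketch, the number of chunks that must be stored anew is therefore bounded by the insertion length plus the window size, regardless of file size.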
With tapes, a backup server could identify a pool of completely equivalent tape drives on which to write a given backup. When data were ready to be written, the next available tape drive would be used. Capacity for backup was a non-issue, since it was trivial to buy more magnetic tape. The main constraint in sizing the backup environment was ensuring enough throughput across the backup devices to meet the “backup window,” i.e., the time in which all backups must complete.
With deduplicated disk backups, this equivalence across devices is lost: storage appliances are no longer interchangeable. Writing repeatedly to the same storage system gains efficiency by suppressing duplicate data; these savings can extend back to the backup server, or even to the client being backed up, if the duplicates are identified before the data cross the network.
The “stickiness” of the assignment of a client to a storage appliance changes the role of the backup administrator. Instead of simply pooling many clients across many tape drives and worrying only whether all the backups needed on a given night can make their way through the system in time, the administrator must map clients to storage appliances a priori. Once a client has been paired with a particular storage appliance, it benefits greatly from returning to that appliance and omitting duplicates. Should it move to a different appliance, it must start over, writing all of its data anew. But if its target appliance is overloaded, the client queues up and waits longer than desired, possibly causing the backup not to complete within its backup window.
Capacity is similarly problematic: a client that is being backed up onto a full storage appliance either is not protected or must move to another, less loaded system and pay the cost of copying data that would otherwise have been suppressed through deduplication. In summary, once a client is backed up onto a particular storage appliance, there is a tension between the benefits of continuing to use it and the disadvantages that may ensue from overload; at some tipping point, the client may move elsewhere. It then pays a short-term overhead (the loss of deduplication against its earlier backups) but gets long-term benefits.
Because of the additional complexity of data movement overheads and the need to estimate capacity requirements in light of content overlap, it is beneficial to automate the assignment of clients to deduplicating backup storage appliances.
Within a backup environment, a number of “clients” are backed up onto a backup storage appliance, which can perform deduplication by identifying chunks that already exist on the appliance. When a client writes duplicate data, a duplicate is most commonly a piece of a file that the same client previously backed up, but it can also be due to overlap with data from a different client backed up onto that appliance. Clients can use a backup protocol in which they first send the backup appliance only the “fingerprints” that uniquely identify chunks, learn which chunks are not yet stored on the appliance, and then send the full content for just those chunks.
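The fingerprint exchange can be sketched in a few lines. This is an illustration only: the `Appliance` class, the `back_up` function, and the use of SHA-256 as the fingerprint are assumptions made for the example, not details of any particular backup protocol.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by a hash of its content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()


class Appliance:
    """Hypothetical in-memory stand-in for a deduplicating storage appliance."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def missing(self, fingerprints: list[str]) -> set[str]:
        """Report which fingerprints have no stored chunk yet."""
        return {fp for fp in fingerprints if fp not in self.chunks}

    def store(self, chunk: bytes) -> None:
        self.chunks[fingerprint(chunk)] = chunk


def back_up(appliance: Appliance, chunks: list[bytes]) -> int:
    """Send fingerprints first, then only the missing chunks; return payload bytes sent."""
    by_fp = {fingerprint(c): c for c in chunks}   # step 1: fingerprints only
    needed = appliance.missing(list(by_fp))       # step 2: appliance answers
    for fp in needed:                             # step 3: full content for misses
        appliance.store(by_fp[fp])
    return sum(len(by_fp[fp]) for fp in needed)


appliance = Appliance()
first = back_up(appliance, [b"alpha", b"beta", b"gamma"])   # nothing stored yet
second = back_up(appliance, [b"alpha", b"beta", b"delta"])  # same client, one new chunk
other = back_up(appliance, [b"gamma", b"delta"])            # different client, full overlap
```

In this run the first backup transfers all 14 payload bytes, the second transfers only the 5 bytes of the new chunk, and the backup from a different client transfers nothing, since both of its chunks were already stored by the first client.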
A similar situation exists when these backup storage appliances are replicated to other backup storage appliances for added redundancy and reliability. One deduplicating backup storage appliance can send another such appliance a set of fingerprints, identify which fingerprints correspond to chunks not already stored by the other appliance, and send the missing chunks. Additionally, techniques such as “delta replication” can be used to further reduce the total amount of data transferred between replicas, by sending commands to recreate a chunk from an existing chunk.
When many backup storage appliances are being replicated to two or more other backup storage appliances, each appliance being replicated has a choice of replication targets. The choice of target depends on several factors, such as the capacity and throughput of each target and the other content that already exists on the replication targets or will exist there in the future.
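One way these factors might be combined is sketched below, purely as an assumption for illustration. For each candidate target, the sketch estimates how many bytes are not already present there, discards targets without enough free capacity, and picks the one with the shortest estimated transfer time. The `Target` record and `choose_target` function are hypothetical, and the sketch ignores content that will arrive on a target in the future.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    name: str
    free_bytes: int      # remaining capacity on the target
    throughput: float    # bytes per second available for replication
    fingerprints: set = field(default_factory=set)  # chunks already stored


def new_bytes(source_chunks: dict, target: Target) -> int:
    """Bytes that would actually be sent: chunks the target does not hold yet."""
    return sum(size for fp, size in source_chunks.items()
               if fp not in target.fingerprints)


def choose_target(source_chunks: dict, targets: list):
    """Return the feasible target with the shortest estimated transfer, or None."""
    best, best_seconds = None, float("inf")
    for target in targets:
        need = new_bytes(source_chunks, target)
        if need > target.free_bytes:
            continue  # not enough capacity for the non-duplicate data
        seconds = need / target.throughput
        if seconds < best_seconds:
            best, best_seconds = target, seconds
    return best


# Source appliance content as fingerprint -> chunk size in bytes (made-up values).
source = {"a": 100, "b": 200, "c": 300}
targets = [
    Target("r1", free_bytes=1000, throughput=50.0, fingerprints={"a", "b"}),
    Target("r2", free_bytes=1000, throughput=50.0),
    Target("r3", free_bytes=100, throughput=100.0, fingerprints={"a"}),
]
chosen = choose_target(source, targets)
```

Here `r1` is selected: it needs only the 300 bytes of chunk `c`, whereas `r2` would need all 600 bytes at the same throughput and `r3`, although faster, lacks the capacity for the 500 bytes it is missing.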