Enterprise-scale backup operations often involve many different types of backup jobs or workloads. For regular backup workloads, in which the entire data set of the backup is sent to a backup storage device (also called a backup appliance), a tree of metadata segments is often created that points to the actual data segments. Typically, each metadata segment covers a large number of data segments, so the metadata overhead for a backup copy whose entire data is sent to the backup appliance (e.g., EMC's DDR system) is very small, usually less than 0.05% of the total backup size for an average backup session. However, in newer backup formats, where only the changed data (<1% of the total backup size) is sent to the appliance but a full copy must still be represented, the overhead of updating and storing the metadata can be extremely expensive. This overhead also makes it prohibitive to cache the metadata in solid-state disk (SSD) devices, which wear out quickly under high churn.
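The contrast between the two workloads can be illustrated with a minimal sketch that counts metadata segments rather than bytes. It assumes a simple fixed fan-out tree in which every metadata segment on the path from a changed data segment to the root must be rewritten. The function names, the fan-out of 512, the segment counts, and the evenly spread change pattern are all hypothetical values chosen for illustration, are not taken from any particular appliance, and are not intended to reproduce the percentages cited above.

```python
def metadata_segments_full(num_data_segments, fanout):
    """Count the metadata segments in a tree that covers every data segment.

    Hypothetical model: each metadata segment references up to `fanout`
    children, and levels are stacked until a single root remains.
    """
    total = 0
    level = num_data_segments
    while level > 1:
        level = -(-level // fanout)  # integer ceiling division
        total += level
    return total


def metadata_segments_touched(changed_indices, num_data_segments, fanout):
    """Count distinct metadata segments on the paths from the changed
    data segments up to the root (the segments that must be rewritten)."""
    touched = 0
    current = set(changed_indices)
    level = num_data_segments
    while level > 1:
        current = {i // fanout for i in current}  # parents of touched nodes
        level = -(-level // fanout)
        touched += len(current)
    return touched


# Assumed workload: 1,000,000 data segments, fan-out 512,
# and 1% of the data segments changed, spread evenly across the backup.
NUM_DATA = 1_000_000
FANOUT = 512
changed = range(0, NUM_DATA, 100)

full_meta = metadata_segments_full(NUM_DATA, FANOUT)
incr_meta = metadata_segments_touched(changed, NUM_DATA, FANOUT)

print("full backup: metadata/data =", full_meta / NUM_DATA)
print("incremental: metadata/data =", incr_meta / len(changed))
```

Under these assumed values, the full backup writes 1,959 metadata segments for 1,000,000 data segments, while the incremental backup rewrites 1,958 metadata segments for only 10,000 changed data segments. The metadata work is nearly identical in both cases, so its cost relative to the data actually sent grows by roughly a factor of one hundred.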
Issues related to present backup solutions thus include a high space cost per backup for high-frequency backups, a high metadata overhead per snapshot, and a high cost, in terms of I/O operations, for identifying changes across backups for the purposes of incremental replication, file verification, and restores from the backup.
At present, the most common use cases of incremental-forever workloads are LUN (Logical Unit Number) or VM (virtual machine) backups using Changed Block Tracking (CBT) technology, and virtual synthetic backups for file systems. In these workloads, the metadata updates can be as expensive as the data update itself. Hence, efficiency in updating and storing metadata becomes critical. With regard to identifying changed parts of a file system for replication, verification, and restore operations, known solutions rely on differencing (“diffing”) the file system tree representations, and such methods incur a great deal of overhead in walking through different versions of the file system tree and cataloguing the difference data.
What is needed, therefore, is a system and method that improves the performance of large-scale backup operations by minimizing the processing and storage of metadata updates. What is further needed is a method and system that uses a sparse metadata segment tree to facilitate efficient backup operations in evolving backup workloads.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain, Data Domain Restorer, and Data Domain Boost are trademarks of EMC Corporation.