Many information technology (“IT”) operations and activities can be scheduled to run one or more times within some periodic cycle (daily, weekly, monthly, quarterly, etc.). One such operation is data backup. Data backups can be essential to the preservation and recovery of data in the event of data loss, for example. To avoid interfering with daily user activities, data backups can be performed during periods of low application server utilization, typically on weeknights and on weekends. The backup job workload can be the same or different from one run to the next, depending on how much data needs to be protected and when. In some applications, backup jobs can be scheduled and/or configured using a commercial backup application, operating system shell scripting, and/or in any other manner.
Backup applications employ a plurality of techniques to manage data designated for backup. One such technique is deduplication. Deduplication can be used to eliminate redundancy in the data produced by periodically executed backup tasks. In some cases, deduplication can reduce data storage capacity consumption as well as inter-site network bandwidth consumption. It can do so by identifying and eliminating similar and/or identical sequences of bytes in a data stream. Deduplication can also include computation of cryptographic and/or simple hashes and/or checksums, as well as one or more forms of data compression (e.g., file compression, rich media data compression, delta compression, etc.).
Deduplication involves identifying similar or identical patterns of bytes within a data stream, and replacing those bytes with fewer representative bytes. As a result, deduplicated data consumes less disk storage capacity than data that has not been deduplicated and, when the data stream must be transmitted between two geographically separate locations, consumes less network bandwidth. Adaptive deduplication strategies combine inter-file and/or intra-file discovery techniques to achieve the aforementioned goals.
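By way of non-limiting illustration, the replacement of identical byte patterns with fewer representative bytes can be sketched as follows. The sketch assumes fixed-size chunks, SHA-256 digests as the representative bytes, and an in-memory dictionary as the chunk store; the names `deduplicate`, `restore`, `stored_bytes` and `CHUNK_SIZE` are hypothetical and are not taken from any particular backup application. It detects only identical chunks, not merely similar ones.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes (illustrative only)


def deduplicate(stream: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a stream into fixed-size chunks and keep each unique chunk once."""
    store = {}   # digest -> chunk bytes (one copy per unique chunk)
    recipe = []  # ordered digests needed to rebuild the original stream
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)  # store the chunk only if unseen
        recipe.append(digest)            # a repeat costs only a 32-byte digest
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original stream from the chunk store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


def stored_bytes(store, recipe) -> int:
    """Bytes held after deduplication: unique chunks plus the recipe digests."""
    return sum(len(c) for c in store.values()) + sum(len(d) for d in recipe)
```

In this sketch, a stream in which the same chunk appears many times is held as a single copy of that chunk plus one short digest per occurrence, and `restore` recovers the original bytes exactly.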
Deduplication can be used to reduce the amount of primary storage capacity that is consumed by email systems, databases and files within file systems. It can also be used to reduce the amount of secondary storage capacity consumed by backup, archiving, hierarchical storage management (HSM), document management, records management and continuous data protection applications. In addition, it can be used to support disaster recovery systems which provide secondary storage at two or more geographically dispersed facilities to protect from the total loss of data when one site becomes unavailable due to a site disaster or local system failure. In such a case, deduplication helps to reduce not only the amount of data storage consumed, but also the amount of network bandwidth required to transmit data between two or more facilities.
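The bandwidth reduction between facilities can be illustrated with a minimal sketch in which a sending site transmits only those chunks that the receiving site does not already hold. The sketch assumes fixed-size chunks identified by SHA-256 digests, models the receiving site as a dictionary, and ignores the cost of exchanging the digests themselves; `chunk_digests` and `replicate` are hypothetical names used only for illustration.

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int = 4096):
    """Yield (digest, chunk) pairs for fixed-size chunks of the data."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).digest(), chunk


def replicate(data: bytes, remote_store: dict, chunk_size: int = 4096) -> int:
    """Copy to the remote site only chunks it lacks; return payload bytes sent."""
    sent = 0
    for digest, chunk in chunk_digests(data, chunk_size):
        if digest not in remote_store:
            remote_store[digest] = chunk  # stands in for a network transfer
            sent += len(chunk)
    return sent
```

Under these assumptions, if a second backup repeats most of the first and adds one new chunk, only that new chunk is counted as sent on the second call to `replicate`.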
Conventional deduplication techniques apply a single level of deduplication to backup streams and therefore forgo the additional reductions that multi-level deduplication can provide. Such techniques are typically limited to optimization of bandwidth or capacity at one level, do not provide optimization at higher levels, and thus do not provide the requisite space or bandwidth savings. As a result, such systems tend to consume a significant amount of network bandwidth and storage capacity, thereby increasing operational costs and reducing the efficiency of networks and data storage facilities. Thus, there is a need for a deduplication mechanism that is capable of providing multi-level deduplication of data zones within an incoming data stream as well as improving the deduplication ratio.