Many information technology (“IT”) operations and activities can be scheduled to run one or more times within some periodic cycle (daily, weekly, monthly, quarterly, etc.). One such application can be data backup. Data backups can be essential to preserving and recovering data in the event of data loss, for example. To avoid interfering with daily user activities, data backups can be performed during periods of low application server utilization, typically on weeknights and on weekends. The backup job workload can be the same or different depending on how much data needs to be protected and when. In some applications, backup jobs can be scheduled and/or configured using a commercial backup application, operating system shell scripting, and/or in any other manner.
Backup applications employ a plurality of techniques to manage data designated for backup. One such technique includes deduplication. Deduplication can be used to eliminate redundancy in a data stream created during the execution of periodically executed backup tasks. In some cases, deduplication can reduce data storage capacity consumption as well as inter-site network bandwidth consumption. It can do so by identifying and eliminating similar and/or identical sequences of bytes in a data stream. Deduplication can also include computation of cryptographic and/or simple hashes and/or checksums, as well as one or more forms of data compression (e.g., file compression, rich media data compression, delta compression, etc.).
Deduplication involves identifying similar or identical patterns of bytes within a data stream, and replacing those bytes with fewer representative bytes. As a result, deduplicated data consumes less disk storage capacity than data that has not been deduplicated. When the data stream must be transmitted between two geographically separate locations, deduplicated data also consumes less network bandwidth. Adaptive deduplication strategies combine inter-file and/or intra-file discovery techniques to achieve the aforementioned goals.
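By way of a non-limiting illustration, the replacement of repeated byte patterns with fewer representative bytes can be sketched as follows. The sketch assumes fixed-size chunking and SHA-256 digests, neither of which is specified above, and it detects only identical chunks, not similar ones. The names `CHUNK_SIZE`, `deduplicate`, `restore` and `stored_bytes` are hypothetical and are used for illustration only.

```python
import hashlib

# Hypothetical fixed chunk size in bytes; an assumption made for illustration.
CHUNK_SIZE = 4096


def deduplicate(stream: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a data stream into fixed-size chunks and keep each unique chunk once.

    Returns a chunk store (digest -> chunk bytes) and a recipe (the ordered
    list of digests needed to rebuild the original stream).
    """
    store = {}
    recipe = []
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        # An identical chunk seen earlier is not stored a second time.
        store.setdefault(digest, chunk)
        # The 32-byte digest stands in for the chunk in the recipe.
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original stream from the chunk store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


def stored_bytes(store, recipe) -> int:
    """Bytes held after deduplication: unique chunks plus recipe digests."""
    return sum(len(chunk) for chunk in store.values()) + sum(len(d) for d in recipe)
```

In this sketch, a stream containing the same 4096-byte chunk three times stores that chunk once, and each occurrence is represented by a 32-byte digest. The same digests could also be exchanged between two sites to avoid retransmitting chunks the remote site already holds, which is one way the bandwidth reduction described above might be obtained.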
Deduplication can be used to reduce the amount of primary storage capacity that is consumed by email systems, databases and files within file systems. It can also be used to reduce the amount of secondary storage capacity consumed by backup, archiving, hierarchical storage management (“HSM”), document management, records management and continuous data protection applications. In addition, it can be used to support disaster recovery systems which provide secondary storage at two or more geographically dispersed facilities to protect from the total loss of data when one site becomes unavailable due to a site disaster or local system failure. In such a case, deduplication helps to reduce not only the amount of data storage consumed, but also the amount of network bandwidth required to transmit data between two or more facilities.
Conventional deduplication-based data storage systems perform site-wide deduplication by using a single compute server that is responsible for deduplicating all data stored on one or more simple disk storage units that have no deduplication processing capability. However, these deduplication systems typically suffer from availability issues, where failure/loss of a single compute server can render all data stored on the simple disk units inaccessible to the users and/or other systems. As the amount of backup data increases, additional disk storage units are added, but since they cannot assist in deduplication processing, the end-to-end backup time of these systems increases to the point where it exceeds the backup window limits of the IT department's service level agreement.
Further, with conventional magnetic disk and/or magnetic tape and tape drive solutions that do not support deduplication, whether within or among medium/drive combinations, each full backup of a large backup image can consume as much capacity as the size of that image. Storing weeks, months and/or years of retention backup data can therefore become time-consuming and expensive. Moreover, with traditional deduplication appliances that have a single deduplication engine acting as a front-end compute node for multiple simple disk storage units, even if a very large backup image were sent to the deduplication appliance, the front-end compute node may have no way to process the data in parallel, since it contains a single deduplication engine. Additionally, if that deduplication appliance fails during the backup process, the backup operation can remain in a failure state until the appliance is repaired/replaced, thereby making this single compute node architecture inefficient.
Thus, there is a need for a deduplication system that can manage data backup activities of incoming backup data streams, and maintain a constant backup window as the amount of data to be backed up increases over time. Moreover, there is a need for a deduplication system that can minimize storage capacity consumption within a deduplication system's grid servers, reduce bandwidth (e.g., wide-area network (“WAN”) bandwidth), improve backup job completion rates, enable backup capacity that is larger than a single grid server can process, enable faster backup job completion, perform automatic load-balancing, etc.