1. Field of the Invention
The present invention relates to systems and methods for storing and transmitting data. In particular, the present invention relates to systems and methods that use a content-aware, adaptive deduplication process as a form of electronic data compression to store and transmit data efficiently.
2. Background
Deduplication involves identifying similar or identical patterns of bytes within a data stream and replacing those bytes with fewer representative bytes. As a result, deduplicated data consumes less disk storage capacity than data that has not been deduplicated and, when the data stream must be transmitted between two geographically separate locations, consumes less network bandwidth. Adaptive deduplication strategies combine inter-file and/or intra-file discovery techniques to achieve these goals.
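By way of illustration only, the replacement of repeated byte patterns with shorter representative bytes can be sketched as follows. This is a minimal sketch of conventional identical-chunk deduplication, not the adaptive technique of the present invention. The fixed chunk size, the use of SHA-256 digests as the representative bytes, and the names `deduplicate` and `reconstruct` are all assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Returns (store, recipe): store maps a digest to the chunk bytes, and
    recipe is the ordered list of digests needed to rebuild the stream.
    The fixed chunk size and SHA-256 are illustrative assumptions.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        # Keep only the first copy of each distinct chunk.
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def reconstruct(store, recipe) -> bytes:
    """Rebuild the original stream from the stored chunks and the recipe."""
    return b"".join(store[digest] for digest in recipe)
```

In this sketch, a stream made of four identical chunks followed by one different chunk would be stored as two chunks plus a five-entry recipe, and `reconstruct` would return the original bytes unchanged.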
Deduplication can be used to reduce the amount of primary storage capacity that is consumed by email systems, databases and files within file systems. It can also be used to reduce the amount of secondary storage capacity consumed by backup, archiving, hierarchical storage management (HSM), document management, records management and continuous data protection applications. In addition, it can be used to support disaster recovery systems which provide secondary storage at two or more geographically dispersed facilities to protect from the total loss of data when one site becomes unavailable due to a site disaster or local system failure. In such a case, deduplication helps to reduce not only the amount of data storage consumed, but also the amount of network bandwidth required to transmit data between two or more facilities.
Many popular deduplication apparatuses employ deduplication methods that are not aware of specific application-level content within their incoming data streams. Examples of application-level content include, but are not limited to, Microsoft Exchange data stores, Microsoft SQL Server and Oracle databases, Solaris, Windows, and Linux file systems, Microsoft and VMware virtual machine images, and Network Data Management Protocol (“NDMP”) dumps.
The lack of application-level content awareness in many deduplication apparatuses prevents them from identifying the data type or types that are not achieving acceptable levels of deduplication. Unacceptable levels occur, for example, when the incoming data stream includes regions of data that are encrypted or pre-compressed, or includes databases that have been re-indexed, all of which typically produce below-average deduplication ratios. Poor deduplication ratios increase the consumption of local disk storage capacity and inter-site WAN bandwidth.
While content awareness is a key element of a manageable deduplication system, another important architectural metric is the size of the managed deduplicated objects. If the size of each deduplicated object is set too small, the amount of metadata that must be employed to manage the resulting objects becomes unmanageable. As an example, most deduplication systems that are commercially popular today operate by identifying identical chunks, each a kilobyte (“KB”) sized deduplicated object. On a 10 terabyte (“TB”) appliance with a 10 KB average chunk size, one billion deduplicated objects must be identified and managed. With commonly available main memory capacities, it is unlikely that the entire metadata collection for one billion deduplicated objects can be maintained in memory, so performance is degraded as metadata must be paged into and out of memory during the metadata matching processes.
Thus, there is a need to provide an adaptive deduplication technique that operates on the premise of identifying and managing regions of contiguous bytes from an incoming data stream, termed “zones,” each of which may be as large as tens of megabytes (“MB”) and may be similar, but not necessarily identical, to other zones, so that these zones produce very effective deduplication. By managing zones of relatively large size rather than kilobyte-sized chunks, the amount of metadata that must be maintained is reduced by three orders of magnitude, allowing all zone metadata to be easily retained in main memory during deduplication processing.
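The object counts discussed above can be checked with a short calculation. The following is a sketch only. It assumes decimal units (1 KB = 10^3 bytes), takes 10 MB as a representative zone size, and uses a hypothetical figure of 64 bytes of metadata per managed object; the per-object figure does not come from this specification and serves only to make the memory comparison concrete.

```python
KB = 10**3
MB = 10**6
TB = 10**12

# Hypothetical per-object metadata size, assumed for illustration only.
ASSUMED_METADATA_BYTES_PER_OBJECT = 64


def object_count(capacity_bytes: int, object_size_bytes: int) -> int:
    """Number of objects of the given average size that fill the capacity."""
    return capacity_bytes // object_size_bytes


capacity = 10 * TB

chunk_objects = object_count(capacity, 10 * KB)  # 1,000,000,000 chunks
zone_objects = object_count(capacity, 10 * MB)   # 1,000,000 zones

# Ratio of managed objects: 1,000, i.e. three orders of magnitude.
reduction_factor = chunk_objects // zone_objects

# Metadata footprint under the assumed 64 bytes per object.
chunk_metadata_bytes = chunk_objects * ASSUMED_METADATA_BYTES_PER_OBJECT
zone_metadata_bytes = zone_objects * ASSUMED_METADATA_BYTES_PER_OBJECT
```

Under these assumptions, chunk-level metadata would occupy about 64 gigabytes, whereas zone-level metadata would occupy about 64 MB, which is consistent with retaining all zone metadata in main memory.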