1. Field of the Invention
This invention generally relates to digital data storage and, more particularly, to a system and method for the client-side deduplication (dedup) of data being stored across a distributed network of hosts.
2. Description of the Related Art
Currently, storage capacity grows by at least 36% every year. In times of economic prosperity, as in the mid-2000s, storage capacity growth can reach as high as 90%. With the emergence of cloud services, storage has become a pivotal technology and is often the basis of any on-line service. As a result of the growing quality and complexity of the data and its availability on the Internet, storage capacity is likely to grow even faster in the future. Typically, cloud service providers rely on clusters of commodity equipment to deliver their service. Scale-out storage (such as a distributed file system) will represent the majority of the operating storage system as the service grows. The ability to offer data reduction in a distributed environment will be critical to the profitability of a business.
With a growth in capacity comes a need for more drives in a storage shelf and, therefore, more electric power. Since a large majority of data is duplicate, it should be unnecessary to spend resources saving copies of data already in storage. For example, a customer may generate weekly graph reports of their operations. Typically, the graph report contains much of the same information from one week to another. By using deduplication, a storage device can eliminate redundant blocks of data and replace them with a pointer to a common block.
As noted in Wikipedia, data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, only one copy of the data is stored. However, the indexing of all data is retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand is reduced to only 1 MB. Different applications have different levels of data redundancy. Backup applications generally benefit the most from de-duplication due to the nature of repeated full backups of an existing file system.
In addition to saving disk space, acquisition costs, power, and cooling requirements are reduced, making a disk suitable for first stage backup and restore, and for retention that can easily extend to months. Also, restore service levels are higher, media handling errors are reduced, and more recovery points are available on fast recovery media. Advantageously, data deduplication reduces the data that must be sent across a network for remote backups, replication, and disaster recovery.
Deduplication solutions work by comparing chunks (blocks) of data to detect duplicates. Each block of data is assigned a presumably unique identification, calculated by the software, typically using cryptographic hash functions. A requirement of these functions is that if the data is identical, the identification is identical. Therefore, if the software sees that a given identification already exists in the deduplication namespace, then it will replace that duplicate chunk with a link. Upon read back of the file, wherever a link is found, the system simply replaces that link with the referenced data chunk. The de-duplication process is intended to be transparent to end users and applications.
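The chunk-and-link mechanism described above can be illustrated with a minimal sketch. It assumes fixed 4 KB chunks and SHA-256 as the identification function; the class name ChunkStore and its methods are hypothetical and are not taken from any particular product.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed block size, in bytes


class ChunkStore:
    """Toy deduplicating store: unique chunks keyed by their SHA-256 digest."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (the deduplication namespace)
        self.files = {}   # file name -> ordered list of digests (the links)

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if its identification is not yet known.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        # Rehydrate: replace every link with the referenced chunk.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())
```

Writing the same attachment under 100 different names leaves stored_bytes() equal to the size of a single copy, while read() still returns the full data for each name.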
In some systems, blocks are defined by physical layer constraints (e.g., 4 KB block size in write anywhere file layout (WAFL)). In some systems only complete files are compared, which is called Single Instance Storage or SIS. The most intelligent (but CPU intensive) method is the sliding-block method, in which a window is passed along the file stream to seek out more naturally occurring internal file boundaries.
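One common way to realize a sliding-block scheme is content-defined chunking with a rolling hash, sketched below. The window size, the boundary mask, and the polynomial hash constants are illustrative assumptions, as is the function name content_defined_chunks; production systems often use Rabin fingerprints and enforce minimum and maximum chunk sizes.

```python
WINDOW = 16          # bytes covered by the sliding window (assumed)
MASK = (1 << 6) - 1  # cut when the low 6 hash bits are zero (~64-byte chunks)
BASE = 257
MOD = (1 << 31) - 1


def content_defined_chunks(data):
    """Yield chunks whose boundaries depend on content, not absolute offset."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        # A boundary is declared only from the bytes inside the window.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]
```

Because each boundary depends only on the bytes inside the window, inserting a byte at the front of a file changes the first chunk but leaves the later chunks byte-identical, so they still match existing hashes. With fixed-offset blocks, the same insertion shifts every block.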
Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file-system. The file system periodically scans new files creating hashes, and compares them to hashes of existing files. When files with the same hashes are found, the file copy is removed and the new file points to the old file. Unlike hard links however, duplicated files are considered to be separate entities. If one of the duplicated files is later modified, then a copy of the file is written or a changed block is created. Target deduplication is the process of removing duplicates of data in the secondary store. Generally this is a backup store such as a data repository or a virtual tape library.
There are three different ways of performing the deduplication process. In client backup deduplication, the deduplication hash calculations are initially performed on the source (client) machines. Files that have hashes identical to those of files already in the target device are not sent; the target device just creates appropriate internal links to reference the duplicated data. The benefit of the client backup approach is that the unnecessary sending of data across a network is avoided, thereby reducing traffic load.
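The client backup exchange can be sketched as a two-step protocol: the client sends hashes, the target answers with the hashes it lacks, and only the corresponding data crosses the network. The sketch below works at chunk rather than file granularity, which is an assumption, and the names Target, missing, commit, and client_backup are hypothetical.

```python
import hashlib


def digest(chunk):
    return hashlib.sha256(chunk).hexdigest()


class Target:
    """Hypothetical backup target holding unique chunks by digest."""

    def __init__(self):
        self.chunks = {}
        self.files = {}

    def missing(self, digests):
        # Step 1: report which digests are not yet stored on the target.
        return {d for d in digests if d not in self.chunks}

    def commit(self, name, digests, new_chunks):
        # Step 2: accept only the new chunks, then record links for the file.
        self.chunks.update(new_chunks)
        self.files[name] = list(digests)


def client_backup(target, name, chunks):
    """Hash on the client; return the number of payload bytes sent."""
    by_digest = {digest(c): c for c in chunks}
    digests = [digest(c) for c in chunks]
    needed = target.missing(by_digest)
    payload = {d: by_digest[d] for d in needed}
    target.commit(name, digests, payload)
    return sum(len(c) for c in payload.values())
```

Backing up the same chunks a second time sends zero payload bytes; only the list of hashes is exchanged.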
With post-process deduplication, new data is first stored on the storage device and then a process, at a later time, analyses the data looking for duplication. The benefit to this approach is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Solutions offering policy-based operation can give users the ability to defer optimization on “active” files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which is an issue if the storage system is near full capacity. Another issue is the difficulty of predicting when the process will be completed.
In-line deduplication is a process where the deduplication hash calculations are performed on the target device as the data enters the device in real-time. If the device spots a block that it has already stored on the system, it does not store the new block, but merely references the existing block. The benefit of in-line deduplication over post-process deduplication is that it requires less storage, as data is not duplicated. On the negative side, it is frequently argued that because hash calculations and lookups take so long, data ingestion can be slower, thereby reducing the backup throughput of the device.
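The storage trade-off between the in-line and post-process approaches can be made concrete with a small sketch that tracks final and peak bytes stored. The function names and the use of SHA-256 are assumptions made for illustration; the sketch does not model ingestion latency.

```python
import hashlib


def inline_ingest(blocks):
    """Hash each block on arrival; return (final_bytes, peak_bytes) stored."""
    store = {}
    stored = peak = 0
    for block in blocks:
        key = hashlib.sha256(block).digest()  # lookup happens before the write
        if key not in store:
            store[key] = block
            stored += len(block)
        peak = max(peak, stored)
    return stored, peak


def post_process_ingest(blocks):
    """Land every block first, then deduplicate in a later pass."""
    landing = list(blocks)                    # no hashing on the write path
    peak = sum(len(b) for b in landing)       # duplicates occupy space for now
    store = {}
    for block in landing:                     # deferred scan
        store.setdefault(hashlib.sha256(block).digest(), block)
    final = sum(len(b) for b in store.values())
    return final, peak
```

Both functions end with the same amount of data stored, but the post-process variant temporarily needs room for every duplicate, which matters when the device is near full capacity.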
Since most data deduplication solutions are slow, they are more suited to secondary storage in an offline mode. This typically includes the backup process, which can be done in batch offline mode. Most of the post-processing systems fall into this category.
Data deduplication solutions rely on cryptographic hash functions for identification of duplicate segments of data. A hash collision may result in data loss if a block of data is replaced by incorrect data. To address this problem, very large hash values may be used, so that statistically there is a far greater chance of hardware failure than a hash collision. Solutions utilizing post-process architectures may offer bit-for-bit validation prior to garbage collection of original data for guaranteed data integrity. Some examples of hash algorithms include MD5, SHA-1, and SHA-256.
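The effect of hash size on collision risk can be estimated with the standard birthday-bound approximation, as in the sketch below. It assumes hash outputs are uniformly distributed and ignores deliberately constructed collisions; the function name collision_probability is hypothetical.

```python
import math


def collision_probability(num_blocks, hash_bits):
    """Birthday-bound estimate: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    pairs = num_blocks * (num_blocks - 1) // 2
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / 2 ** hash_bits)
```

For example, collision_probability(2**40, 128) is roughly 2**-49 for a 128-bit hash such as MD5, while the same block count with a 256-bit hash such as SHA-256 yields a probability many orders of magnitude smaller.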
Another major drawback of data deduplication is the intensive computational power required. For every file or block of data, all the bytes are used to compute a hash value. The hash then needs to be looked up to see if it matches existing hashes. To improve performance, a system may use a combination of weak and strong hashes. Weak hashes are much faster to calculate, but there is a greater chance of a hash collision. Systems that utilize weak hashes may subsequently calculate a strong hash, and use the strong hash as the determining factor as to whether data blocks are the same. The system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The “rehydration” of files does not require this processing, and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance.
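A weak-then-strong lookup can be sketched as follows, using Adler-32 (zlib.adler32) as the weak hash and SHA-256 as the strong hash. Both choices, and the class name TwoTierIndex, are assumptions for illustration. The strong hash is computed only when the weak hash has already been seen.

```python
import hashlib
import zlib


class TwoTierIndex:
    """Index keyed by a cheap weak hash and confirmed by a strong hash."""

    def __init__(self):
        self.buckets = {}      # weak hash -> list of stored entries
        self.strong_calls = 0  # how often the expensive hash was needed

    def _strong(self, block):
        self.strong_calls += 1
        return hashlib.sha256(block).digest()

    def add(self, block):
        """Return True if the block is a duplicate, False if newly stored."""
        bucket = self.buckets.setdefault(zlib.adler32(block), [])
        if not bucket:
            # No weak match: the block is new and the strong hash is deferred.
            bucket.append({"block": block, "strong": None})
            return False
        strong = self._strong(block)
        for entry in bucket:
            if entry["strong"] is None:
                entry["strong"] = self._strong(entry["block"])
            if entry["strong"] == strong:
                return True   # the strong hash is the determining factor
        bucket.append({"block": block, "strong": strong})
        return False
```

Blocks whose weak hashes collide, such as bytes([0, 2, 0]) and bytes([1, 0, 1]) under Adler-32, are still kept as distinct blocks because their strong hashes differ.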
Scaling has also been a challenge for deduplication systems because the hash table needs to be shared across storage devices. If there are multiple disk backup devices in an infrastructure with discrete hash tables, then space efficiency is adversely affected. A hash table shared across devices—a global dedup hash table—preserves space efficiency, but is technically challenging from a reliability and performance perspective.
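The loss of space efficiency with discrete hash tables can be shown with a short sketch that counts the chunks kept across several devices. The function name stored_chunks and its shared_index flag are hypothetical, and the sketch ignores the reliability and performance costs of actually sharing the table.

```python
import hashlib


def stored_chunks(streams, shared_index):
    """Count chunks kept when each device deduplicates its own stream.

    shared_index=True models one global hash table; False models one
    discrete hash table per device.
    """
    global_index = set()
    total = 0
    for stream in streams:  # one stream per device
        index = global_index if shared_index else set()
        for chunk in stream:
            key = hashlib.sha256(chunk).digest()
            if key not in index:
                index.add(key)
                total += 1
    return total
```

With three devices receiving the streams [a, b], [a, c], and [a, b], a global table keeps three chunks in total, whereas discrete tables keep six because each device stores its own copy of the shared chunks.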
Thus, there is currently no practical ability to deduplicate data blocks in a cluster file system. Some vendors, such as NetApp, run their storage in an active-active fashion. Deduplication is offered on both systems, even during a failover, only because the deduplication table is written to two storage controllers for each transaction. Such a process is wasteful of bandwidth and processing resources. Further, deduplication technology typically focuses on eliminating redundancy once the data is on the storage device. Most clients, however, are limited to 1 gigabit per second (Gb/s) throughput, and the overall process is slowed by the communication of duplicate data that is ultimately not stored.
It would be advantageous if there were a practical means of performing deduplication across a distributed infrastructure of network-connected hosts.