One of the most researched fields in information technology today is deduplication. Deduplication may be defined as removing duplicate copies of the same data (usually blocks or chunks of data) and replacing them with a reference (pointer) to a single copy of the data. A common example of this is an email that is sent to 100 people in an organization and thus gets stored on disk 100 times. The amount of redundant information can be very large if the email contains a large document.
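The definition above can be illustrated with a minimal sketch of fixed-size block deduplication. It is not taken from any particular product: the function names `deduplicate` and `reconstruct`, the 4096-byte block size, and the use of SHA-256 digests as block references are all assumptions chosen for illustration.

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}  # digest -> block bytes (the single stored copy)
    refs = []   # ordered digests acting as references (pointers) to stored blocks
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # store the block only the first time it is seen
        refs.append(digest)
    return store, refs

def reconstruct(store, refs) -> bytes:
    """Rebuild the original data by following the references in order."""
    return b"".join(store[d] for d in refs)

# The email example: one 16 KiB attachment stored for 100 recipients.
# The attachment is built from SHA-256 digests so that its four blocks differ.
attachment = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(512))
mailboxes = attachment * 100

store, refs = deduplicate(mailboxes)
print(len(refs), "blocks referenced,", len(store), "blocks actually stored")
```

In this sketch the 100 copies are described by 400 references, while only the four distinct blocks of the attachment are stored.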
Most archive programs stream the metadata, headers, and file data into a single file, compacting and mixing them together. This makes such an archive file difficult to deduplicate for any deduplicating storage service (appliance, OS filesystem, storage system, . . .), because the file data is stored in small chunks interspersed with metadata and header data, which makes the contents of the archive file unique.
To resolve the problem of poor deduplication of archive volume formats, archive application vendors often incorporate deduplication algorithms directly in the archive program to permit storing the data that has been deduplicated during the archive process. This requires each vendor to develop and implement their own deduplication methods.
At the same time, more and more filesystems and storage systems have deduplication technology built in, so that the user's data is deduplicated directly when it is stored in a file, whether or not it is processed by an archive program. Thus archive vendors that build deduplication algorithms into their programs create inefficiencies, either by trying to deduplicate the data a second time or by requiring the operating system to reconstruct the original non-deduplicated data only for it to be deduplicated again in the archive process.
Most deduplication research (patents) today provides methods to improve the speed and efficiency of deduplicating data. The invention described herein teaches creating a universal archive volume that may subsequently be optimally deduplicated by existing and future deduplication methods developed for filesystems, for storage systems, and by third parties. User-settable options permit tuning the archive volume for the most efficient deduplication.
When an archive program creates an archive image (called a volume), this image contains many files, and each file has metadata (time, date, permissions, . . .), headers, and other data interspersed with the actual file data. In fact, the file data is generally broken into small chunks (65K bytes) for transmission from the client machine to the storage machine and is then compacted into the archive image (volume) with additional header data that permits reconstructing the original file that was backed up. This means that the original file, which could be easily deduplicated, is now stored in smaller chunks interspersed with archive information. The traditional archive volume, represented by the bar labeled "traditional archive volume" in FIG. 7, therefore becomes unique and does not deduplicate well using existing deduplication algorithms.
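The effect described above can be reproduced with a small sketch. The archive layout here is hypothetical and is not any vendor's actual format: `traditional_volume` simply places an 11-byte header in front of every chunk. The 4096-byte deduplication block size and the reading of the chunk size as 65,536 bytes are likewise assumptions, as are all function names.

```python
import hashlib

BLOCK = 4096   # assumed block size of the deduplicating storage
CHUNK = 65536  # assumed chunk size used by the hypothetical archive writer

def sample_file(n_bytes: int) -> bytes:
    """Deterministic, non-repeating sample data built from SHA-256 digests."""
    out = bytearray()
    i = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(i.to_bytes(8, "big")).digest()
        i += 1
    return bytes(out[:n_bytes])

def block_digests(data: bytes) -> set:
    """Digests of the fixed-size blocks a block-level deduplicator would see."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

def traditional_volume(file_data: bytes) -> bytes:
    """Hypothetical archive layout: a small header precedes every chunk."""
    out = bytearray()
    for seq, off in enumerate(range(0, len(file_data), CHUNK)):
        chunk = file_data[off:off + CHUNK]
        header = b"HDR" + seq.to_bytes(4, "big") + len(chunk).to_bytes(4, "big")
        out += header + chunk
    return bytes(out)

def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of the blocks of a that also occur as blocks of b."""
    da, db = block_digests(a), block_digests(b)
    return len(da & db) / len(da)

original = sample_file(4 * CHUNK)
print("file vs. identical copy:", shared_fraction(original, original))
print("file vs. archive volume:", shared_fraction(original, traditional_volume(original)))
```

Because each interleaved header shifts all following file data by a few bytes, none of the fixed-size blocks in the volume line up with the blocks of the original file, even though the volume contains every byte of that file.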
Some storage systems now include deduplication technology that recognizes vendor-specific archive formats and separates or filters out the vendor's metadata and header data so that the actual file data can be deduplicated. One problem with such methods is that they must be adapted separately for each vendor's archive format; if a vendor makes the slightest change to its archive format, the deduplication will no longer work or, at best, will require new or additional vendor-specific changes to the deduplication algorithms.
The following are patent publications that describe technology related to the technical field of the invention:
1. U.S. Pat. No. 8,055,618 B2, Nov. 8, 2011, Anglin, Data Deduplication By Separating Data From Meta Data;
2. U.S. 2012/0084269 A1, Apr. 5, 2012, Vijayan et al., Content Aligned Block-Based Deduplication;
3. U.S. 2009/0182789 A1, Jul. 16, 2009, Sandorfi et al., Scalable De-Duplication Mechanism.
The invention described herein leverages the fact that filesystems and storage systems already have built-in deduplication by creating an archive volume format that can be easily and optimally deduplicated by any deduplication algorithm, without vendor-specific filters.