1. Field of the Invention
The present invention relates generally to data compression and archiving. More particularly, the present invention relates to a system and method for efficiently detecting and storing multiple files that contain similar or identical data. Still more particularly, the present invention is a method for detecting and storing full or partial duplicate file forks in an archiving system.
2. Discussion of Related Art Including Information Disclosed Under 37 CFR §§1.97, 1.98
Archiving software utilities such as STUFFIT®, PKZIP®, RAR® and similar products provide users with the ability to combine or package multiple files into a single archive for distribution, as well as to compress and encrypt the files, so that bandwidth costs and storage requirements are minimized when sending the resulting archive across a communication channel or when storing it in a storage medium. [STUFFIT is a registered trademark of Smith Micro Software, Inc., of Aliso Viejo, Calif.; PKZIP is a registered trademark of PKWare, Inc., of Milwaukee, Wis.; and RAR is a registered trademark of Eugene Roshal, an individual from Chelyabinsk, Russian Federation.]
Quite often the files added to an archive are exact duplicates of one another, or very nearly so. Current archiving software, such as the archiving software utilities mentioned above, compresses each data file as a whole, without detecting duplicate or partially duplicate files or file forks. It would be advantageous, therefore, to provide a method for detecting when files in a subset of the files being added to an archive are identical, or nearly identical, to one another. Then, instead of compressing and storing additional copies of the file data, the method could provide means for storing references to compressed data already present in the first archived copy of the file. Moreover, it is desirable that the detection and coding of the identical files be as time efficient as possible.
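By way of illustration only, the idea of storing a reference in place of a second compressed copy can be sketched as follows. The sketch is not the method of the present invention: it assumes whole-file comparison by SHA-256 digest, handles exact duplicates only (not partial duplicates or file forks), and uses hypothetical names (`build_archive`, `extract`, `entries`, `blobs`) chosen for this example.

```python
import hashlib
import zlib


def build_archive(files):
    """Compress each distinct file body once; duplicates become references.

    `files` maps a file name to its bytes. Returns (entries, blobs), where
    `blobs` is a list of compressed payloads and `entries` maps each file
    name to the index of its payload in `blobs`.
    Assumption: equal SHA-256 digests are treated as equal contents.
    """
    blobs = []    # compressed payloads, each stored exactly once
    seen = {}     # content digest -> index into blobs
    entries = {}  # file name -> index into blobs (the "reference")
    for name, data in files.items():
        digest = hashlib.sha256(data).digest()
        if digest not in seen:
            # First occurrence: compress and store the data.
            seen[digest] = len(blobs)
            blobs.append(zlib.compress(data))
        # Later occurrences skip compression and only record a reference.
        entries[name] = seen[digest]
    return entries, blobs


def extract(entries, blobs, name):
    """Recover the original bytes of `name` by following its reference."""
    return zlib.decompress(blobs[entries[name]])
```

In this sketch a duplicate costs one hash computation and one table entry, so both the storage and the compression work for the second copy are avoided.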
Current products use the concept of a “solid archive” or “block mode” to partially solve this problem. In this mode, input files are sorted by file attributes so that potentially identical files are ordered close to each other, and the sorted files are concatenated and compressed as a single large block. In some instances, compressors take advantage of the presence of nearby identical data, but this approach is highly dependent on the window size, that is, the amount of history available to the compression program. Multiple large identical files will not be able to reference the data in the matching files processed previously if the beginning of the second file is too remote from the beginning of the first file. Additionally, even if the identical files are within the given window size and the history of the first file can be used in compressing the next file that matches, this method does nothing to eliminate processing: the second file or fork data must still be compressed.
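The window-size dependence described above can be observed with a small experiment. The following sketch is illustrative only and assumes the DEFLATE compressor in Python's `zlib` module, whose history window is at most 32 KiB; it does not model any of the products named above. The helper names (`single_size`, `solid_size`, `duplicate_cost_ratio`) and the block sizes are chosen for this example. Random bytes are used so that any size reduction can come only from matching the earlier copy.

```python
import random
import zlib


def single_size(block: bytes) -> int:
    """Compressed size of one copy of `block`."""
    return len(zlib.compress(block, 9))


def solid_size(block: bytes, copies: int = 2) -> int:
    """Compressed size of `copies` identical blocks concatenated,
    as in a solid archive holding duplicate files."""
    return len(zlib.compress(block * copies, 9))


def duplicate_cost_ratio(block: bytes) -> float:
    """Size of two concatenated copies relative to one copy.
    Near 1.0: the duplicate was almost free. Near 2.0: it was stored again."""
    return solid_size(block) / single_size(block)


rng = random.Random(0)  # fixed seed so the experiment is repeatable
# 10,000 bytes: the second copy starts well inside the 32 KiB window.
small = bytes(rng.getrandbits(8) for _ in range(10_000))
# 100,000 bytes: the second copy starts far beyond the 32 KiB window.
large = bytes(rng.getrandbits(8) for _ in range(100_000))

print("small block ratio:", round(duplicate_cost_ratio(small), 2))
print("large block ratio:", round(duplicate_cost_ratio(large), 2))
```

Under these assumptions the small block's duplicate adds little to the output, while the large block's duplicate roughly doubles it. In both cases the compressor still has to process every byte of the second copy.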