1. Field of the Invention
The present invention relates to a duplicate file detection device and a duplicate file detection method used for detecting duplicated files, as well as to a computer-readable storage medium having recorded thereon a software program used for implementing the same.
2. Background Art
Well-known methods of detecting duplicated files include a method in which detection is accomplished by comparing file meta-information (time stamps, size, etc.) and a method in which detection is accomplished by comparing file contents (binary data, extracted text data, etc.).
Of these, the method based on comparing file meta-information may falsely detect files that are not actually duplicates as duplicated files. By contrast, the method in which duplicates are determined by comparing file contents has a lower likelihood of false detection. However, directly comparing the source data during detection may take considerable time.
For this reason, in order to reduce processing time in the method where detection is accomplished by comparing file contents, it has been proposed to compute hash values from the source data using algorithms such as MD5 (Message Digest Algorithm 5) and SHA (Secure Hash Algorithm), and to judge the presence or absence of duplicates by comparing these hash values.
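The hash-based comparison described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited document; the function names `file_digest` and `same_content_by_hash`, the default algorithm, and the chunk size are assumptions chosen for the example.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a complete file, read in fixed-size chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content_by_hash(path_a, path_b, algorithm="sha256"):
    """Judge two files to be duplicates when their full-file digests match."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

Note that every byte of both files is still read in this sketch, which is the loading cost discussed in the following paragraph.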
However, even when comparison is performed using hash values, considerable time is required for the file loading process alone if the hash values are calculated from complete files. For this reason, in the methods disclosed in JP 2007-201861A and WO2006/129654, time reduction is achieved by calculating hash values from only a portion of the data in the files.
For example, JP 2007-201861A discloses a method of detecting duplicate image files using hash values. In the method disclosed in JP 2007-201861A, image files of matching file sizes are identified as the first step in the determination of image file equivalency. Subsequently, if image files of matching sizes are present, partial hash values that summarize the respective beginning portions of said image files are computed and these values are compared. Then, if the partial hash values match, full hash values that summarize the files in their entirety are respectively computed for said image files and these values are compared.
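The three-stage determination described above (file size, then partial hash of the beginning portion, then full hash) can be sketched as follows. This is an illustration under stated assumptions and not the code of JP 2007-201861A: the names `hash_file` and `is_duplicate`, the use of SHA-256, and the value of `PREFIX_BYTES` are all hypothetical, since the source does not specify the hash algorithm or the size of the beginning portion.

```python
import hashlib
import os

# Assumed size of the "beginning portion"; the source does not give a value.
PREFIX_BYTES = 4096


def hash_file(path, limit=None, chunk_size=65536):
    """SHA-256 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(size)
            if not chunk:  # end of file
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def is_duplicate(path_a, path_b):
    """Staged check: cheap tests first, the full hash only when they all match."""
    # Stage 1: files of different sizes cannot be duplicates.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # Stage 2: compare partial hashes of the beginning portions.
    if hash_file(path_a, PREFIX_BYTES) != hash_file(path_b, PREFIX_BYTES):
        return False
    # Stage 3: compare full hashes of the entire files.
    return hash_file(path_a) == hash_file(path_b)
```

The ordering lets most non-duplicates be rejected before any complete file is read; the full hash is computed only for pairs that pass both earlier stages.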
In addition, WO2006/129654 discloses a method in which alterations made during file updates are detected using hash values. In the method disclosed in WO2006/129654, files are first segmented into blocks of a predetermined size and a hash value is calculated for each resultant block. Next, the hash values of some of the blocks are compared and file alterations are detected based on the result of the comparison.
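The block-wise approach described above can be sketched as follows. This is a simplified illustration, not the procedure of WO2006/129654: the names `block_hashes` and `altered_blocks`, the use of SHA-256, and the way the blocks to be compared are selected (a caller-supplied list of indices) are assumptions made for the example.

```python
import hashlib


def block_hashes(data: bytes, block_size: int):
    """Segment data into blocks of block_size bytes and hash each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def altered_blocks(old_hashes, new_hashes, indices):
    """Return those of the selected block indices whose hash values differ.

    A block that exists in only one of the two versions counts as altered.
    """
    changed = []
    for i in indices:
        old = old_hashes[i] if i < len(old_hashes) else None
        new = new_hashes[i] if i < len(new_hashes) else None
        if old != new:
            changed.append(i)
    return changed


old = b"A" * 8 + b"B" * 8 + b"C" * 8
new = b"A" * 8 + b"X" * 8 + b"C" * 8
old_h = block_hashes(old, 8)
new_h = block_hashes(new, 8)

all_checked = altered_blocks(old_h, new_h, [0, 1, 2])  # the middle block differs
some_checked = altered_blocks(old_h, new_h, [0, 2])    # the altered block is not examined
```

In this example, comparing all three blocks reports block 1 as altered, whereas comparing only blocks 0 and 2 reports nothing, because the alteration lies in a block that was not selected.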
However, the method of calculating hash values from the data of specific portions, as disclosed in these documents, is effective only when the files under examination are of a limited number of types, and the problem with this method is that it cannot be applied when files of various types are examined. The reason is that the locations modified when a file is updated differ from one file type to another.
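The dependence on file type can be illustrated with a small sketch. The two byte layouts below are hypothetical and do not correspond to any real file format; they merely stand for a type whose updates change the beginning of the file and a type whose updates change data after a fixed header. The function name `prefix_hash` and the 16-byte sampled portion are likewise assumptions.

```python
import hashlib


def prefix_hash(data: bytes, n: int = 16) -> str:
    """Hash only the first n bytes, standing in for a 'specific portion'."""
    return hashlib.sha256(data[:n]).hexdigest()


# Hypothetical type A: an update rewrites a revision field at the start.
a_old = b"REV=0001" + b"-" * 8 + b"payload unchanged"
a_new = b"REV=0002" + b"-" * 8 + b"payload unchanged"

# Hypothetical type B: the 16-byte header is fixed; an update changes later data.
b_old = b"FIXED-HEADER-001" + b"body version one"
b_new = b"FIXED-HEADER-001" + b"body version two"

# The update to type A falls inside the sampled portion and is detected.
type_a_update_detected = prefix_hash(a_old) != prefix_hash(a_new)

# The update to type B falls outside the sampled portion and is missed.
type_b_update_missed = prefix_hash(b_old) == prefix_hash(b_new)
```

Both flags evaluate to `True`: the same sampled portion that distinguishes the two versions of the first layout cannot distinguish the two versions of the second.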