<File Deduplication Storage Technique>
Deduplication is a known technique for storing files in a storage medium without storing duplicate copies. With deduplication, when a file storage apparatus that centrally stores files created by a plurality of computing terminals writes a file to a physical storage medium such as a hard disk drive, the apparatus determines whether the file duplicates a previously stored file. If it does, the apparatus does not store the file itself in the storage medium and instead stores only pointer information referring to the previously stored duplicate. In this way, the physical storage capacity required can be reduced.
Usually, in deduplication, duplication is determined either per file or per fixed-size physical data block allocated when the file is stored in a storage medium on a file system. In this operation, small digests of several dozen to several hundred bits, generated by a hash function used in digital authentication and the like, such as SHA-1 (Secure Hash Algorithm 1) or MD5 (Message Digest 5), are compared with each other to determine whether the files or data blocks consist of the same byte string.
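The digest-based determination described above can be sketched as follows. This is a minimal in-memory illustration, not the implementation of any particular storage apparatus: the class name `DedupStore` and its methods are hypothetical, SHA-1 is used only because the text names it, and duplication is decided per file.

```python
import hashlib


class DedupStore:
    """Toy in-memory store (hypothetical) keeping one copy per distinct byte string."""

    def __init__(self):
        self.blocks = {}    # digest -> bytes actually stored ("physical" copies)
        self.pointers = {}  # file name -> digest (the pointer information)

    def put(self, name, data):
        """Store data under name; return True only if new bytes were stored."""
        digest = hashlib.sha1(data).hexdigest()
        is_new = digest not in self.blocks
        if is_new:
            self.blocks[digest] = data
        # A duplicate file costs only this pointer entry.
        self.pointers[name] = digest
        return is_new

    def get(self, name):
        """Follow the pointer and return the stored bytes."""
        return self.blocks[self.pointers[name]]

    def physical_bytes(self):
        """Total bytes held as physical copies."""
        return sum(len(b) for b in self.blocks.values())


store = DedupStore()
store.put("terminal_a/report.txt", b"same contents")
store.put("terminal_b/report.txt", b"same contents")  # stored as a pointer only
```

The same structure applies per data block if each file is first split into fixed-size blocks and every block is passed through `put`. Because distinct byte strings can in principle share a digest, a production system might additionally compare the bytes when digests match; that check is omitted here.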
By using such a digest-based duplication determination method, the processing cost of duplication determination on a file storage apparatus can be reduced. In particular, in a storage process that requires high-speed I/O, executing duplication determination concurrently with the I/O process prevents a decrease in I/O performance.
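One way to picture duplication determination running alongside the I/O process is to update the digest on each chunk as it is written, so the data is read only once. The sketch below assumes file-like objects; the function name `copy_with_digest` and the chunk size are illustrative choices, not taken from any specific system.

```python
import hashlib
import io


def copy_with_digest(src, dst, chunk_size=64 * 1024):
    """Copy src to dst chunk by chunk and return the SHA-1 hex digest of the data."""
    h = hashlib.sha1()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)   # digest is advanced in the same pass as the write
        dst.write(chunk)
    return h.hexdigest()


payload = b"backup image data " * 10000
written = io.BytesIO()
digest = copy_with_digest(io.BytesIO(payload), written)
```

The returned digest can then be looked up against the digests of previously stored files without a second pass over the data.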
Such a deduplication-type storage system, which uses digest data as its duplication determination means, is used to reduce the file storage cost of a file storage apparatus that stores backup files or of a file storage apparatus that stores image files of the system portions of a plurality of virtual OSs (Operating Systems), particularly in a computing environment where many files or data blocks consisting of the same byte string exist.
<File Retrieval Technique based on Similarity of File Features>
An image retrieval method for extracting image files similar to an input image file from among the image files belonging to an image file group is known. In this method, the color information included in each image file of the group and the shape information depicted in each image are formalized and stored as file feature information, and this feature information is compared with the feature information of the input image file.
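As an illustration of comparing color feature information, the sketch below uses a normalized color histogram as the feature and histogram intersection as the similarity measure. Both are common choices assumed here for illustration; the text does not specify which features or comparison algorithm the known method uses, and shape information is not modeled. Pixels are represented as plain `(r, g, b)` tuples with values 0 to 255, and all function names are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized RGB histogram (list of floats) for (r, g, b) pixels."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..bins_per_channel-1.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    total = len(pixels)
    return [v / total for v in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def most_similar(query_pixels, stored_features):
    """Return the name in stored_features whose histogram best matches the query."""
    q = color_histogram(query_pixels)
    return max(stored_features,
               key=lambda name: histogram_intersection(q, stored_features[name]))


# Feature information is computed once per stored image and kept for later queries.
stored = {
    "sunset.png": color_histogram([(250, 10, 10)] * 10),
    "ocean.png": color_histogram([(10, 10, 250)] * 10),
}
best = most_similar([(240, 20, 20)] * 9 + [(10, 10, 250)], stored)
```

Storing the features ahead of time means a query compares short vectors rather than re-reading every image file.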
In this image retrieval method, the type or number of items of image file feature information used for comparison, or the algorithm used for comparison, is changed depending on the level of similarity between the input image file and an extracted image file. In this way, the accuracy or speed of extracting a desired image file can be improved.
Such an image retrieval technique has already been put to practical use as a system for retrieving image files on the Internet that are similar to a reference image. For example, the technique is in practical use as a Web service handling Internet content.
Likewise, for extracting a file in a format other than an image file, such as a text file or a moving image file, a system that extracts semantic information embedded in a file as feature information and extracts similar files based on that feature information can be established by a similar approach, although the information extracted as the feature information and the comparison method differ depending on the file format processed.
Patent Literature 1 discloses a file system that realizes a file backup operation in which the same files are efficiently accumulated during file processing and that allows users to use the processed files easily when necessary.
In addition, Patent Literature 2 discloses an electronic file storage method. According to this method, when identical electronic files are stored, they are stored as a single electronic file, thereby saving memory capacity, while the files still appear to be stored in the directory structure specified by each user.
In addition, Patent Literature 3 discloses a graphic retrieval device that can realize higher accuracy and efficiency in graphic retrieval processing by automatically setting the type of an optimum shape feature quantity that should be used when retrieving a query graphic image.
[PTL 1]
    Japanese Patent Kokai Publication No. JP2000-057159A
[PTL 2]
    Japanese Patent Kokai Publication No. JP2005-157768A
[PTL 3]
    Japanese Patent Kokai Publication No. JP2007-149018A