In recent years, as computer systems have become more powerful and less expensive, their use has spread across various types of business and for various purposes. Consequently, not only data that has conventionally been processed using paper media and the like but also music and moving pictures are digitized as multimedia data and saved in computer systems as data files. Further, the use of computer systems connected to one another via a network is also advancing rapidly. As a result, data can be managed and processed in a distributed manner, which enables large-capacity storage that is difficult to achieve with a single computer system, as well as high availability, high reliability, and high performance.
In addition, the number of data files held in computer systems has become enormous in recent years, so that a user often cannot tell where a required file is stored. To deal with this problem, full-text search services have come into use. A full-text search service is provided by analyzing the data of the files stored in the computer system and storing, in advance, a search index corresponding to the respective pieces of data in a search server. Broadly, the search service is provided by the following procedure. First, the user transmits a search query to the search server. The search query is a character string used to search for the file to be acquired. The search server searches the search index on the basis of the received search query and transmits a search result to the user. The user then accesses the target file on the basis of the search result. Meanwhile, the number of data files stored in computer systems is expected to keep growing. Consequently, it will become even more difficult for the user to keep track of where each data file is stored. The search service will therefore become increasingly important, and its use will spread further.
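The procedure above can be illustrated with a minimal sketch. It assumes a simple word-level inverted index held in memory; the function names `build_index` and `search`, the file paths, and the file contents are hypothetical and are not taken from any particular search server.

```python
from collections import defaultdict


def build_index(files):
    """Map each lowercase word to the set of file paths that contain it."""
    index = defaultdict(set)
    for path, text in files.items():
        for word in text.lower().split():
            index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths of files that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical contents standing in for the files held by the computer system.
FILES = {
    "/share/a/report.txt": "quarterly sales report",
    "/share/b/notes.txt": "meeting notes on sales",
    "/share/c/memo.txt": "travel memo",
}

# The index is created in advance, before any query arrives.
INDEX = build_index(FILES)
```

Here a call such as `search(INDEX, "sales report")` plays the role of the search query, and the returned list of paths plays the role of the search result that the user follows to reach the target file.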
To create a search index, the conventional search server accesses the computer system that stores the search target files, acquires the target files, and then creates the search index. In many cases, however, data files having the same contents are stored redundantly in that computer system. For example, where the computer system provides a file storage service to a plurality of users, the users may each create and individually store a copy of a file having the same contents. In such a case, the conventional search server acquires all files uniformly from the computer system as target files, including files containing the same data (so-called duplicate files), and creates a search index for them. This method increases both the processing load required to create the search index and the storage capacity required to hold it.
A technique using duplicate elimination has been disclosed as a way to eliminate the waste caused by creating a search index for such duplicate files. Specifically, in this technique the search server deletes duplicate data from the files acquired from the computer system and creates a search index for the remaining data files (Patent Literature 1). In a search server adopting this technique, eliminating the duplicate data reduces the number of target files for which a search index is created, which in turn reduces both the processing load required to create the search index and the storage capacity required to hold it.
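As a rough illustration of duplicate elimination on the search server side, the following sketch detects duplicates by comparing SHA-256 digests of the file contents after the files have been acquired. The use of a content hash, the mapping from each digest back to every path holding that content, and all names and sample data are assumptions made for this example; the text does not state how Patent Literature 1 detects duplicates.

```python
import hashlib
from collections import defaultdict


def deduplicate(acquired):
    """Group acquired files by content digest and keep one body per distinct content."""
    unique = {}                    # digest -> file content (a single copy)
    locations = defaultdict(list)  # digest -> every path that holds this content
    for path, data in acquired.items():
        digest = hashlib.sha256(data).hexdigest()
        unique.setdefault(digest, data)
        locations[digest].append(path)
    return unique, locations


def build_index(unique):
    """Index each distinct content once, keyed by its digest."""
    index = defaultdict(set)
    for digest, data in unique.items():
        for word in data.decode("utf-8").lower().split():
            index[word].add(digest)
    return index


def search(index, locations, word):
    """Return every path whose content contains the word, duplicates included."""
    paths = []
    for digest in index.get(word.lower(), set()):
        paths.extend(locations[digest])
    return sorted(paths)


# Hypothetical files already acquired from the computer system.
ACQUIRED = {
    "/home/alice/plan.txt": b"project plan draft",
    "/home/bob/plan_copy.txt": b"project plan draft",
    "/home/carol/todo.txt": b"buy milk",
}
```

In this sketch all three files still cross the network, but only two distinct contents are indexed, which corresponds to the reduction in index creation load and index size described above.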
In addition, a technique has been disclosed in which the search server, cooperating with the computer system, acquires in advance the duplicate condition of the search target files from the computer system, thereby avoids acquiring duplicate files redundantly, and creates a search index only for the target files excluding the duplicates (Patent Literature 2). In a search server adopting this technique, the duplicate files need not be acquired from the computer system, which reduces the number of files exchanged between the computer system and the search server. As a result, this mechanism reduces the network load in addition to providing the effects obtained in Patent Literature 1.
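The difference from the previous approach can be sketched as follows. The example assumes that the computer system can report groups of paths with identical contents, and that every file appears in exactly one group (a file without duplicates forms a group of one). The `FileServer` class, its methods, and the sample data are hypothetical stand-ins, not an interface described in Patent Literature 2.

```python
class FileServer:
    """Hypothetical stand-in for the computer system storing the search target files."""

    def __init__(self, files, groups):
        self._files = files
        self._groups = groups
        self.fetch_count = 0  # number of file bodies sent to the search server

    def duplicate_groups(self):
        """Report which paths hold identical contents (the duplicate condition)."""
        return [list(group) for group in self._groups]

    def fetch(self, path):
        """Return the content of one file and count the transfer."""
        self.fetch_count += 1
        return self._files[path]


def acquire_targets(server):
    """Fetch one representative file per duplicate group.

    Returns (representative path -> content, representative path -> all paths in its group).
    """
    bodies = {}
    members = {}
    for group in server.duplicate_groups():
        representative = group[0]
        bodies[representative] = server.fetch(representative)
        members[representative] = group
    return bodies, members


# Hypothetical stored files and the duplicate condition reported for them.
FILES = {
    "/home/alice/plan.txt": b"project plan draft",
    "/home/bob/plan_copy.txt": b"project plan draft",
    "/home/carol/todo.txt": b"buy milk",
}
GROUPS = [
    ["/home/alice/plan.txt", "/home/bob/plan_copy.txt"],
    ["/home/carol/todo.txt"],
]
```

With this sample data, only two of the three files are transferred, which is the source of the network load reduction; the `members` mapping lets a search result still list every location of a duplicated file.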