The present invention relates to a search method that uses an index file that is composed of a key file, including key character strings, and a position data file, including position data that correspond to key character strings, and to an apparatus used for the search method.
In order to quickly search for in-house document data or a home page on the Internet, conventionally, an index is prepared for character strings that appear in the documents to be searched, and an all-sentences search is then conducted, based on the index, to examine all available documents for the desired character string or document. The importance of such a search is acknowledged; however, the amount of data to be searched increases with time, and since the search index expands accordingly, the required hard disk space may grow until it is almost prohibitively large. Further, for an all-sentences search, every character string for which a search request may be submitted must be included, and a corresponding index must be prepared. Thus, again, the size of the index increases, and as it does, so too does the size of the obtained results, which makes it difficult for a user to find a desired document. Furthermore, depending on the relationship between the system and its available resources, searches may fail to return results for some character strings.
The following two conventional methods are well-known techniques employed to reduce the size of index files, and thus resolve the above problems. The first method employs a stop word system. According to this method, a list is prepared of words that seem to be most frequently used, such as THIS, A and THE in English, and these words are not included in an index file. The second method uses a compression scheme to reduce the size of the index information.
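The stop word method described above can be illustrated with a minimal sketch. The stop list, the function name `build_index`, and the representation of position data as (document number, word offset) pairs are assumptions made for illustration only; they are not taken from any particular conventional system.

```python
from collections import defaultdict

# Hypothetical stop list; a real system would need one list per language.
STOP_WORDS = {"this", "a", "the"}

def build_index(documents, stop_words=STOP_WORDS):
    """Map each non-stop word to a list of (doc_id, word_offset) positions."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for offset, word in enumerate(text.lower().split()):
            if word in stop_words:
                continue  # stop words are never written to the index
            index[word].append((doc_id, offset))
    return dict(index)
```

Because the stop list is fixed before indexing, a word that is frequent only in one particular collection is still indexed in full, and a stop word can never be searched for.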
Although the stop word method and the compression method can reduce the size of an index file, the following shortcomings make them less effective.
For the stop word method:
Although the amount of information stored for frequently used character strings can be reduced, once an index for FORUM has been prepared, information can not be deleted for a word, such as FORUM or APPENDED, that appears even more frequently, that is inherent to that index, and that is not searched for using the pertinent index.
Since stop words are inherent to a language, stop words that are unique to a pertinent language must be selected.
Although a stop word may be included in a string for which a search request is submitted, data relating to the stop word are always deleted, and the stop word can not be searched for.
If a limit is set on the number of words that a system can handle, and the index grows until it exceeds that limit, a search will be interrupted and the system will be adversely affected, even though a large number of search results might otherwise be obtained, because with the stop word method, index information that is not needed for a search can not be deleted.
For the compression method:
Although the size of the index can be reduced by data compression, the index information that is not actually required for a search can not be deleted.
A technique for eliminating inefficient searches and useless search results is disclosed in Japanese Unexamined Patent Publication No. Hei 10-171692. However, this technique involves the deletion, from a search index, of very common words that are located at the ends of the index terms. Its approach therefore does not address the reduction of the size of the position data file, which is the problem that arises when an index for an all-sentences search is prepared.
To resolve the above shortcomings, it is one object of the present invention to provide a search method using an index file, whereby the size of an index file can be considerably reduced, and an apparatus that is to be used for the search method.
A search method according to the present invention uses an index file consisting of a key file, which includes key character strings, and a position data file, which includes position data corresponding to the key character strings. According to this search method, a position data delete flag is correlated with a specific key character string, and position data that correspond to the specific key character string are deleted from the position data file. In one preferred aspect, the position data are deleted when the size of the position data corresponding to the specific key character string reaches a specific ratio relative to the size of the position data file. Furthermore, the position data are also deleted when the size of the position data corresponding to the specific key character string reaches a specific value.
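The two deletion criteria just described, a ratio to the position data file size and an absolute size, might be sketched as follows. The names `should_delete` and `apply_delete_flags`, the threshold parameters `max_ratio` and `max_size`, and the choice to measure size as a count of position entries are all assumptions made for illustration; the text does not fix the unit of size or the threshold values.

```python
def should_delete(position_size, file_size, max_ratio=None, max_size=None):
    """Return True when a key's position data meets either deletion criterion."""
    if max_ratio is not None and file_size > 0:
        if position_size / file_size >= max_ratio:
            return True  # ratio criterion
    if max_size is not None and position_size >= max_size:
        return True      # absolute size criterion
    return False

def apply_delete_flags(key_file, position_file, max_ratio=None, max_size=None):
    """Set delete flags and drop the matching position data.

    key_file:      dict key -> {"delete_flag": bool}
    position_file: dict key -> list of positions
    """
    # The file size is taken once, before any position data is removed.
    total = sum(len(p) for p in position_file.values())
    for key, entry in key_file.items():
        size = len(position_file.get(key, []))
        if should_delete(size, total, max_ratio, max_size):
            entry["delete_flag"] = True   # the key stays in the key file
            position_file.pop(key, None)  # only its position data is removed
    return key_file, position_file
```

In this sketch the key itself remains in the key file with its flag set, so a later search can tell that position data was deleted for that key rather than that the key never occurred.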
An apparatus used for the search method of the present invention comprises: a new difference index preparation unit for preparing a new difference index file using a newly registered document; an index merge unit for merging a conventional index file with the new difference index file prepared by the new difference index preparation unit, for determining whether the above described position data file is to be deleted, and for preparing a new index file; and a search unit for beginning a search based on the new index file generated by the index merge unit.
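The three units of the apparatus can be pictured as three functions. This is only a sketch under stated assumptions: the function names, the use of in-memory dictionaries in place of files, the single size threshold `max_size`, and the convention that a search on a flagged key returns `None` are all hypothetical and are not specified in the text.

```python
def prepare_difference_index(new_documents, first_doc_id):
    """New difference index preparation unit: index only newly registered documents."""
    diff = {}
    for i, text in enumerate(new_documents):
        for offset, word in enumerate(text.lower().split()):
            diff.setdefault(word, []).append((first_doc_id + i, offset))
    return diff

def merge_indexes(old_index, old_flags, diff_index, max_size):
    """Index merge unit: merge, decide on deletion, and produce the new index."""
    new_index, new_flags = {}, set(old_flags)
    for key in set(old_index) | set(diff_index):
        if key in new_flags:
            continue  # flag already set: no position data is carried over
        positions = old_index.get(key, []) + diff_index.get(key, [])
        if len(positions) >= max_size:
            new_flags.add(key)  # position data for this key is deleted
        else:
            new_index[key] = positions
    return new_index, new_flags

def search(index, flags, word):
    """Search unit: look a single word up in the merged index."""
    word = word.lower()
    if word in flags:
        return None  # assumption: a flagged key can not be searched
    return index.get(word, [])
```

Keeping the deletion decision inside the merge step means that the full index never has to be rebuilt; only the difference index for the new documents is prepared from scratch.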
According to the present invention, the position data delete flag is correlated with a specific key character string, and the position data that correspond to the specific key character string are deleted from the position data file. Therefore, while a request for an all-sentences search is satisfied, the position data for a character string that can not actually be employed for a search can be deleted, so that a considerable reduction in the size of an index file can be realized.
In another preferred aspect, each entry in the key file consists of a key character string, the location of the key character string in the position data file, the size of the position data, and the position data delete flag. A group of key character strings is specified in advance, for which position data are not to be deleted, even though it has been determined that the position data are to be deleted. When the position data delete flag corresponding to a specific key character string is set, position data for the key character string are not added to the position data file for the index file. When the position data delete flag corresponding to a specific key character string is not set, position data for the key character string are added to the position data file for the index file. A search may be performed by a method that uses a search key character string consisting of one word, a method that uses one search key character string when an index is prepared using the N gram method, or a method that uses a search key character string consisting of a plurality of words. All of these methods can appropriately carry out the present invention.
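The key file structure and the flag handling described above might look like the following sketch. The class name `KeyEntry`, the use of a flat list as the position data file, the value -1 for the location of deleted data, and the `protected` parameter standing in for the group of key character strings specified in advance are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class KeyEntry:
    key: str           # key character string
    location: int      # offset of the key's first position in the position data file
    size: int          # number of positions stored for the key
    delete_flag: bool  # True when the position data has been deleted

def build_index_file(postings, flagged, protected=frozenset()):
    """Build (key_file, position_file) from postings: dict key -> list of positions."""
    key_file, position_file = {}, []
    for key in sorted(postings):
        # Protected keys keep their position data even if marked for deletion.
        if key in flagged and key not in protected:
            key_file[key] = KeyEntry(key, -1, 0, True)  # nothing is added
            continue
        key_file[key] = KeyEntry(key, len(position_file), len(postings[key]), False)
        position_file.extend(postings[key])
    return key_file, position_file

def lookup(key_file, position_file, key):
    """Return the stored positions for a key, or [] if absent or deleted."""
    entry = key_file.get(key)
    if entry is None or entry.delete_flag:
        return []
    return position_file[entry.location:entry.location + entry.size]
```

A search with a plurality of words, or with N gram keys, could call `lookup` once per key and combine the resulting position lists; that combination step is omitted here.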