1. Field of the Invention
The present invention relates to a document search technique that combines two search methods, an index-type search and a scan-type search, so that each method compensates for the disadvantages of the other.
2. Description of the Related Art
There are two methods for searching documents. The first is a scan-type search, in which the documents to be searched are checked one by one to retrieve those satisfying a search query. Specifically, each document is read from the beginning while checking whether each search keyword appears. Known scan algorithms include the AC method (Aho, A. V., Corasick, M. J., “Efficient string matching: an aid to bibliographic search,” Communications of the ACM, 18(6), pp. 333-340, 1975) and the CW method, which performs skip reading (Gonzalo Navarro, Mathieu Raffinot, “Flexible Pattern Matching in Strings,” Cambridge University Press, 2002). The second is an index-type search, in which a list (index) of the documents containing each search term is constructed in advance and, at search time, the index is consulted to obtain the set of documents forming the search result. For details of the index-type search, including the method of constructing the index, see Baeza-Yates, R., Ribeiro-Neto, B., “Modern Information Retrieval,” Addison-Wesley, 1999.
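The contrast between the two methods can be illustrated with a minimal sketch. The corpus, the function names, and the word-level tokenization below are assumptions made for illustration only; the plain substring test is a naive stand-in for scan algorithms such as the AC or CW method, not an implementation of either.

```python
from collections import defaultdict

# Hypothetical toy corpus; document IDs and texts are illustrative only.
DOCS = {
    1: "index search is fast",
    2: "scan search reads each document",
    3: "an index must be updated",
}

def scan_search(docs, keyword):
    """Scan-type search: read every document and test whether the keyword appears."""
    return sorted(doc_id for doc_id, text in docs.items() if keyword in text)

def build_index(docs):
    """Build, in advance, a map from each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def index_search(index, keyword):
    """Index-type search: a single index lookup; no document text is read."""
    return sorted(index.get(keyword, set()))

INDEX = build_index(DOCS)
print(scan_search(DOCS, "index"))    # documents found by reading all texts
print(index_search(INDEX, "index"))  # the same documents found via the index
```

In this sketch the scan touches every document on every query, whereas the index lookup touches only the precomputed list, at the cost of building and storing that list beforehand.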
The two methods described above each have advantages and disadvantages. The scan-type search is slow because the documents are checked one by one. The index-type search, on the other hand, is fast because only the index constructed in advance needs to be checked. However, the index must be maintained in addition to the document data. Depending on the information included in the index, the index size may be several times as large as the total document size. Moreover, every time a document to be searched is added, deleted or modified, the index must also be updated to reflect the latest state. The scan-type search, in contrast, requires no secondary data such as an index, and can be performed as long as the original document data exists.
Moreover, even though the index-type search is fast, its search speed decreases as the number of search keywords increases. This tendency is especially pronounced when the index is compressed, because the compressed index must be decompressed. Under some circumstances, the index-type search may become even slower than the scan-type search. Generally, the search speed of the index-type search is inversely proportional to the total number of hit documents for all the search terms. The search speed of the scan-type search, on the other hand, does not depend largely on the search query.
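The dependence of the index-type search on the number of hit documents can be sketched with a simple cost model. The posting lists and the function name below are hypothetical, and the count of postings read is only a rough proxy for search time, not a measurement.

```python
# Hypothetical posting lists: each keyword maps to the IDs of its hit documents.
POSTINGS = {
    "search": [1, 2, 3, 4],
    "index": [1, 3],
    "scan": [2],
}

def and_query(postings, keywords):
    """Intersect the posting lists of all keywords.

    Returns the matching document IDs together with the total number of
    postings read, which grows with the hit count of every keyword.
    """
    postings_read = 0
    result = None
    for keyword in keywords:
        hits = postings.get(keyword, [])
        postings_read += len(hits)
        result = set(hits) if result is None else result & set(hits)
    return sorted(result or []), postings_read

print(and_query(POSTINGS, ["index"]))            # one short list is read
print(and_query(POSTINGS, ["search", "index"]))  # both lists must be read
```

Under this model, adding a keyword never reduces the work: every additional posting list has to be read (and, for a compressed index, decompressed) before it can be intersected.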
A hybrid-type search that combines the two methods described above is also conceivable. The conventional search using a character component table can be categorized as such a hybrid-type search. In this search, an index-type search is first performed using a simple, small index. Subsequently, a scan-type search is performed on the set of documents returned by the index-type search. The index-type search here only needs to function as a screen: it does not have to achieve an accuracy of 100%, but it must return a result in which no matching document is overlooked. Indexes usable for this purpose include an index of character 2-grams in which neither a character component table nor positional information is stored, and other kinds of indexes. When the index-type search is performed as a screening step prior to the scan-type search, the scan-type search does not need to check all documents. Thus, the disadvantage of slow speed in the scan-type search can be overcome. It should be noted, however, that an index is still required even though its size is small.
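The screening behavior described above can be sketched as follows, assuming a character 2-gram index without positional information. All names and the toy corpus are hypothetical, and the code is an illustration of the general idea rather than the conventional system itself.

```python
from collections import defaultdict

def bigrams(text):
    """Return the set of character 2-grams occurring in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_bigram_index(docs):
    """Map each 2-gram to the set of documents containing it (no positions stored)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index

def hybrid_search(docs, index, keyword):
    """Screen with the 2-gram index, then scan only the surviving candidates.

    Returns (hits, candidates) so the effect of the screen is visible.
    """
    candidates = set(docs)
    # Screening: a document containing the keyword must contain all of its 2-grams,
    # so no matching document is dropped here. Keywords shorter than two
    # characters have no 2-grams, and every document remains a candidate.
    for gram in bigrams(keyword):
        candidates &= index.get(gram, set())
    # Verification: the scan removes false positives left by the screen.
    hits = sorted(d for d in candidates if keyword in docs[d])
    return hits, sorted(candidates)

DOCS = {1: "abcd", 2: "bcab", 3: "xyz"}  # hypothetical corpus
BIGRAM_INDEX = build_bigram_index(DOCS)
hits, candidates = hybrid_search(DOCS, BIGRAM_INDEX, "abc")
print(candidates)  # documents passing the screen, including a false positive
print(hits)        # documents confirmed by the scan
```

Here document 2 contains both "ab" and "bc" but not "abc", so it passes the screen and is rejected by the scan, while document 3 is never scanned at all. This shows why the screen may be imprecise as long as it overlooks no matching document.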