1. Field of the Invention
The present invention relates to a technique for generating Boolean search formulas for searching documents.
2. Background Art
There are two main types of document search methods. In the first method, a Boolean formula combining conditions on the presence of keywords (arbitrary character strings) is inputted, and only documents for which the Boolean formula evaluates to “true” are outputted as search results. This method is generally called a full text search. A Boolean formula combining conditions on the presence of keywords will be called a Boolean search formula. In the second method, a text is inputted, and documents similar to the text are ranked in order of similarity and outputted as search results. This method is generally called a similarity search.
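As an illustration of the first method, the following is a minimal sketch of a full text search, assuming that a Boolean search formula can be modeled as a predicate over document text and that keyword presence is a plain substring test. The helper names (kw, and_, or_, not_, full_text_search) and the sample documents are hypothetical and are not taken from any cited technique.

```python
from typing import Callable, Dict, List

# Assumption: a Boolean search formula is modeled as a predicate on document text.
Formula = Callable[[str], bool]


def kw(keyword: str) -> Formula:
    """Presence of a keyword (an arbitrary character string) in the text."""
    return lambda text: keyword in text


def and_(*formulas: Formula) -> Formula:
    """Logical product: true only if every sub-formula is true."""
    return lambda text: all(f(text) for f in formulas)


def or_(*formulas: Formula) -> Formula:
    """Logical sum: true if at least one sub-formula is true."""
    return lambda text: any(f(text) for f in formulas)


def not_(formula: Formula) -> Formula:
    """Negation of a sub-formula."""
    return lambda text: not formula(text)


def full_text_search(formula: Formula, documents: Dict[str, str]) -> List[str]:
    """Return the ids of documents for which the formula evaluates to true."""
    return [doc_id for doc_id, text in documents.items() if formula(text)]


# Hypothetical sample documents.
DOCS = {
    "d1": "boolean search formula for patent documents",
    "d2": "similarity search ranks web pages",
    "d3": "patent search using similarity ranking",
}

# search AND (patent OR boolean) AND NOT web
FORMULA = and_(kw("search"), or_(kw("patent"), kw("boolean")), not_(kw("web")))
RESULT = full_text_search(FORMULA, DOCS)
print(RESULT)
```

In this sketch the result is an unranked set: a document is either matched by the formula or not, which is what makes the search criterion explicit.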
In the similarity search, a topic to be searched can be described directly as a text, and even a person who is not an expert in document search can easily use the similarity search. The search results are displayed with ranks, and the user can preferentially examine the higher-ranked documents, which seem important. On the other hand, it is difficult to check the reason why particular documents are ranked higher.
Factors of the similarity in the similarity search include the overlap of word distributions between the inputted text and the documents obtained as search results, and the length of those documents. Because the similarity is computed from a combination of such factors, it is difficult to express the basis of the similarity simply in natural language. In addition, the mechanism of the similarity search is hidden, and the basis of the similarity is often undisclosed.
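The two factors named above can be illustrated with a minimal sketch, assuming cosine similarity over raw term-frequency vectors as the scoring function. This choice is an assumption made only for illustration; the text does not specify how any particular similarity search computes its score. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(query_text: str, doc_text: str) -> float:
    """Score word-distribution overlap, normalized by the vector lengths."""
    q = Counter(query_text.lower().split())
    d = Counter(doc_text.lower().split())
    # Overlap of word distributions: only words shared by both texts contribute.
    overlap = sum(q[w] * d[w] for w in q.keys() & d.keys())
    # Length factor: a longer document has a larger norm, which lowers its score.
    norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(
        sum(c * c for c in d.values())
    )
    return overlap / norm if norm else 0.0


def similarity_search(query_text: str, documents: Dict[str, str]) -> List[str]:
    """Return document ids ranked by descending similarity (ties by id)."""
    scored = [
        (cosine_similarity(query_text, text), doc_id)
        for doc_id, text in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical sample documents.
DOCS = {
    "long": "patent search and many other unrelated words appear here",
    "none": "cooking recipes",
    "short": "patent search",
}
RANKING = similarity_search("patent search", DOCS)
print(RANKING)
```

Even in this small sketch, the rank of a document depends on a numeric mixture of shared words and document length, so the reason for a given rank cannot be stated as a simple keyword condition.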
If the basis on which the documents are obtained as the search results is not known, the user cannot judge how much of the search results needs to be examined. Nor can the user check whether the desired topic has been searched completely.
The similarity search is suitable for a situation in which it is sufficient for even one desired document to appear among the few top-ranked documents, as in the search of Web pages. However, the similarity search is rather inefficient in a situation in which a topic needs to be comprehensively examined, as in the search of patent documents and academic papers.
Meanwhile, in the full text search, a topic to be searched needs to be expressed by search formulas formed by Boolean expressions of keywords, and know-how and expertise in constructing the Boolean search formulas are required. However, since the documents are searched based on the Boolean search formulas, the search criterion is clear to the user. If the user examines all retrieved documents, it can be stated that all documents on the topic expressed by the Boolean search formula have been examined.
To alleviate the problem of the similarity search, several methods have been proposed. In JP Patent Publication (Kokai) No. 10-74210A (1998), distinctive words are extracted from the top several dozen documents retrieved by the similarity search, and the words are outputted with the search results. An overview of the search results can be understood by viewing the set of extracted distinctive words.
In “Scatter/Gather: a cluster-based approach to browsing large document collections”, Cutting, D., Karger, D., Pedersen, J., Tukey, J. pp. 318-329, ACM SIGIR'92, 1992, the search results are displayed by clustering the search results into several groups based on the similarity between the documents. As a result of the clustering, the topics included in the search results are automatically classified. Therefore, features of the search results can be more easily understood compared to the method of JP Patent Publication (Kokai) No. 10-74210A (1998).
In “Supporting the Query Modification by Making Keyword Formula of an Outline of Retrieval Result”, Yasunori Matsuike, Koji Zettsu, Satoshi Oyama, Katsumi Tanaka, Proceedings of Data Engineering Workshop (DEWS 2005), 1Ci9, 2005, Boolean formulas of keywords that serve as a basis of the search results are generated from the search results. In that technique, keywords that cover the search results as widely as possible are found. If the coverage of the found keywords is not sufficient, keywords that cover the remaining document set are found again. This is repeated until keywords that sufficiently cover the search results are found, and the keywords are connected by logical products (AND) and a logical sum (OR) to generate a Boolean search formula. The generated Boolean search formula is presented to the user as a tree-structured graph.
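The repeated covering procedure described above can be sketched as a greedy loop. This is a simplified illustration and not the procedure of the cited paper: it assumes that each round selects a single keyword, so each product term contains only one keyword and the formula reduces to a logical sum, and that ties are broken alphabetically. The function name greedy_cover_keywords, the target_coverage parameter, and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover_keywords(
    result_docs: Dict[str, Set[str]], target_coverage: float = 1.0
) -> List[str]:
    """Pick keywords until the chosen ones cover enough of the search results.

    result_docs maps each result document id to the set of words it contains.
    """
    remaining = set(result_docs)  # result documents not yet covered
    total = len(result_docs)
    chosen: List[str] = []
    while remaining and (total - len(remaining)) / total < target_coverage:
        # Candidate keywords are the words of the still-uncovered documents.
        candidates = sorted({w for d in remaining for w in result_docs[d]})
        if not candidates:
            break  # remaining documents contain no words at all
        # Choose the keyword covering the most uncovered documents
        # (the sort above makes ties resolve alphabetically).
        best = max(
            candidates,
            key=lambda w: sum(w in result_docs[d] for d in remaining),
        )
        chosen.append(best)
        remaining -= {d for d in remaining if best in result_docs[d]}
    return chosen


# Hypothetical similarity search results, each reduced to its word set.
RESULTS = {
    "r1": {"boolean", "search", "formula"},
    "r2": {"boolean", "keyword"},
    "r3": {"similarity", "search"},
    "r4": {"ranking", "similarity"},
}
KEYWORDS = greedy_cover_keywords(RESULTS)
FORMULA_TEXT = " OR ".join(KEYWORDS)
print(FORMULA_TEXT)
```

Note that the selection criterion in this sketch looks only at the result documents; it never checks how many documents outside the results the chosen keywords would also match.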
In the techniques described in JP Patent Publication (Kokai) No. 10-74210A (1998) and “Scatter/Gather: a cluster-based approach to browsing large document collections”, Cutting, D., Karger, D., Pedersen, J., Tukey, J. pp. 318-329, ACM SIGIR'92, 1992, distinctive words included in the results of the similarity search are extracted, and the words can be presented as the basis of the similarity search. However, the distinctive words do not always indicate the accurate basis of the similarity search.
In the technique described in “Supporting the Query Modification by Making Keyword Formula of an Outline of Retrieval Result”, Yasunori Matsuike, Koji Zettsu, Satoshi Oyama, Katsumi Tanaka, Proceedings of Data Engineering Workshop (DEWS 2005), 1Ci9, 2005, only high coverage of the results of the similarity search serves as the evaluation standard in extracting the words. Therefore, the extracted words may hit a large number of documents (noise) other than the results of the similarity search. Such words are not appropriate as the basis of the similarity search.
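The noise problem can be made concrete with a minimal sketch that measures, for a single keyword, both its coverage of the similarity search results and the number of other documents it hits. The function name coverage_and_noise, the corpus, and the result set are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple


def coverage_and_noise(
    keyword: str, corpus: Dict[str, Set[str]], result_ids: Set[str]
) -> Tuple[float, int]:
    """Return (coverage of the results, number of non-result documents hit)."""
    hits = {doc_id for doc_id, words in corpus.items() if keyword in words}
    coverage = len(hits & result_ids) / len(result_ids)
    noise = len(hits - result_ids)  # documents hit outside the results
    return coverage, noise


# Hypothetical corpus: each document reduced to its word set.
CORPUS = {
    "d1": {"boolean", "search", "formula"},
    "d2": {"boolean", "search", "keyword"},
    "d3": {"web", "search", "engine"},
    "d4": {"image", "search"},
    "d5": {"search", "history"},
    "d6": {"search", "advertising"},
}
# Hypothetical results of a similarity search over the corpus.
RESULT_IDS = {"d1", "d2"}

BROAD = coverage_and_noise("search", CORPUS, RESULT_IDS)
SPECIFIC = coverage_and_noise("boolean", CORPUS, RESULT_IDS)
print(BROAD, SPECIFIC)
```

Both keywords cover all of the results, so a coverage-only standard cannot distinguish them, although one of them also hits four documents outside the results.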
The present invention has been made to solve these problems, and an object of the present invention is to provide a technique for accurately and efficiently generating Boolean search formulas that serve as a basis for the similarity search.