1. Field of the Invention
The present invention relates to a system and method for searching a large volume of documents stored in a computer. More particularly, the invention relates to a system and method for searching, at high speed, for stored documents that are similar to a document having one or more particular string characteristics, while allowing a desired degree of ambiguity in the search of the stored documents.
2. Background Art
In known search methods for locating a document of interest from a set of electronic document texts, it is common practice to input a search expression in which character strings indicative of the subject matter of interest are connected by logical operators such as AND, OR, or NOT. An example of a search expression follows.
(computer OR personal computer) AND search
This method leaves the entire task of converting the subject matter of interest into the search expression to the effort and discretion of the user. It is often troublesome to formulate suitable search character strings and to input a suitable search expression. In addition, the quality of the search results depends greatly on the skill of the user in constructing the search expression.
A method is also known in which each search character string of the search expression is weighted according to the importance of the string. The result of such a search is output sequentially, beginning with the documents that contain the search character strings of higher weight. An example of a weighted search expression follows.
Computer, 60
personal computer, 60
search, 100
Like the first conventional method, this weighting method also leaves the entire task of converting the subject matter of interest into the search expression to the user. The above methods further require the user to fully understand the contents of the documents to be searched and the terms used in the search expression. Therefore, if a user only vaguely thinks "I want to read sentences with such a sense", it is difficult to construct a meaningful search expression from this general notion.
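The two conventional methods described above can be illustrated with a minimal Python sketch. The toy corpus, the function names, and the use of plain substring tests are assumptions made for illustration only. In particular, summing the weights of the matching strings is one plausible reading of "higher weight" and is not stated in the source.

```python
# Hypothetical toy corpus; document names and texts are illustrative only.
DOCS = {
    "doc1": "a personal computer can search files quickly",
    "doc2": "search engines index the web",
    "doc3": "the computer was switched off",
}

# Weights taken from the weighted search expression in the text.
WEIGHTS = {"computer": 60, "personal computer": 60, "search": 100}


def boolean_match(text):
    """Evaluate (computer OR personal computer) AND search by substring tests."""
    return ("computer" in text or "personal computer" in text) and "search" in text


def weighted_score(text):
    """Assumed scoring: sum the weights of the search strings found in the text."""
    return sum(weight for string, weight in WEIGHTS.items() if string in text)


# Boolean search: keep only the documents that satisfy the expression.
boolean_hits = [name for name, text in DOCS.items() if boolean_match(text)]

# Weighted search: output the documents in order of decreasing score.
ranked = sorted(DOCS, key=lambda name: weighted_score(DOCS[name]), reverse=True)
```

In both cases the user must still choose the strings, the operators, and the weights by hand, which is the burden the preceding paragraphs describe.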
On the other hand, there is a technology, described in Published Unexamined Patent Application (PUPA) No. 6-124305, that accepts natural language expressions as input for a search, extracts search keys, and performs the search based on the extracted search keys. Such a search requires a search key dictionary. In a method performing extraction based on vocabulary information (word dictionary) such as the search key dictionary or grammatical connection rules, the word dictionary or the grammatical connection rules are generally non-dynamic. Therefore, a new word such as "TOYSARUS" ("TOYSARUS" is a trademark of Toysarus Inc.) or a phrase such as "footprint of dinosaur" cannot be extracted as a unique character string without great difficulty. In addition, the concepts or perceived "features" of the contents of a document may change over time. For example, in the past, white-collar workers always wore suits when they came to the office. More recently, there are many cases where white-collar workers may not wear suits because many companies have adopted a "casual day" system. To keep pace with such changes, it is necessary to continuously update the word dictionary to include new words, new trademarks, trade names, product names, and the like. However, such updating requires enormous labor, and the region for storing the word dictionary or the like grows as new words are added. This, in turn, would adversely affect search speed.
In addition, PUPA 6-223114 describes a method for processing character strings based on the frequency of the appearance of a word. However, such technology is used to determine the type of a document, or to extract keywords for a search by searching a registered word list (word dictionary) to determine whether words in the document exist in the word dictionary. Unlike these conventional search methods, the present invention investigates the frequency of the appearance of a word or character string in an input document and a comparison document, and utilizes the frequency of appearance in both of them. (In the word list-based method, one word appears only once in the dictionary, so it is meaningless to investigate the frequency of appearance in the list.) Accordingly, in PUPA 6-223114, a stationary word dictionary is still necessary, so the above-mentioned problem remains that a new word or phrase cannot be easily extracted as a unique character string. In addition, since the known technology detects keywords based on stationary word dictionaries by category, if there are multiple documents describing "methods for searching documents", for example, there is a high possibility that the keywords being extracted are very similar ones such as "search", "character string", and "high speed". Therefore, it is difficult to extract keywords that differentiate each document, which makes the document search inefficient.
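The idea of using frequency of appearance in both an input document and a comparison document, without any word dictionary, can be sketched as follows. This is not the method of the invention, which this section does not specify. It is a minimal illustration that assumes fixed-length character substrings and a simple difference of counts as the score, and the function names and parameters are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n=4):
    """Count every character substring of length n in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def distinctive_strings(input_doc, comparison_doc, n=4, top=3):
    """Return substrings that occur more often in input_doc than in comparison_doc."""
    input_counts = ngram_counts(input_doc, n)
    comparison_counts = ngram_counts(comparison_doc, n)
    # Assumed scoring: count in the input minus count in the comparison.
    # A Counter returns 0 for substrings absent from the comparison document.
    scores = {s: c - comparison_counts[s] for s, c in input_counts.items()}
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))[:top]


result = distinctive_strings("dinosaur dinosaur search", "search search search")
```

Because the counts are taken directly from the two texts, a new word needs no dictionary entry to be picked up, and strings common to both documents, such as those in "search", score low.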
An object of the present invention is to build a search system which enables the user to input a complicated search concept with a very simple operation, such as the clicking of a button, without requiring the user to think up or to input a search expression.
Another object of the present invention is to provide a search method which parallels a human thought process to enable a user to easily input complicated and abstract concepts for a search.
Another object of the present invention is to provide a search method which reduces the labor required of the user to think up or to input a search character string or a search expression, which can be easily used by everybody, and which enables a user to perform a search even if the user does not accurately understand a keyword to be used for the search.
Another object of the invention is to provide a search method which relatively and dynamically extracts a unique character string without using vocabulary information or grammar information.
Another object of the present invention is to provide a search method which requires less storage capacity and extracts a unique character string at a high speed.