A search engine is available as a system to search documents stored in a database connected with a network such as the Internet. Some search engines have a full-text search function to search for a specific character string in a plurality of documents.
Full-text search engines equipped with the full-text search function are classified into a sequential search type and an index type. The sequential search type search engine scans the contents of a plurality of documents one by one to search for character strings. When an enormous number of documents have to be searched, however, the sequential search takes a long time. The index type search engine therefore creates beforehand an index with a table structure made up of a character string, a location of the document, an update time, an occurrence frequency and the like, and accesses the index at the time of the search, thus enabling a fast search.
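The difference between the two types can be illustrated with a minimal sketch. The documents, the locations such as `mail/1.txt`, the update times, and the names `IndexEntry`, `sequential_search`, `build_index` and `index_search` are all hypothetical and are introduced only for illustration; they do not describe any particular search engine.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    location: str   # where the document is kept (hypothetical path)
    updated: str    # update time recorded when the index was created
    frequency: int  # occurrences of the character string in the document

# Hypothetical documents: location -> (update time, text)
DOCUMENTS = {
    "mail/1.txt": ("2009-01-05", "PHP code review for the PHP module"),
    "mail/2.txt": ("2009-01-06", "schedule for next week"),
    "mail/3.txt": ("2009-01-07", "new PHP release"),
}

def sequential_search(documents, term):
    """Sequential search type: scan the contents of each document at query time."""
    return [loc for loc, (_, text) in documents.items() if term in text.split()]

def build_index(documents):
    """Index type: create the table once, before any query arrives."""
    index = {}
    for loc, (updated, text) in documents.items():
        words = text.split()
        for word in set(words):
            entry = IndexEntry(loc, updated, words.count(word))
            index.setdefault(word, []).append(entry)
    return index

def index_search(index, term):
    """Index type: answer a query with a single lookup in the prebuilt table."""
    return [entry.location for entry in index.get(term, [])]

INDEX = build_index(DOCUMENTS)
```

Both functions return the same documents for a given term; the sequential version reads every document on every query, whereas the index version pays that cost once when the table is created.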
The index used for the index type search engine has various formats, typically including an inverted index with variable-length records, each made up of a word and the document file IDs of the documents including the word.
Referring now to FIGS. 1 and 2, three documents, an inverted index corresponding thereto, and a data structure to keep collected documents are exemplified in the following. The documents illustrated in FIGS. 1A to 1C have document file IDs of 1 to 3, respectively, and are all e-mail documents. FIG. 2A illustrates an inverted index made up of a word serving as a key and the IDs of the documents including the word, where the documents including the word “PHP”, a Japanese word meaning “Suzuki” in English, and a Japanese word meaning “code” in English are associated with those words. FIG. 2B illustrates an entry example of a data structure to store the collected documents, where a word serving as a key and the contents of a document corresponding to the word are associated with each other. In FIG. 2B, the words are listed in the left column, and the document contents corresponding to the selected words are shown in the right column.
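The two data structures can be sketched as follows. The e-mail bodies below are hypothetical English stand-ins and are not the contents of FIGS. 1A to 1C; the names `MAILS`, `KEYS`, `build_inverted_index` and `build_store` are likewise assumptions made for this sketch.

```python
# Hypothetical e-mail bodies keyed by document file ID (not the figure contents).
MAILS = {
    1: "Suzuki wrote the PHP code",
    2: "Suzuki is on leave",
    3: "the PHP code was merged",
}
KEYS = ["PHP", "Suzuki", "code"]

def build_inverted_index(mails, keys):
    """Map each key word to the variable-length list of IDs of documents including it."""
    return {
        key: sorted(doc_id for doc_id, body in mails.items() if key in body.split())
        for key in keys
    }

def build_store(mails, inverted_index):
    """Map each key word to the contents of the documents that include it."""
    return {key: [mails[doc_id] for doc_id in ids] for key, ids in inverted_index.items()}

INVERTED_INDEX = build_inverted_index(MAILS, KEYS)
STORE = build_store(MAILS, INVERTED_INDEX)
```

Here `INVERTED_INDEX` plays the role of the word-to-ID table, and `STORE` plays the role of the word-to-contents table, with the key in the left position and the document contents in the right position.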
The full-text search engine returns, as a search result, a group of documents in which a word matching a search word appears. Techniques of judging a similarity between documents as a whole are described in Patent Documents 1 to 3, for example.
These techniques do not consider which character string in the document includes the word matching the search word. Therefore, when the search result includes a large number of documents, it is burdensome to find the document that is truly required. For instance, when the search word exists in a template for documents, all of the documents using the template will be returned, and finding the true target, that is, a document including the search word in its main body, in the search result becomes a burden. Herein, the template refers to a header or a footer of a document, a menu at a Web site, a signature of e-mail, or the like.
In the case of e-mail, reply mail or forwarded mail often includes a copy of the original mail at its end. If the copied part includes a search word, the returned search result will include the mail even when the main body of the mail does not include the search word. Such mail is noise when a search has to be conducted for mail including the search word in its main body.
Therefore, if the documents including the search word in the same character string in their main bodies can be collected into one group, the number of documents to be evaluated is reduced, making it easy to find the document that is truly required.
For instance, a technique of detecting documents having overlapping contents with consideration given to occurrence positions of a search word has been proposed (see Patent Document 4). This technique extracts character strings including the search keyword from each of the documents included in a detected search result, and compares the extracted character strings.
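The general idea of extracting and comparing character strings that include the search keyword can be sketched as follows. This is only an illustration under stated assumptions and is not the method of Patent Document 4: the fixed window of `width` words on each side of the keyword, the exact-match comparison, the sample texts, and the names `keyword_contexts` and `group_by_context` are all hypothetical.

```python
from collections import defaultdict

def keyword_contexts(text, keyword, width=2):
    """Return, for each occurrence of keyword, the keyword plus `width` words on each side."""
    words = text.split()
    return [
        " ".join(words[max(0, i - width): i + width + 1])
        for i, word in enumerate(words)
        if word == keyword
    ]

def group_by_context(results, keyword, width=2):
    """Collect into one group the document IDs whose keyword contexts are identical."""
    groups = defaultdict(list)
    for doc_id, text in results.items():
        key = tuple(keyword_contexts(text, keyword, width))
        groups[key].append(doc_id)
    return sorted(groups.values())

# Hypothetical search result for the keyword "PHP": document ID -> text
RESULTS = {
    1: "please check the PHP code today",
    2: "I will check the PHP code today thanks",
    3: "a new PHP release is out",
}
GROUPS = group_by_context(RESULTS, "PHP")
```

Documents 1 and 2 include the keyword in the same surrounding character string and fall into one group, while document 3 forms its own group.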
FIG. 3 illustrates the configuration of a search engine described in Patent Document 4. The search engine 10 is connected with a data source 20 keeping documents to be searched, and is further connected with a client device 30 that outputs an inquiry (query) input by a user to acquire a search result.
The search engine 10 is provided with a database 11 that registers documents therein, and a crawler 12 that acquires documents on the data source 20 at regular intervals to create an index. The crawler 12 repeats an operation of requesting a copy of a document used for index creation, tracing a link included in the document, and collecting another document. When the crawler 12 finds a new document, the crawler 12 registers the new document in the database 11. When the crawler 12 finds that a document is no longer available, the crawler 12 deletes the document from the database 11.
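One pass of such a crawl can be sketched as follows, assuming an in-memory stand-in for the data source 20 in which each document is a pair of text and outgoing links. The function name `crawl`, the dictionary layout, and the rule that a registered document not reached in the current pass is treated as no longer available are assumptions made for this sketch.

```python
from collections import deque

def crawl(data_source, database, seeds):
    """Run one crawl pass from the seed documents and update `database` in place."""
    seen = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or url not in data_source:
            continue  # already visited, or the document cannot be acquired
        seen.add(url)
        text, links = data_source[url]  # request a copy of the document
        database[url] = text            # register the new (or refreshed) document
        queue.extend(links)             # trace the links to collect other documents
    for url in list(database):
        if url not in seen:
            del database[url]           # the document is no longer available
    return database

# Hypothetical data source: identifier -> (text, links)
DATA_SOURCE = {
    "a": ("doc a", ["b", "gone"]),
    "b": ("doc b", ["a"]),
}
DATABASE = {"old": "stale copy"}
crawl(DATA_SOURCE, DATABASE, ["a"])
```

After the pass, `DATABASE` holds the two reachable documents, and the stale entry has been deleted.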
The search engine 10 is provided with a parser 13 that extracts text from the document acquired by the crawler 12 and registered in the database 11, and extracts format information such as paragraphs. The parser 13 performs syntactic analysis, and inputs the text and the format information extracted as a result of the syntactic analysis to a data structure called a store 14 that stores collected documents.
The search engine 10 is provided with an indexer 15 that creates an index based on the text and the format information extracted by the parser 13. The indexer 15 associates a word serving as a key with an ID of a document including the word as described above, and stores the same in an index 16.
The search engine 10 is further provided with a search run time 17, a query-related information creation device 18, and a query-related information comparison device 19. The search run time 17 serves as a search server that searches for a document including a search word as a key in response to a query including the search word received from the client device 30. The query-related information creation device 18 receives a search result from the search run time 17, acquires a document including the search word from the store 14, and generates a character string including the search word. The query-related information comparison device 19 compares the generated character string with the documents in the search result.
The search engine 10 makes the query-related information creation device 18 generate character strings including the search word for each search and for each search result, and makes the query-related information comparison device 19 compare the character strings, thus detecting, as related documents, documents matching each other as a whole and documents including several sampled portions matching each other.
[Patent Document 1] U.S. Pat. No. 6,230,155
[Patent Document 2] U.S. Pat. No. 6,658,423
[Patent Document 3] U.S. Pat. No. 6,978,419
[Patent Document 4] U.S. Pat. No. 6,615,209
The conventional search engines handle different documents having the same contents as individual search results, and can exclude such documents having the same or similar contents beforehand, at the time of the document collection or the index creation. However, the conventional search engines can only judge that documents as a whole, or several portions thereof, have the same or similar contents; they cannot judge that documents have the same or similar contents based on partial identity.
When a search word appears in a menu at a Web site, the conventional search engines return all pages including the menu. Although the returned pages can be limited by designating beforehand words and character strings that are not characteristic of a document, such words and character strings have to be known prior to the designation.
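The limitation described above can be sketched as follows, under the assumption that designated character strings are simply disregarded before the search word is matched. The pages, the menu string, and the function name `search_excluding` are hypothetical.

```python
def search_excluding(pages, term, designated=()):
    """Return IDs of pages including `term` outside the designated character strings."""
    hits = []
    for page_id, text in pages.items():
        for template in designated:
            text = text.replace(template, " ")  # disregard the designated string
        if term in text.split():
            hits.append(page_id)
    return hits

# Hypothetical pages that share the menu "Home Products Contact"
PAGES = {
    1: "Home Products Contact Welcome to our site",
    2: "Home Products Contact New Products announced today",
}
MENU = "Home Products Contact"
```

Without a designation, both pages are returned for the search word "Products"; with the menu designated, only page 2 remains. The sketch also shows the limitation: the menu string must be known before it can be designated.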
Further, the conventional search engines return a search result without consideration given to a relation between the documents. Therefore, a user is required to judge, one by one, whether each of the documents included in the returned search result is a truly required document.