1. Field of the Invention
The present invention relates to a document processing apparatus for searching documents, a control method therefor, a program for implementing the method, and a storage medium storing the program, and more particularly to a document processing apparatus for searching documents based on a plurality of search methods, a control method therefor, a program for implementing the method, and a storage medium storing the program.
2. Description of the Related Art
As a basic method of searching for a desired document (document data), there is conventionally known a keyword-based search, which performs a search based on whether or not a given keyword or keywords (search query) are contained in the document. However, with the keyword-based search, it is difficult to quickly retrieve a desired document. Therefore, various other search methods and search engines have been devised.
The search engines devised for retrieving a desired document include a search engine which uses the relation between keywords or a degree of similarity in syntax information, and a search engine which uses a document vector characterizing the content of a document. As the search engine which uses the document vector, a search engine has been proposed which represents each document as a vector of feature amounts corresponding to respective dimensions (classes) classified by meaning, field, or word of the content of the document, determines a degree of similarity between documents by using an inner product (scalar product) of the vectors of the respective documents, and retrieves a desired document based on the degree of similarity. Further, a document searching apparatus has been proposed, which has a plurality of search engines using various search methods installed therein, performs searches by switching between the plurality of search engines, and/or performs a comprehensive search based on results of searches by the plurality of search engines.
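By way of illustration only, the inner-product similarity described above can be sketched as follows. This is a minimal sketch and not the implementation of any particular proposed search engine: it assumes that each dimension corresponds to one word of a fixed vocabulary and that the feature amount is a raw word count. The names `document_vector`, `inner_product`, and `rank_by_similarity` are hypothetical and are introduced here only for the example.

```python
from collections import Counter


def document_vector(text, vocabulary):
    """Return one feature amount (here: a raw word count) per vocabulary word.

    Assumption: dimensions are classified by word; a real engine might instead
    classify them by meaning or field and weight the counts.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def inner_product(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query_text, documents, vocabulary):
    """Return document indices ordered from most to least similar to the query."""
    query_vec = document_vector(query_text, vocabulary)
    scored = [
        (inner_product(query_vec, document_vector(doc, vocabulary)), index)
        for index, doc in enumerate(documents)
    ]
    # Higher similarity first; ties are broken by the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]
```

In this sketch the degree of similarity grows with document length, because the vectors are not normalized; dividing the inner product by the vector norms (cosine similarity) is a common variation, but the text above specifies only the inner product.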
Moreover, a search method has been proposed, which divides a given keyword into partial character strings each having n characters, and searches for documents which include all the partial character strings, to thereby narrow the scope of the search (see Japanese Laid-Open Patent Publication (Kokai) No. H05-174064).
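The narrowing by partial character strings described above can be illustrated by the following minimal sketch. It does not reproduce the procedure of the cited publication; it assumes overlapping n-character substrings and an in-memory inverted index, and the names `ngrams`, `build_index`, and `candidate_documents` are hypothetical.

```python
def ngrams(text, n):
    """Return the set of all overlapping n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(documents, n):
    """Map each n-character string to the set of document ids containing it."""
    index = {}
    for doc_id, text in enumerate(documents):
        for gram in ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def candidate_documents(keyword, index, n, num_docs):
    """Return ids of documents that contain every n-character part of keyword."""
    grams = ngrams(keyword, n)
    if not grams:
        # Keyword shorter than n characters: no narrowing is possible.
        return set(range(num_docs))
    return set.intersection(*(index.get(gram, set()) for gram in grams))
```

The result is only a narrowed set of candidates: a document may contain every partial character string without containing the keyword itself (for example, with n = 2, the text "ba ab" contains both "ab" and "ba" but not "abab"), so under these assumptions a final check such as `keyword in documents[doc_id]` would still be applied to each candidate.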
Also, a technique has been proposed, which merges a last sentence of a first text block and a head (first) sentence of a second text block, which is likely to be a continuation part of the last sentence of the first text block, into a merged character string for each pair of text blocks taken from a document with layout information, performs a morphological analysis on the merged character string, evaluates naturalness of the merged character string, to thereby determine the most natural connection order of the text blocks, and rearrange the text blocks according to the determined connection order (see Japanese Laid-Open Patent Publication (Kokai) No. H11-015826).
However, according to the above proposed document searching apparatus which searches documents based on a plurality of search methods, in spite of the fact that the documents (document contents, kind of document, etc.) that can be retrieved efficiently and accurately vary depending on the individual search engine or search method, an index for search is created based on the entire document as a single object to be searched, regardless of which search engine or method is used for the search.
Therefore, when the object to be searched is a document containing a plurality of topics, the conventional document vector-based search engine cannot accurately retrieve that object, using the index created from the entire document as a single object to be searched. Further, none of the conventional keyword-based, keyword relation-based, and syntax information-based search engines can quickly retrieve documents containing large amounts of information.