1. Field of the Invention
The present invention generally relates to document retrieval apparatuses, document retrieval methods, programs, and computer-readable media having the programs embodied therein, and particularly relates to a document retrieval apparatus, a document retrieval method, a program, and a computer-readable medium having the program embodied therein for retrieving a document including a query character string from a set of registered documents.
2. Description of the Related Art
Document retrieval methods of retrieving a desired document from a set of registered documents include a character-string-based retrieval method and a word-based retrieval method.
The character-string-based retrieval method searches for a document including a character string that matches a character string specified by a user (hereinafter referred to as a query character string). In order to increase the speed of character-string-based retrieval, a known method utilizes an n-gram index that is prepared in advance by using each sequence of n consecutive characters as an index unit. For each index unit, the n-gram index records the identifiers of the documents in which the unit occurs and the positions of its occurrences in those documents.
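The n-gram index described above can be illustrated with a minimal sketch. It assumes n = 2 and an in-memory dictionary; the function names `build_ngram_index` and `search_ngram` and the sample documents are hypothetical and are not taken from any of the cited disclosures.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """Map each n-character unit to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos in range(len(text) - n + 1):
            index[text[pos:pos + n]].append((doc_id, pos))
    return index

def search_ngram(index, query, n=2):
    """Return sorted (doc_id, start) pairs at which the query string occurs.

    The query is split into overlapping n-grams; a start position survives
    only if every n-gram is recorded at the matching offset from it.
    """
    if len(query) < n:
        raise ValueError("query must contain at least n characters")
    candidates = set(index.get(query[:n], []))
    for offset in range(1, len(query) - n + 1):
        postings = set(index.get(query[offset:offset + n], []))
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
    return sorted(candidates)

# Hypothetical sample documents (romanized for readability).
docs = {1: "keitaidenwa", 2: "taiden", 3: "denwa"}
index = build_ngram_index(docs)
print(search_ngram(index, "taiden"))
```

With these sample documents, the query "taiden" matches both document 2 and an interior position of document 1, because the index carries no information about word boundaries.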
The word-based retrieval method searches for a document including a word that matches a character string specified by a user. In order to enhance the speed of word-based retrieval, a known method utilizes a word index that is prepared in advance by using each word occurring in the documents as an index unit. For each index unit, the word index records the identifiers of the documents in which the word occurs and the positions of its occurrences in those documents.
Each retrieval method has its own drawbacks. In the case of the character-string-based retrieval method, a search is conducted by ignoring boundaries of words, resulting in search results including documents that are not appropriate in light of the user's intention. For example, use of the query character string “taiden” (electrification) may result in “keitaidenwa” (cellular phone) being retrieved.
In the case of the word-based retrieval method, there is a need to extract words by morphological analysis at the time of generating indexes because word boundaries are not explicitly indicated in Japanese sentences. At the level of currently available technology, however, morphological analysis is not free from errors. Such an error in morphological analysis can cause a search error. For example, morphological analysis should convert “tokyotoniarukiyomizudera” into “/to/kyoto/ni/aru/kiyomizudera/”. If an erroneous analysis result “/tokyo/to/ni/aru/kiyomizu/dera/” is produced, the sentence “tokyotoniarukiyomizudera” cannot be retrieved when the query character string is “kyoto”, because “kyoto” is never registered as an index unit.
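The effect of a segmentation error on a word index can be sketched as follows. This is a minimal illustration only: the output of the morphological analyzer is hard-coded as word lists, and the names `build_word_index` and `search_word` are hypothetical.

```python
from collections import defaultdict

def build_word_index(segmented_docs):
    """Map each word to a list of (doc_id, character offset of the word)."""
    index = defaultdict(list)
    for doc_id, words in segmented_docs.items():
        pos = 0
        for word in words:
            index[word].append((doc_id, pos))
            pos += len(word)
    return index

def search_word(index, query):
    """Return sorted (doc_id, offset) pairs for exact word matches."""
    return sorted(index.get(query, []))

# Assumed analyzer outputs for the same sentence (hard-coded for illustration).
correct = {1: ["to", "kyoto", "ni", "aru", "kiyomizudera"]}
erroneous = {1: ["tokyo", "to", "ni", "aru", "kiyomizu", "dera"]}

print(search_word(build_word_index(correct), "kyoto"))
print(search_word(build_word_index(erroneous), "kyoto"))
```

With the correct segmentation the query "kyoto" is found at character offset 2; with the erroneous segmentation the same sentence yields no hit, since only the units produced by the analyzer are registered.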
In order to avoid problems as described above, the system may be provided with both of these retrieval methods, so as to allow users to select one of the retrieval methods according to their needs. Japanese Patent Laid-open Application No. 2000-67070 discloses such a prior-art retrieval method. In this document, special section-mark characters are inserted between words at the time of sentence registration, and n-grams are extracted from the data having the section-mark characters inserted therein, followed by generating indexes. In so doing, n-grams formed by connecting words across the section-mark character are also extracted and registered as indexes. When a user selects the word-based retrieval method, the n-grams including the section marks are used in the search process. On the other hand, when the character-string-based retrieval method is selected, the n-grams including the section marks are ignored in the search process.
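The section-mark scheme can be sketched roughly as below. This is an interpretation under stated assumptions, not the method of the cited application: the mark character `|`, the value n = 2, and the names `extract_ngrams` and `may_match` are hypothetical, and positions are omitted, so the check is only a necessary condition for a match (a candidate filter).

```python
MARK = "|"  # hypothetical section-mark character; must not occur in the text

def extract_ngrams(words, n=2):
    """Return (plain, marked) n-gram sets for one segmented text.

    plain  : n-grams of the text with marks removed, including those that
             connect adjacent words across a word boundary
    marked : n-grams that contain the section mark
    """
    plain_text = "".join(words)
    marked_text = MARK + MARK.join(words) + MARK
    plain = {plain_text[i:i + n] for i in range(len(plain_text) - n + 1)}
    marked = {marked_text[i:i + n] for i in range(len(marked_text) - n + 1)
              if MARK in marked_text[i:i + n]}
    return plain, marked

def may_match(doc_words, query, word_based, n=2):
    """Candidate filter: True if all required query n-grams occur in the document."""
    doc_plain, doc_marked = extract_ngrams(doc_words, n)
    q_plain, q_marked = extract_ngrams([query], n)
    if word_based:
        # Word-based search also requires the n-grams carrying section marks.
        return q_plain <= doc_plain and q_marked <= doc_marked
    # Character-string-based search ignores n-grams carrying section marks.
    return q_plain <= doc_plain

doc = ["keitai", "denwa"]
print(may_match(doc, "taiden", word_based=False))
print(may_match(doc, "taiden", word_based=True))
print(may_match(doc, "denwa", word_based=True))
```

In this sketch the character-string-based check accepts "taiden" through the cross-boundary bigram "id", while the word-based check rejects it because the document contains no bigram marking a word start at "t".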
Japanese Patent Laid-open Application No. 7-85033 discloses another prior-art technology. In this technology, the documents in which each character occurs are identified on a character-by-character basis, and the positions of occurrences in the documents are recorded. Further, a flag is recorded that indicates whether each position of occurrence is at the beginning of a word or at the end of a word. At the time of search, a character-string-based search is achieved based on the recorded positions of the individual characters, and a word-based search is attained by additionally referring to the flags indicative of the beginning or end of words.
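A character-position index with boundary flags might look like the following minimal sketch. The data layout and the names `build_char_index` and `search_chars` are hypothetical; in particular, the rule that a whole-word match must not contain an interior word start is an assumption added here so that a match cannot span two adjacent words.

```python
from collections import defaultdict

def build_char_index(segmented_docs):
    """Map each character to (doc_id, pos, is_word_start, is_word_end) entries."""
    index = defaultdict(list)
    for doc_id, words in segmented_docs.items():
        pos = 0
        for word in words:
            for i, ch in enumerate(word):
                index[ch].append((doc_id, pos, i == 0, i == len(word) - 1))
                pos += 1
    return index

def search_chars(index, query, whole_word=False):
    """Return sorted (doc_id, start) pairs; optionally require word boundaries."""
    if not query:
        return []
    # One lookup per query character: (doc_id, pos) -> (is_start, is_end).
    lookups = [{(d, p): (s, e) for d, p, s, e in index.get(ch, [])} for ch in query]
    hits = []
    for doc_id, pos in lookups[0]:
        flags = [lookups[k].get((doc_id, pos + k)) for k in range(len(query))]
        if any(f is None for f in flags):
            continue  # some character is missing at the required offset
        if whole_word:
            begins_word = flags[0][0]
            ends_word = flags[-1][1]
            inner_start = any(f[0] for f in flags[1:])  # assumed extra check
            if not (begins_word and ends_word and not inner_start):
                continue
        hits.append((doc_id, pos))
    return sorted(hits)

# Hypothetical pre-segmented documents.
char_index = build_char_index({1: ["keitai", "denwa"], 2: ["taiden"]})
print(search_chars(char_index, "taiden"))
print(search_chars(char_index, "taiden", whole_word=True))
```

Because every query character requires its own posting lookup and position join, this layout tends to do more work per query than an index keyed on longer n-character units.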
The scheme taught by Japanese Patent Laid-open Application No. 2000-67070 has drawbacks as follows. According to this disclosure, the word boundaries are represented by section-mark characters. Since characters are generally represented by fixed-length codes (e.g., 2 bytes when Unicode (UCS2) is used), a code value must be set aside for the section mark. This method is therefore not applicable where every possible code value is treated as a meaningful character, leaving no value available to serve as the section mark.
The scheme taught by Japanese Patent Laid-open Application No. 7-85033 has the following drawback. Since the character-string-based retrieval operates on the basis of single-character search, search speed is slower than when an n-gram index is used.
A problem common to both schemes is as follows. When a morphological analysis system for separating words is updated (or its dictionary is updated), the positions of word boundaries may change, which results in the need to regenerate the entire index. Consequently, maintenance of the indexes is time consuming.
In consideration of this, a document-retrieval method has been presented that provides word-based retrieval by using a word-boundary position index that is separate from the n-gram index used for character-string-based retrieval, thereby achieving high-speed retrieval and providing for easy maintenance of the index. This method was assigned to the assignee of this application.
When such a word-boundary position index is used, however, the index size may become undesirably large since the word-boundary position index needs to record a large number of positions of word occurrences.
Accordingly, there is a need for a document-retrieval apparatus, a document-retrieval method, a program, and a computer-readable medium having the program recorded therein that can reduce the size of a word-boundary position index.
Also, there is a need for a document-retrieval apparatus, a document-retrieval method, a program, and a computer-readable medium having the program recorded therein that can enhance search speed.