1. Field of the Invention
The present invention relates to a document retrieval apparatus that rapidly retrieves documents containing plural words in an order specified by the user, using a relatively small index.
2. Description of the Prior Art
There are known document retrieval methods for retrieving required documents from a large collection of documents. One well-known method registers the words included in these documents in an index before any query is issued and uses this index to speed up retrieval.
One example of such a method is word-based retrieval across plural documents. Before retrieval, an index is prepared in addition to the documents; it registers every word that appears in the documents together with pointers to the documents containing each word. At retrieval time, a word is input as the retrieval condition, the pointers to the documents containing that word are looked up in the index, and the corresponding documents are output.
In this method, however, every document containing the word specified as the retrieval condition is retrieved, so the result includes many documents that the user did not intend to retrieve. Narrowing the result by requiring documents to match plural words in the retrieval condition does not eliminate this problem, since the relationship between the query keywords cannot be specified.
Japanese Published Unexamined Patent Application No. Hei 08-249346 discloses a document retrieval apparatus using an adjoining index, which indicates the order of keywords. With this apparatus, a retrieval that considers the relationship between two keywords input as the query condition may be performed.
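The word-to-document index described above can be sketched as follows. This is a minimal illustration, not the prior-art implementation: the function names `build_index` and `retrieve`, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids (pointers) that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():  # assumption: words are space-separated
            index[word].add(doc_id)
    return index

def retrieve(index, word):
    """Return the sorted ids of the documents containing the query word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample documents.
documents = {
    1: "the banquet site is large",
    2: "outside the banquet site",
    3: "a large site",
}
index = build_index(documents)
print(retrieve(index, "banquet"))  # [1, 2]
```

Every document containing the word is returned, which is the behavior the following paragraph identifies as a problem.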
The above apparatus generally uses morpheme analysis technology, which has been developed in the field of natural language processing, in order to extract the words to be registered in an index from the documents to be processed. With current morpheme analysis technology, a document cannot always be broken down into word strings in an accurate and unambiguous manner. For example, morpheme analysis of the text "HIRO EN KAIJO GAI (outside banquet site)" yields more than one result, such as "HIRO | EN | KAIJO | GAI", "HIROEN | KAIJO | GAI", "HIRO | ENKAI | JOGAI", and "HIRO | ENKAIJO | GAI", where "|" designates a break between two words. In such analysis, the same text string may be split at different breakpoints.
In the index used in the document retrieval apparatus described above, each word may have only one adjoining word, so an index structure must be provided for each of the morpheme analysis results, which makes the index enormously large.
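The ambiguity described above can be reproduced with a small sketch that enumerates every way of splitting a string into dictionary words. This is not a morpheme analyzer; the function `segmentations` and the contents of `LEXICON` are hypothetical, chosen only so that the example string from the text yields the four splits it lists.

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

# Hypothetical lexicon, chosen only to reproduce the example in the text.
LEXICON = {"HIRO", "EN", "KAIJO", "GAI", "HIROEN", "ENKAI", "JOGAI", "ENKAIJO"}

splits = segmentations("HIROENKAIJOGAI", LEXICON)
for words in splits:
    print(" | ".join(words))
```

With this lexicon the same fourteen-character string produces four distinct word sequences, each of which would need to be registered by an index that stores one analysis result at a time.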
Japanese Published Unexamined Patent Application No. Hei 08-249354 discloses a document retrieval apparatus that stores the locations of words in a document in the index. With this document retrieval apparatus, the resulting words may be registered together in one index, even if plural breakpoints are obtained for the same word or different word classes are presumed for the same string.
In this apparatus, there also arises the problem that the number of words to be registered in the index is so enormous that the size of the index cannot be ignored.
The above-described situation may arise in any natural language, but it is particularly noticeable in Japanese, in which the breakpoints between words are not clearly marked, unlike in Indo-European languages.
As can be seen from the above description, a prior-art document retrieval apparatus that performs full text retrieval using an index requires a large memory capacity for loading a huge index as well as a long time for index searching, and overall retrieval performance may therefore be degraded. This problem may be significant in, for example, Japanese full text retrieval, since breakpoints between words are not clear in Japanese and the number of words to be registered in the index will be larger than for Indo-European languages. If the index is instead arranged on a character basis rather than a word basis in order to avoid the problem of word breakpoints, the number of entries to be registered will be so large that the index size is inflated.
The present invention has been made in view of the above circumstances and provides a document retrieval apparatus that performs full text retrieval of documents using a relatively small index, not only for documents in Indo-European languages but also for Japanese documents, in which breakpoints between words are not clearly marked.
The present invention also provides a document retrieval apparatus that performs a retrieval search that considers the relationship between words using only the index, without stored full text data of the documents, and that outputs the full text of documents reconstructed from the retrieval result.
The present invention further provides a document retrieval apparatus that stores word class information in a small index and performs a fast retrieval search by comparing the word class information. In other words, the present invention provides a document retrieval apparatus that performs a fast retrieval search using an index that stores the results of morpheme analysis of the documents in a relatively small size.
The document retrieval apparatus in accordance with the present invention has a word storing part and a retrieval search part. The word storing part stores each word included in a document without redundancy, together with additional information on the adjoining words that follow the word in the document. The retrieval search part determines, based on retrieval criteria that include plural words and the arrangement of those words, the correspondence between the retrieval criteria and the plural words stored in the word storing part, in order to check whether or not a document matches the retrieval criteria, i.e., whether or not a document containing the contents corresponding to the input criteria can be retrieved.
More specifically, the word storing part constituting the index stores each word so that it is identified by its address in the word storing part, stores the adjoining words that immediately follow the word, and additionally stores, as information on the word and in a predetermined order, the addresses at which those adjoining words are stored. The word order in a document is thus represented as a chain of addresses, so that redundant words are eliminated and an index of relatively small size is obtained.
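The word storing part described above might be sketched as follows. This is one possible reading of the description, not the claimed structure: the class name `WordStore`, the use of list slots as addresses, and the sample sentence are assumptions made for illustration.

```python
class WordStore:
    """Each distinct word is stored once; its address is its slot number."""

    def __init__(self, words):
        self.words = []           # address -> word
        self.address_of = {}      # word -> address
        self.next_addresses = []  # address -> addresses of following words,
                                  # kept in order of occurrence
        previous = None
        for word in words:
            address = self._store(word)
            if previous is not None:
                self.next_addresses[previous].append(address)
            previous = address

    def _store(self, word):
        # Store the word only if it has not been stored before.
        if word not in self.address_of:
            self.address_of[word] = len(self.words)
            self.words.append(word)
            self.next_addresses.append([])
        return self.address_of[word]

    def follows(self, first, second):
        """True if `second` appears immediately after `first` somewhere."""
        a = self.address_of.get(first)
        b = self.address_of.get(second)
        if a is None or b is None:
            return False
        return b in self.next_addresses[a]

# Hypothetical sample document of nine words, six of them distinct.
store = WordStore("the cat saw the dog and the cat ran".split())
print(len(store.words))             # 6
print(store.follows("the", "dog"))  # True
print(store.follows("dog", "cat"))  # False
```

Word order is kept only as links between addresses, so a repeated word such as "the" occupies one slot regardless of how often it occurs.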
The document retrieval apparatus in accordance with the present invention may be carried out in a variety of modes. As will be described in the following embodiments, it may be achieved by constituting the index in the word storing part in a trial form; by constituting an index shared by the words of plural documents; by constituting the index so that two forms of the same word, the original form and a conjugated form, are stored and connected by their addresses; or by storing words in the index tagged with word class information, so that the document retrieval part can determine matches against retrieval criteria that include word class information.
In the document retrieval apparatus in accordance with the present invention, a document output part outputs the plural words determined to match by the document retrieval part, in the order obtained by tracing the addresses in the index, so as to restore the documents matching the criteria.
In accordance with the present invention, the full text of the retrieved documents may be restored and supplied from the index alone, without separately storing the full text of the registered documents; thus the amount of memory required for storing the full text of documents may be reduced.
The document retrieval apparatus in accordance with the present invention stores information specifying a document (a pointer) and the words included in the document in the document indexing part prior to a retrieval search. When performing a retrieval search, the document identifying part identifies, from among the plural documents stored in the document indexing part, the documents containing all of the words included in the retrieval criteria, based on retrieval criteria having plural words and the order thereof. A document retrieving part then uses the corresponding word storing part to search, within the set of documents obtained from the document identifying part, for the documents matching the plural words and the order thereof included in the retrieval criteria.
When an index (word storing part) is made for each of plural documents or for each of plural document sets, the documents matching the retrieval criteria may be identified. The retrieval search is achieved efficiently, since it is performed using only the indexes corresponding to the identified documents.
The present invention may be carried out by executing on a computer a program that achieves the functionality of the document retrieval apparatus described above. More specifically, the index described above may be stored on a storage medium such as a CD-ROM and installed on or accessed by a computer for the full text retrieval search described above.
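Restoring a document by tracing addresses can be sketched as below. The sketch assumes that each stored word keeps the addresses of its following words in order of occurrence, and that the document starts at address 0; both are assumptions about one workable layout, and the names `build_links` and `restore` are hypothetical.

```python
def build_links(words):
    """Store each distinct word once and record, per word, the addresses
    of the words that follow it, in order of occurrence."""
    vocabulary, address_of, links = [], {}, []
    previous = None
    for word in words:
        if word not in address_of:
            address_of[word] = len(vocabulary)
            vocabulary.append(word)
            links.append([])
        address = address_of[word]
        if previous is not None:
            links[previous].append(address)
        previous = address
    return vocabulary, links

def restore(vocabulary, links, start=0):
    """Trace the address links from `start`, using each word's successor
    addresses in their stored order, and return the word sequence."""
    if not vocabulary:
        return []
    used = [0] * len(vocabulary)  # successors already consumed per address
    address, output = start, []
    while True:
        output.append(vocabulary[address])
        if used[address] == len(links[address]):
            break  # no unused successor: end of the document
        following = links[address][used[address]]
        used[address] += 1
        address = following
    return output

# Hypothetical sample document.
original = "the cat saw the dog and the cat ran".split()
vocabulary, links = build_links(original)
print(" ".join(restore(vocabulary, links)))
```

Because the k-th visit to a word consumes its k-th stored successor, the trace reproduces the original word order even though each word is stored only once.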
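The two-stage search described above can be sketched as follows. It is a simplified illustration: the names `build` and `search` are hypothetical, each document gets its own successor table standing in for a word storing part, and the order check only tests that each consecutive pair of query words is adjacent somewhere in the document.

```python
from collections import defaultdict

def build(documents):
    doc_index = defaultdict(set)  # word -> ids of documents containing it
    successors = {}               # doc id -> {word: words that follow it}
    for doc_id, text in documents.items():
        words = text.split()
        links = defaultdict(set)
        for current, following in zip(words, words[1:]):
            links[current].add(following)
        for word in words:
            doc_index[word].add(doc_id)
        successors[doc_id] = links
    return doc_index, successors

def search(doc_index, successors, query):
    if not query:
        return []
    # Stage 1: keep the documents that contain every query word.
    candidates = set.intersection(*(doc_index.get(w, set()) for w in query))
    # Stage 2: keep the candidates in which each consecutive pair of
    # query words is adjacent, using only that document's own table.
    return sorted(
        d for d in candidates
        if all(b in successors[d].get(a, set())
               for a, b in zip(query, query[1:]))
    )

# Hypothetical sample documents.
documents = {
    1: "banquet site outside",
    2: "outside banquet site",
    3: "site of the banquet outside",
}
doc_index, successors = build(documents)
print(search(doc_index, successors, ["outside", "banquet"]))  # [2]
```

All three documents pass the first stage, but only document 2 has "banquet" immediately after "outside", so the order check in the second stage runs on a small candidate set rather than on every index.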
It is anticipated that the retrieval result will be as intended by the searcher if the relationship between words or the grammatical role can be considered at the time of retrieval search.
In order to perform a retrieval search while considering the relationship between words, the pointer to the document where a word appears and the location of a word in the document should be registered in the index. When performing a retrieval search, the pointer to the document that satisfies the retrieval criteria may be identified from the index by receiving the plural words and the relationship thereof as the retrieval criteria.
If such a method is implemented naively, a large index is needed in addition to the full text of the documents. If a document is to be identified from a word alone, it is sufficient to maintain one pointer to the document for each distinct word in the document. However, if both the document and the location at which a word appears in that document are to be identified from a word, the location must be maintained in addition to the pointer for every occurrence of every word in the document.
From this point of view, the present invention retrieves, restores and outputs the appropriate document without holding the full text data of the document, and prevents the index from growing because of a large quantity of pointers as in the case described above.
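The difference in index size can be made concrete by counting entries. The sketch below is an assumption-laden illustration: `posting_counts` is a hypothetical helper, words are taken to be space-separated, and each entry is counted as one unit regardless of its storage cost.

```python
def posting_counts(documents):
    """Return (document-level entries, positional entries) for a corpus."""
    doc_level = 0
    positional = 0
    for text in documents.values():
        words = text.split()
        doc_level += len(set(words))  # one pointer per distinct word per document
        positional += len(words)      # one (pointer, location) pair per occurrence
    return doc_level, positional

# Hypothetical sample documents.
documents = {
    1: "to be or not to be",
    2: "to do is to be",
}
print(posting_counts(documents))  # (8, 11)
```

The positional count grows with the total number of word occurrences, while the document-level count grows only with the number of distinct words per document.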
It may be considered that documents in which breakpoints between words are not clear, such as Japanese documents, can be retrieved by building the index on a character basis instead of a word basis. In that case, entries have to be registered in the index so that the document, and the location of each character in the document, can be identified from a character string. More specifically, an index is provided in addition to the full text of the documents, and every character that appears in the documents, the pointer to the document in which the character appears, and the location of the character in that document are registered as a set. A retrieval search is performed by receiving a character string as the retrieval criteria and determining the pointers to the documents that contain the characters constituting the string in the specified order of appearance.
If such a method is implemented naively, the required index is extremely large. This is because the total number of characters in a document is much larger than the number of distinct words or the total number of words, so that the total amount of location information in particular increases.
From this point of view, the present invention builds a word-based index, so as to prevent the index from becoming large as in the case described above.
In addition, the use of morpheme analysis allows the words in a document to be extracted even when the breakpoints between words are not clearly marked, as in Japanese documents, and also allows word class information for each word to be added to the index. As a result, the index may be readily generated and the retrieval search using the index may be performed faster.
When morpheme analysis technology is used, a document cannot always be broken down into word strings in an accurate and unambiguous manner. As shown by the example "HIROENKAIJOGAI (outside banquet site)" cited above, the same text may be split into words at different breakpoints. Therefore, if an index is created simply from the results of morpheme analysis, the number of different words registered for the same text string becomes enormous, the index becomes bloated, and the index search becomes slower.
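A character-based index of the kind described above can be sketched as follows. The names `build_char_index` and `find` and the sample strings are hypothetical, and the sketch treats each letter of the romanized example as one character for simplicity.

```python
from collections import defaultdict

def build_char_index(documents):
    """Map each character to the set of (document id, position) pairs
    at which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for position, char in enumerate(text):
            index[char].add((doc_id, position))
    return index

def find(index, query):
    """Return ids of documents in which the characters of `query`
    occur consecutively and in order."""
    if not query:
        return []
    starts = index.get(query[0], set())
    hits = {
        doc_id
        for doc_id, pos in starts
        if all((doc_id, pos + k) in index.get(ch, set())
               for k, ch in enumerate(query))
    }
    return sorted(hits)

# Hypothetical sample documents.
documents = {1: "HIROENKAIJOGAI", 2: "KAIJOGAI"}
index = build_char_index(documents)
print(find(index, "ENKAI"))  # [1]
print(find(index, "KAIJO"))  # [1, 2]
print(sum(len(entries) for entries in index.values()))  # 22
```

The index holds one entry per character occurrence (22 entries for 22 characters here), which is why its size tracks the total text length rather than the vocabulary size.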
If fewer words than in the above case are registered in the index, a discrepancy may arise between the word intended by the searcher and the words registered in the index of the retrieval system. For example, when the searcher wishes to retrieve a document containing the string "coffee beans with sucked dregs", the searcher may intend "sucked dregs" as the search word, whereas the retrieval system may have "suck" and "dreg" registered in the index. Conversely, the searcher may specify "dreg" as the search word, whereas the retrieval system may have only "sucked dregs" registered in the index. In both cases, the search fails to retrieve the appropriate document.
From this point of view, the present invention may generate the index by maintaining the order of words while eliminating redundant words, so that as many words as possible can be registered in an index of smaller size, in order to perform a faster index search and to achieve a complete, error-free retrieval search as intended by the searcher.
By performing the morpheme analysis, not only may a document be split into component words but also the word class may be presumed.
However, the presumed word class is not always correct or unambiguous. For example, in a document containing "coffee beans along with sucked dregs", "sucked" can be presumed to derive from the noun "suck" as well as from the verb "suck". It is preferable that a retrieval search hit this document when the searcher specifies in the criteria the word "suck", which is not explicitly included in the document.
From this point of view, the present invention may generate an index that maintains word class information in a relatively small size, so that an index search can be performed as intended by the searcher by using the information on the plural different word classes presumed for the word in the same text string.
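An index that keeps several presumed word classes for the same string might look like the sketch below. The table `READINGS` stands in for the output of a morpheme analyzer and is entirely hypothetical, as are the word class labels and the names `build_class_index` and `lookup`.

```python
# Hypothetical analyzer output: surface form -> possible
# (base form, word class) readings for that string.
READINGS = {
    "coffee": [("coffee", "noun")],
    "beans": [("bean", "noun")],
    "sucked": [("suck", "noun"), ("suck", "verb")],
    "dregs": [("dreg", "noun")],
}

def build_class_index(doc_id, words, readings):
    """Register every presumed (base form, word class) reading of each word."""
    index = {}
    for word in words:
        for base, word_class in readings.get(word, [(word, "unknown")]):
            index.setdefault((base, word_class), set()).add(doc_id)
    return index

def lookup(index, base, word_class=None):
    """Return ids of documents containing `base`, optionally restricted
    to one word class."""
    matches = set()
    for (entry_base, entry_class), doc_ids in index.items():
        if entry_base == base and word_class in (None, entry_class):
            matches |= doc_ids
    return sorted(matches)

words = "coffee beans along with sucked dregs".split()
index = build_class_index(1, words, READINGS)
print(lookup(index, "suck"))               # [1]
print(lookup(index, "suck", "verb"))       # [1]
print(lookup(index, "suck", "adjective"))  # []
```

The query "suck" matches the document even though only "sucked" appears in it, and a query that also names a word class is compared against every reading stored for that string.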