1. Field of the Invention
The present invention generally relates to a device, a method, and a memory medium having a program embodied therein for document retrieval.
2. Description of the Related Art
Document-retrieval techniques retrieve documents that include a query character string from a document database. One such technique is a likely-relevance retrieval scheme, which retrieves documents that include character strings resembling a query character string.
The likely-relevance retrieval technique is disclosed, for example, in the Japanese Patent Laid-open Application No. 11-85776. This technique calculates ranking scores of partial character strings that are part of a query character string based on the frequency of occurrences, and searches for the query character string in the document by using the obtained ranking scores.
Another example of the likely-relevance retrieval technique is found in “Development and Evaluation of Full-Document-Based Retrieval System ‘Retrieval Express’,” Proceedings of the Third Annual Meeting of the Association for Natural Language Processing, pp. 361-364, March, 1997. This technique obtains frequency of occurrences of a query character string in a document by obtaining all positions of such occurrences in the document based on occurrences of partial character strings, and calculates a ranking score of the query character string in respect of the document.
The technique disclosed in the above patent laid-open application, however, merely searches for a query character string in a single document, and cannot be used to retrieve a document including a query character string from a plurality of documents.
Further, the longer the query character string, the larger the number of partial character strings that are to be taken into account in the search. Also, the longer the query character string, the larger the number of document segments that are to be processed for calculation of ranking scores. This results in an increase in retrieval time. For example, when a query character string is “ABCDEF” (each capital letter represents a single Japanese character for the sake of explanation), and partial character strings each comprised of 2 characters are used as a unit of processing, one can extract five partial character strings, i.e., “AB”, “BC”, “CD”, “DE”, and “EF”. In general, when a query character string is comprised of m characters, and n characters constitute a unit of processing, one can extract (m − n + 1) partial character strings. Since the ranking score needs to be calculated at every position where at least one of extracted partial character strings appears, the number of positions that require computation increases as the number of partial character strings increases.
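The count (m − n + 1) can be checked with a short sketch. The function name `overlapping_ngrams` is illustrative only and does not come from the cited application; Latin letters stand in for Japanese characters, as in the example above.

```python
def overlapping_ngrams(query: str, n: int) -> list[str]:
    """Return every length-n substring of query, in order of position."""
    if n <= 0:
        raise ValueError("n must be positive")
    # range() is empty when the query is shorter than n, so the result is [].
    return [query[i:i + n] for i in range(len(query) - n + 1)]


bigrams = overlapping_ngrams("ABCDEF", 2)
print(bigrams)       # ['AB', 'BC', 'CD', 'DE', 'EF']
print(len(bigrams))  # 5, i.e. m - n + 1 with m = 6 and n = 2
```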
A ranking score of a partial character string in the document is calculated based on frequency of occurrences of the partial character string in the document. Some of the partial character strings appearing in the document may have no bearing on the query character string, yet such occurrences are counted toward the ranking scores. This reduces accuracy of the search. For example, the query character string “ABCDEF” may appear only once in a given document, and another character string “WXYZEF” that has a totally different meaning may appear many times in this document. In such a case, the partial character string “EF” appears as many times as the number of occurrences of “ABCDEF” plus the number of occurrences of “WXYZEF”. As a result, the ranking score of the partial character string “EF” ends up being inappropriately high despite the rare occurrence of the query character string, resulting in an inappropriately high ranking score for the query character string.
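The inflation of the count for “EF” can be reproduced with a small sketch. The helper `count_occurrences` and the sample document (one occurrence of “ABCDEF” followed by four of “WXYZEF”) are assumptions made for illustration.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    start = 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1  # advance by one character so overlaps are counted


# Hypothetical document: the query appears once, an unrelated string four times.
document = "ABCDEF" + "WXYZEF" * 4

print(count_occurrences(document, "ABCDEF"))  # 1
print(count_occurrences(document, "EF"))      # 5
```

A score driven by the raw count of “EF” would therefore reflect five occurrences even though the query character string itself occurs only once.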
Another problem is that a search cannot be conducted if the query character string is shorter than the unit of processing. This is because the query character string cannot be divided into partial character strings having the length of the unit of processing. For example, if the query character string is “B”, and two characters constitute a unit of processing, this method cannot perform the search since the query character string is shorter than the unit of processing.
The technique disclosed in “Development and Evaluation of Full-Document-Based Retrieval System ‘Retrieval Express’,” Proceedings of the Third Annual Meeting of the Association for Natural Language Processing, pp. 361-364, March, 1997 has the same problem as the technique disclosed in the above patent laid-open application. That is, the amount of computation for counting occurrences of a query character string in a document increases as the length of the query character string increases, resulting in lengthening of a processing time for document retrieval. The larger the number of occurrences of a query character string, the more conspicuous the increase in the processing time for document retrieval.
Accordingly, there is a need for a retrieval scheme that can retrieve a document easily at high speed.
There is another need for a retrieval scheme in which the computation load of selecting a document and calculating ranking scores can be reduced, thereby achieving high-speed processing.
There is another need for a retrieval scheme that is free from an influence of other character strings having no relevance to a query character string, thereby improving retrieval accuracy.
There is another need for a retrieval scheme in which the computation load of obtaining positions of occurrences of a query character string can be reduced, thereby achieving high-speed document retrieval.
There is another need for a retrieval scheme in which the number of score searches can be reduced, thereby boosting a search speed.
There is another need for a retrieval scheme that can retrieve a document even if the length of a query character string is shorter than a unit of processing.
There is another need for a retrieval scheme in which the computation load of calculating ranking scores is reduced, thereby achieving high-speed retrieval.
It is a general object of the present invention to provide a document-retrieval scheme that substantially obviates one or more of the problems caused by the limitations and disadvantages of the related art.
Features and advantages of the present invention will be set forth in the description which follows, and in part will become apparent from the description and the accompanying drawings, or may be learned by practice of the invention according to the teachings provided in the description. Objects as well as other features and advantages of the present invention will be realized and attained by a method and a device for document retrieval particularly pointed out in the specification in such full, clear, concise, and exact terms as to enable a person having ordinary skill in the art to practice the invention.
To achieve these and other advantages and in accordance with the purpose of the invention, as embodied and broadly described herein, the invention provides a method for document retrieval comprising the steps of dividing a query character string into partial character strings, selecting one or more documents from a plurality of registered documents such that the one or more documents each include all the partial character strings, computing respective scores of the partial character strings for each of the one or more documents, and computing a score of the query character string from the respective scores of the partial character strings for each of the one or more documents.
In the method described above, the one or more documents that include the partial character strings resembling the query character string are selected prior to the computation of scores. Because of this screening process, a document can be retrieved from the plurality of registered documents at high speed.
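The screening step can be sketched as follows. This is a minimal illustration that assumes documents are held in memory as plain strings keyed by an identifier; the function name `select_candidates` and the sample data are hypothetical, and a practical system would presumably consult an index rather than scan each text.

```python
def select_candidates(documents: dict[str, str], parts: list[str]) -> list[str]:
    """Return ids of documents that contain every partial character string."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if all(part in text for part in parts)
    ]


# Hypothetical registered documents.
documents = {
    "d1": "xxABCDEFxx",   # contains the query itself
    "d2": "WXYZEF",       # contains only "EF"
    "d3": "ABxxCDxxEF",   # contains all parts, but not contiguously
}
parts = ["AB", "CD", "EF"]

print(select_candidates(documents, parts))  # ['d1', 'd3']
```

Only the documents that survive this filter need scores computed; "d2" is discarded without any scoring work. Document "d3" is kept because it resembles the query without containing it verbatim.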
According to one aspect of the present invention, the method as described above is such that the step of dividing divides the query character string into the partial character strings that generally do not overlap and that cover a full length of the query character string.
In the method described above, the computation load of selecting the one or more documents and computing scores can be reduced, thereby attaining high-speed document retrieval.
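One possible reading of this division is sketched below. It assumes that the pieces are taken at a stride equal to the unit length n, and that a final piece aligned to the end of the query is added when the query length is not a multiple of n, which is the only place an overlap arises. That reading of "generally do not overlap", and the name `covering_ngrams`, are assumptions rather than details taken from the description.

```python
def covering_ngrams(query: str, n: int) -> list[str]:
    """Split query into length-n pieces that cover it with minimal overlap."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(query) < n:
        return []
    last = len(query) - n                  # start of the final possible piece
    starts = list(range(0, last + 1, n))   # non-overlapping starting positions
    if starts[-1] != last:
        starts.append(last)                # tail piece overlaps its predecessor
    return [query[s:s + n] for s in starts]


print(covering_ngrams("ABCDEF", 2))  # ['AB', 'CD', 'EF']
print(covering_ngrams("ABCDE", 2))   # ['AB', 'CD', 'DE']
```

For “ABCDEF” this yields three partial character strings instead of the five produced by taking every position, so fewer strings have to be looked up and scored.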
According to another aspect of the present invention, the method described first in the above is such that the step of computing respective scores of the partial character strings includes the steps of obtaining a first count indicating how many of the registered documents include a given one of the partial character strings, obtaining second counts each indicating how many times a corresponding one of the partial character strings appears in a given one of the one or more documents, obtaining the smallest of the second counts, and obtaining a score of the given one of the partial character strings for the given one of the one or more documents from the first count and the smallest of the second counts such that the score of the given one of the partial character strings increases as the first count decreases and as the smallest of the second counts increases.
In the method described above, influence of irrelevant occurrences of the partial character strings can be reduced when computing scores, thereby improving retrieval accuracy.
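The effect of using the smallest of the second counts can be sketched as follows. The description only requires that the score increase as the first count decreases and as the smallest second count increases; the specific formula used here, `min_count * log(1 + N / first_count)` with N the number of registered documents, is an assumption chosen to satisfy that requirement, and all names are hypothetical.

```python
import math


def partial_scores_min_count(
    documents: dict[str, str], parts: list[str], doc_id: str
) -> dict[str, float]:
    """Score each partial string using the smallest in-document count."""
    n_docs = len(documents)
    text = documents[doc_id]
    # Second counts: occurrences of each part in this document.
    # str.count counts non-overlapping occurrences, which suffices here.
    min_count = min(text.count(part) for part in parts)
    scores = {}
    for part in parts:
        # First count: number of registered documents containing the part.
        first_count = sum(1 for t in documents.values() if part in t)
        # Assumed scoring formula (not taken from the description).
        scores[part] = (
            min_count * math.log(1 + n_docs / first_count) if first_count else 0.0
        )
    return scores


documents = {"d1": "ABCDEF" + "WXYZEF" * 4, "d2": "WXYZEF"}
scores = partial_scores_min_count(documents, ["AB", "CD", "EF"], "d1")
print(scores)
```

In "d1" the string “EF” occurs five times, but the smallest count among the three partial strings is 1, so the four occurrences that belong to “WXYZEF” do not raise the score of “EF”.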
According to another aspect of the present invention, the method described first in the above is such that the step of computing respective scores of the partial character strings includes the steps of obtaining a first count indicating how many of the registered documents include a given one of the partial character strings, obtaining a second count indicating how many times the query character string appears in a given one of the one or more documents, and obtaining a score of the given one of the partial character strings for the given one of the one or more documents from the first count and the second count such that the score of the given one of the partial character strings increases as the first count decreases and as the second count increases.
In the method described above, influence of irrelevant occurrences of the partial character strings within a document can be eliminated when computing scores, thereby improving retrieval accuracy.
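The difference between counting the query character string itself and taking the smallest count among its partial character strings shows up when the partial strings also occur separately. The sketch below compares the two candidate second counts on a hypothetical document; the function names are illustrative.

```python
def min_partial_count(text: str, parts: list[str]) -> int:
    """Smallest number of occurrences among the partial character strings."""
    return min(text.count(part) for part in parts)


def query_count(text: str, query: str) -> int:
    """Number of occurrences of the whole query character string."""
    return text.count(query)


# Hypothetical document: the query once, then each part once more in isolation.
text = "ABCDEFxxABxxCDxxEF"
parts = ["AB", "CD", "EF"]

print(min_partial_count(text, parts))  # 2
print(query_count(text, "ABCDEF"))     # 1
```

Using the count of the whole query as the second count removes the contribution of the isolated “AB”, “CD”, and “EF”, at the cost of locating every occurrence of the query in the document.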
According to another aspect of the present invention, the method described above is such that the step of obtaining a second count further includes a step of placing an upper limit on the second count.
In the method described above, the computation load of detecting positions of the query character string can be reduced, thereby helping to achieve high-speed document retrieval.
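A capped count can stop scanning as soon as the upper limit is reached, which is where the saving in position detection comes from. The sketch below assumes a plain substring scan; the name `count_with_limit` and the limit values are illustrative.

```python
def count_with_limit(text: str, pattern: str, limit: int) -> int:
    """Count occurrences of pattern in text, stopping once limit is reached."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    start = 0
    while count < limit:
        pos = text.find(pattern, start)
        if pos == -1:
            break
        count += 1
        start = pos + 1  # continue the scan just past this match
    return count


print(count_with_limit("ABCDEF" * 1000, "ABCDEF", 10))  # 10
print(count_with_limit("ABCDEF" * 3, "ABCDEF", 10))     # 3
```

In the first call the scan ends after the tenth match instead of visiting all 1000 occurrences.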
According to another aspect of the present invention, the method described first in the above is such that the step of selecting one or more documents selects the one or more documents each of which includes the query character string, and the step of computing respective scores of the partial character strings includes the steps of obtaining a first count indicating how many of the registered documents include the query character string, obtaining a second count indicating how many times a given one of the partial character strings appears in a given one of the one or more documents, and obtaining a score of the given one of the partial character strings for the given one of the one or more documents from the first count and the second count such that the score of the given one of the partial character strings increases as the first count decreases and as the second count increases.
In the method described above, influence of irrelevant occurrences of the partial character strings across different documents can be eliminated, thereby contributing to improved accuracy of document retrieval.
According to another aspect of the present invention, the method described first in the above is such that the step of selecting one or more documents selects the one or more documents each of which includes the query character string, and the step of computing respective scores of the partial character strings includes the steps of obtaining a first count indicating how many of the registered documents include the query character string, computing a limit from the first count, obtaining a second count indicating how many times the query character string appears in a given one of the one or more documents while limiting an upper end of the second count to the limit, and obtaining a score of a given one of the partial character strings for the given one of the one or more documents from the first count and the second count such that the score of the given one of the partial character strings increases as the first count decreases and as the second count increases.
In the method described above, influence of irrelevant occurrences of the partial character strings can be eliminated, and the computation load of detecting positions of the query character string can be reduced, thereby contributing to accurate and high-speed document retrieval.
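The combination of a query-based first count and a limited second count can be sketched as follows. This passage does not state how the limit is derived from the first count or which scoring formula is used, so both are assumptions here: the limit rule is passed in as a caller-supplied function `limit_fn`, and the score uses `second_count * log(1 + N / first_count)`. All names and data are hypothetical.

```python
import math
from typing import Callable


def count_with_limit(text: str, pattern: str, limit: int) -> int:
    """Count occurrences of pattern in text, stopping once limit is reached."""
    count, start = 0, 0
    while count < limit:
        pos = text.find(pattern, start)
        if pos == -1:
            break
        count += 1
        start = pos + 1
    return count


def scores_with_limited_query_count(
    documents: dict[str, str],
    query: str,
    parts: list[str],
    doc_id: str,
    limit_fn: Callable[[int], int],
) -> dict[str, float]:
    """Score each partial string from query-level counts (assumed formula)."""
    n_docs = len(documents)
    # First count: number of registered documents that include the query.
    first_count = sum(1 for text in documents.values() if query in text)
    if first_count == 0:
        return {part: 0.0 for part in parts}
    limit = limit_fn(first_count)  # limit rule is an assumption of this sketch
    second_count = count_with_limit(documents[doc_id], query, limit)
    weight = math.log(1 + n_docs / first_count)
    return {part: second_count * weight for part in parts}


documents = {"d1": "ABCDEF" * 5, "d2": "ABCDEF", "d3": "WXYZEF"}
parts = ["AB", "CD", "EF"]
rule = lambda first_count: first_count + 1  # purely illustrative limit rule

s1 = scores_with_limited_query_count(documents, "ABCDEF", parts, "d1", rule)
s2 = scores_with_limited_query_count(documents, "ABCDEF", parts, "d2", rule)
print(s1["EF"], s2["EF"])
```

Because both counts are taken at the level of the whole query, every partial character string of a given document receives the same score in this sketch, and “WXYZEF” in "d3" contributes nothing.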
According to another aspect of the present invention, a method for document retrieval includes the steps of providing respective indexes for documents, each of the respective indexes listing partial character strings found in a corresponding document and respective positions thereof in the corresponding document, selecting the partial character strings which start with a character string identical to a query character string, selecting one or more documents from the documents such that the one or more documents each include at least one of the selected partial character strings, computing respective scores of the selected partial character strings for each of the one or more documents, and computing a score of the query character string from the respective scores of the selected partial character strings for each of the one or more documents.
In the method described above, appropriate document retrieval can be attained even when the query character string is shorter than a length of the partial character strings.
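The handling of a short query can be sketched with a per-document index that maps each partial character string to its positions. The index layout (a dictionary of position lists), the function names, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(text: str, n: int) -> dict[str, list[int]]:
    """Map each length-n partial character string to its positions in text."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i)
    return dict(index)


def match_short_query(
    indexes: dict[str, dict[str, list[int]]], query: str
) -> dict[str, dict[str, list[int]]]:
    """For each document, keep the index entries that start with query."""
    hits = {}
    for doc_id, index in indexes.items():
        selected = {
            gram: positions
            for gram, positions in index.items()
            if gram.startswith(query)
        }
        if selected:  # the document includes at least one selected string
            hits[doc_id] = selected
    return hits


documents = {"d1": "ABCB", "d2": "XYZ"}
indexes = {doc_id: build_index(text, 2) for doc_id, text in documents.items()}

print(match_short_query(indexes, "B"))  # {'d1': {'BC': [1]}}
```

With a two-character unit, the one-character query “B” selects the indexed string “BC”, so "d1" can be retrieved and scored. In this simple sketch the final “B” of "d1" is not indexed because no two-character string starts there.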