The present invention relates to a method and apparatus for generating responses to queries to a document retrieval system. When a large corpus (database) of documents is searched for relevant terms (query terms), it is desirable to find small relevant passages of text (called "hits" or "hit passages") and rank them according to an estimate of the degree to which they will provide the information sought.
If the document database is very large, the number of hit passages generated may be far too high to be helpful to the user. Mechanisms are needed to minimize the number of hit passages that a user must examine before he or she either has found the desired information or can reasonably conclude that the information sought is not in the collection of texts.
This type of specific, "fine-grained" information access is becoming increasingly important for on-line information systems and is not well served by traditional document retrieval techniques. The problem is exacerbated with the use of small queries (of only a few words), which tend to generate larger numbers of retrieved documents.
When both the query and the size of the target (hit) passage are small, one of the challenges in current systems is that of dealing effectively with the paraphrase variations that occur between the description of the information sought and the content of the text passages that may constitute suitable answers. Literal search engines will not return paraphrases, and therefore may miss important and relevant information. Search engines that allow paraphrases may generate too many responses, often without an adequate hierarchical ranking, making the query response of minimal usefulness.
Thus, another challenge which is not currently well met is the effective ranking of the resulting hit passages. A high-quality ranking of matching document locations in response to queries is needed to enhance efficient information access.
Classical information retrieval (also called "document retrieval") measures a query against a collection of documents and returns a set of "retrieved" documents. A useful variant (called "relevance ranking") ranks the retrieved documents in order of estimated relevance to the query, usually by some function of the number of occurrences of the query terms in the document and the number of occurrences of those same terms in the collection as a whole.
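By way of illustration only, the following Python sketch shows one possible function of the kind described above. The function name rank_documents, the whitespace tokenization, and the logarithmic weighting are assumptions made for this example; the text above does not specify any particular formula.

```python
import math
from collections import Counter

def rank_documents(query_terms, documents):
    """Return document indices ordered by estimated relevance (illustrative)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Occurrences of every term in the collection as a whole.
    collection_counts = Counter(tok for tokens in tokenized for tok in tokens)
    total_tokens = sum(collection_counts.values())
    scored = []
    for idx, tokens in enumerate(tokenized):
        doc_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            term = term.lower()
            if doc_counts[term]:
                # Assumed weighting: terms that are rare in the collection
                # contribute more per occurrence than common ones.
                weight = math.log(1.0 + total_tokens / collection_counts[term])
                score += doc_counts[term] * weight
        scored.append((score, idx))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]
```

Note that such a function returns whole documents in ranked order and gives no indication of where within a document the query terms occur.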
Document retrieval techniques do not, however, attempt to identify specific positions or passages within the retrieved documents where the desired information is likely to be found. Thus, when a retrieved document is sufficiently large and the information sought is specific, a substantial residual task remains for the information seeker; it is still necessary to scan the retrieved document to see where the information sought might be found, if indeed the desired information is actually present in the document. A mechanism is needed to address this shortcoming.
In most previous information retrieval procedures for passage retrieval, a passage granularity is chosen at indexing time, and these units are indexed and then either retrieved as if they were small documents, or collections of individual sentences are retrieved and assembled together to produce passages. See Salton et al., "Approaches to Passage Retrieval in Full Text Information Systems," Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 93) (incorporated herein by reference), ACM Press, 1993, pp 49-58; Callan, J. P., "Passage-Level Evidence in Document Retrieval," Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR 94) (also incorporated herein by reference), Springer-Verlag, 1994, pp 302-310; and Wilkinson, R., "Effective Retrieval of Structured Documents," (also in Proceedings of the Seventeenth, etc., at pp 311-317). It would be useful to have a system that dynamically sized passages for retrieval based upon the degree to which the retrieved passage matches the query phrase.
Recently, a different approach has been proposed, based upon hidden Markov models and capable of dynamically selecting a passage. See Mittendorf et al., "Document and Passage Retrieval Based on Hidden Markov Models," (Proceedings of the Seventeenth, etc., pp 318-327). However, this approach does not deal with the entire vocabulary of the text material, and requires reducing the document descriptions to clusters at indexing time. It would be preferable to have a system that both encompasses the entire text base and does not require such clustering.
The present invention is directed to a method and apparatus for generating responses to queries with more efficient and useful location of specific, relevant information passages within a text. The method locates compact regions ("hit passages") within a text that match a query to some measurable degree, such as by including terms that match terms in the query to some extent ("(entailing) term hits"), and ranks them by the measured degree of match. The ranking procedure, referred to herein as "relaxation ranking", ranks hit passages based upon the extent to which the requirement of an exact match with the query must be relaxed in order to obtain a correspondence between the submitted query and the retrieved hit passage. The relaxation mechanism takes into account various predefined "dimensions" (measures of closeness of matches), including: word order; word adjacency; inflected or derived forms of the query terms; and semantic or inferential distance of the located terms from the query terms.
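As a rough illustration of the last two dimensions (inflected or derived forms, and semantic distance), the following Python sketch assigns a penalty to a single term match according to how far the exact-match requirement had to be relaxed. The penalty values, the lookup tables BASE_FORMS and RELATED, and the function name term_penalty are all hypothetical; they stand in for whatever morphological and semantic resources an actual embodiment would use.

```python
# Hypothetical penalty values; no numbers are given in the text above.
EXACT_PENALTY = 0.0
INFLECTED_PENALTY = 0.1
SEMANTIC_PENALTY = 0.5

# Toy tables standing in for a morphological analyzer and a semantic lexicon.
BASE_FORMS = {"washing": "wash", "washed": "wash", "cars": "car"}
RELATED = {"automobile": {"car"}, "rinse": {"wash"}}

def term_penalty(query_term, text_term):
    """Return the relaxation penalty for a term match, or None if no match."""
    q, t = query_term.lower(), text_term.lower()
    if q == t:
        return EXACT_PENALTY  # no relaxation needed
    q_base = BASE_FORMS.get(q, q)
    t_base = BASE_FORMS.get(t, t)
    if q_base == t_base:
        return INFLECTED_PENALTY  # same word, different inflected form
    if q_base in RELATED.get(t_base, ()):
        return SEMANTIC_PENALTY  # text term is semantically close to query term
    return None  # text term does not match the query term at all
```

Under these assumed values, an exact match costs nothing, a match on an inflected form costs a little, and a match that relies on semantic relatedness costs more.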
The system of the invention locates occurrences of terms (words or phrases) in the texts (document database) that are semantically similar to terms in the query, so as to identify compact regions of the texts that contain all or most of the query terms, or terms similar to them. These compact regions are ranked by a combination of: their compactness; the semantic similarity of the located phrases to the query terms; the number of query terms actually found (i.e. matched with some located term from the texts); and the relative order of occurrence of the located terms compared with the order of the corresponding query terms.
The identified compact regions are called "hit passages," and their ranking is weighted to a substantial extent based upon the physical distance separating the matching terms (compared with the distance between the corresponding terms in the query), as well as the "similarity" distance between the terms in the hit and the corresponding terms in the query.
The foregoing criteria are weighted and the located passages are ranked based upon scores generated by combining all the weights according to a predetermined procedure. "Windows" into the documents (variably sized regions around the located "hit passages") are presented to the user in an order according to the resulting ranking.
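The combination of criteria described above may be sketched, purely for illustration, as a weighted penalty in which a lower score means that less relaxation was needed. The weights W_GAP, W_MISSING and W_ORDER, the brute-force enumeration of candidate passages, and the names rank_passages and exact_match are assumptions of this sketch; the text does not disclose the actual weights or the predetermined procedure.

```python
from itertools import product

# Hypothetical weights for compactness, missing terms and word order.
W_GAP, W_MISSING, W_ORDER = 1.0, 10.0, 2.0

def exact_match(query_term, text_term):
    """Return a similarity penalty (0.0) on a match, or None on no match."""
    return 0.0 if query_term.lower() == text_term.lower() else None

def rank_passages(query_terms, tokens, match=exact_match):
    """Return (score, start, end) passages, best (lowest score) first."""
    # For each query term, collect (position, similarity_penalty) term hits.
    hits = []
    for q in query_terms:
        found = []
        for pos, tok in enumerate(tokens):
            penalty = match(q, tok)
            if penalty is not None:
                found.append((pos, penalty))
        hits.append(found)
    present = [h for h in hits if h]
    if not present:
        return []
    missing = len(hits) - len(present)  # query terms with no hit at all
    passages = []
    # Brute force: one hit per matched query term (adequate for short queries).
    for combo in product(*present):
        positions = [pos for pos, _ in combo]
        if len(set(positions)) < len(positions):
            continue  # one text token cannot serve two query terms
        start, end = min(positions), max(positions)
        # Words inside the span other than the matched terms themselves.
        gap = (end - start + 1) - len(positions)
        # Adjacent query-term pairs whose hits occur in reversed order.
        inversions = sum(1 for a, b in zip(positions, positions[1:]) if a > b)
        similarity = sum(p for _, p in combo)
        score = (W_GAP * gap + W_MISSING * missing
                 + W_ORDER * inversions + similarity)
        passages.append((score, start, end))
    return sorted(set(passages))
```

For example, given the tokens of "the red car was parked near a car wash" and the query terms "car" and "wash", the adjacent pair at token positions 7 and 8 ranks ahead of the looser passage spanning positions 2 through 8. A display window could then be drawn around each (start, end) region in ranked order.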
A significant advantage of relaxation ranking is that the system automatically generates and ranks hits that in a traditional document retrieval system would have to be found by a sequence of searches using different combinations of retrieval operators. Thus, the number of times the information seeker is unsatisfied by a result, and therefore needs to reformulate the query, is significantly reduced, and the amount of effort required to formulate the query is also significantly reduced.
Another advantage is that the rankings produced by the current system are for the most part insensitive to the size or composition of the document collection and are meaningful across a group of collections, so that term hit lists produced by searching different collections can be merged, and the ranking scores from the different collections will be commensurate. This makes it possible to parallelize and distribute the indexing and retrieval process.
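Because the scores are commensurate across collections, merging reduces to an ordinary merge of sorted lists. The following Python sketch assumes, for illustration only, that each hit is a tuple whose first element is a penalty score (lower is better) and that each collection returns its hits already sorted; the tuple layout and the name merge_hit_lists are hypothetical.

```python
import heapq

def merge_hit_lists(*ranked_lists):
    """Merge per-collection hit lists that are each sorted by ascending score."""
    # heapq.merge consumes the already-sorted inputs lazily and yields one
    # globally sorted sequence; only the score (first field) is compared.
    return list(heapq.merge(*ranked_lists, key=lambda hit: hit[0]))
```

Each collection can thus be indexed and searched independently, possibly on a separate machine, with only the final merge performed centrally.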
In addition, the system of the invention is more successful than traditional systems at locating specific, relevant passages within the retrieved documents, and summarizes and displays these passages with information generated by the relaxation ranking procedure, so that the user is informed why the passage was retrieved and can thus judge whether and how to examine the hit passage.
The present invention has proven to be particularly effective at handling short queries, such as from two to six words. Accordingly, the retrieval system of the invention may handle different queries differently, using a conventional word search mechanism for searches based upon one-word queries or queries of more than six words, and using the system of the invention for searches based upon two- to six-word queries.
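Such query-dependent handling may be sketched as a simple dispatch on query length. The function name choose_search_mode, the returned labels, and the use of whitespace to count words are assumptions of this illustration; the default thresholds are taken from the two- to six-word range mentioned above.

```python
def choose_search_mode(query, min_terms=2, max_terms=6):
    """Pick a search mechanism based on the number of words in the query."""
    n_terms = len(query.split())
    if min_terms <= n_terms <= max_terms:
        return "relaxation_ranking"
    # One-word queries and queries longer than max_terms fall back to a
    # conventional word search.
    return "conventional"
```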