1. Field of the Invention
The present invention relates to information retrieval systems. More particularly, this invention relates to a method and apparatus for assisting an operator in forming a phrase query for searching through a library of documents.
2. Description of Related Art
Due to the ever increasing affordability and accessibility of very large, online, text collections, Information Access, the science of processing natural language texts for the purposes of search and retrieval, has been the focus of heightened attention in the last few years, although researchers have been active in the field since the early sixties. Numerous approaches have been attempted, but they all suffer from the obvious difficulty that information access is quintessentially a cognitive task. The degree of automatic language understanding required for a complete solution is clearly outside the bounds of current technology. Instead, heuristic search techniques attempt to match an admittedly incomplete query description with an admittedly incomplete set of features extracted from the texts of interest. The interest therefore lies in the development of procedures that more effectively bridge the gap between an individual's partially stated desires and a universe of text, which appears, computationally, as a sequence of uninterpreted words.
Many of these procedures are statistical in nature. They take advantage of repeated occurrences of the same word to infer relations between documents, and between queries and documents. (A "document" need not correspond to any particular organization. It might be a chapter in a book, a section within a chapter, or an individual paragraph. However, as defined herein, a set of documents forming a corpus is an exhaustive and disjoint partition of that corpus.) For example, similarity search induces a "relevance" ordering on the text collection by scoring each document with a normalized sum of importance weights assigned to each word in common between it and the query, where the importance weights depend upon document and collection, or corpus, frequencies. A more formal approach scores documents with their estimated probability of relevance to the query by adopting a text model which assumes word occurrences are sequentially uncorrelated and training on a set of known relevant documents. In contrast, polysemy (one word having multiple senses) and word correlation are directly addressed by Latent Semantic Indexing, which attempts to extract characteristic linear combinations through a singular value decomposition of a word co-occurrence matrix. The availability of interdocument similarity measures suggests clustering, which has been pursued both as an accelerator for conventional search and as a query broadening tool. Finally, linear discriminant analysis has been deployed to classify documents based on a training set which matches features, including word overlap and word positioning, with relevance to previous queries.
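By way of illustration only, the similarity search described above can be sketched as follows. The particular weighting (term frequency multiplied by a logarithmic inverse document frequency) and the normalization by the Euclidean length of the document's weight vector are assumptions chosen for the sketch; the text above requires only that the weights depend on document and corpus frequencies. The names `tokenize` and `rank_by_similarity` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase words on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def rank_by_similarity(query, documents):
    """Return (score, document index) pairs, best match first.

    Each document is scored with a normalized sum of importance weights
    over the words it shares with the query. The weight formula is an
    assumed tf * idf variant, not one prescribed by the text.
    """
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Corpus frequency: number of documents in which each word occurs.
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))

    query_words = set(tokenize(query))
    ranked = []
    for idx, tokens in enumerate(doc_tokens):
        term_freq = Counter(tokens)
        weights = {
            word: tf * math.log(1 + n_docs / doc_freq[word])
            for word, tf in term_freq.items()
        }
        # Normalize by the length of the document's weight vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        shared = [word for word in query_words if word in weights]
        score = sum(weights[word] for word in shared) / norm if norm else 0.0
        ranked.append((score, idx))

    # Highest score first; ties are broken by document order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked
```

For instance, ranking the three short documents "the dog hurt its ankle", "the cat sat on the mat", and "stock prices rose" against the query "dog ankle" places the first document at the head of the ordering, and the remaining two receive a score of zero because they share no word with the query.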
Another suite of techniques attempts to enrich the basic feature set by annotating words with their lexical and syntactic functions. For example, fast lookup algorithms from computational linguistics reduce words to their stems. See L. Karttunen et al, "A compiler for two-level phonological rules", Report CSLI-87-108, Center for the Study of Language and Information, 1987. Hidden Markov Modeling has been successfully employed to reintroduce part-of-speech tags, given a lexicon, with greater than 95% accuracy. See J. Kupiec, "Augmenting a hidden markov model for phrase-dependent word tagging", Proceedings of the 1989 DARPA Speech and Natural Language Workshop, Cape Cod, MA, October 1989. An extension of this technique, known as the inside-outside algorithm, promises a method for inducing a stochastic grammar given sufficient training text. See J. K. Baker, "Trainable grammars for speech recognition", Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547-550, 1979; and T. Fujisaki et al, "A probabilistic method for sentence disambiguation", Proceedings of the International Workshop on Parsing Technologies, August 1989. Less ambitious procedures aim at robustly extracting noun phrases given a sequence of part-of-speech tags. Word co-occurrence relations have been exploited to order alternatives in cases of lexical ambiguity. Non-parametric classification procedures have been used to detect sentence boundaries in the face of typographic ambiguity. See M. Riley, "Some applications of tree-based modeling to speech and language", Proceedings of the DARPA Speech and Natural Language Workshop, Cape Cod, Mass., pages 339-352, Oct. 1989.
A typical information access scenario involves a corpus of natural language text documents, and a user with an information need (or interest). The task is to satisfy that need (or interest), usually by delivering one or more relevant documents from the corpus of interest. This is accomplished by extracting from each document a feature set, and providing the user with a tool which allows search over these features in some prescribed fashion. For example, a standard boolean search technique assumes the feature set is one or more words extracted from the text of the document, and the query language is boolean expressions involving those words. See IBM Germany, Stuttgart, "Storage and Information Retrieval Systems (STAIRS)", April 1972. Since it is anticipated that the corpus may be very large, construction of a feature index by preprocessing each document is a standard search accelerator. See G. Salton, "Automatic Text Processing", Addison-Wesley, 1989.
Conventional search techniques are cast within a framework that might be referred to as the library automation paradigm. It is presumed that the cost for evaluating a query is sufficiently high that a single iteration must return as high quality, and as complete, a response as possible. This is in keeping with online systems that charge for connect time, and is reflected in evaluation criteria that discount the cost of query formulation and measure the precision and recall levels for the ranked set of documents which is implicitly presumed to be the result. Ironically, the best improvements to date, with respect to these criteria, come from an incremental query reformulation technique, known as relevance feedback. See G. Salton et al, "Improving Retrieval Performance by Relevance Feedback", Journal of American Society for Information Science, 41(4): 288-297, June 1990.
Boolean keyword search is a well-known search technique in information retrieval. Essentially, a set of terms, typically individual words, or word stems, is extracted from the unrestricted text of each document in a larger corpus. Search then proceeds by forming a boolean expression in terms of these keywords which is resolved by finding the set of documents that satisfy that expression. For example, a typical query might consist of the conjunction of two search terms. Documents that contain both terms in any order and any position would then be returned. Disjunction and, less frequently, negation are also likely to be supported.
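By way of illustration only, boolean keyword search over a preprocessed feature index can be sketched as follows. The sketch assumes that keywords are whole lowercase words separated by whitespace, with no stemming, and the function names `build_index`, `search_and`, `search_or`, and `search_not` are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, *terms):
    """Conjunction: documents containing every term, in any order and position."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Disjunction: documents containing at least one of the terms."""
    result = set()
    for term in terms:
        result |= index.get(term.lower(), set())
    return result


def search_not(index, all_doc_ids, term):
    """Negation: documents that do not contain the term."""
    return set(all_doc_ids) - index.get(term.lower(), set())
```

With the documents "dog ankle injury", "ankle of a dog", "cat ankle", and "dog food" numbered 0 through 3, the conjunction of "dog" and "ankle" resolves to documents 0 and 1 regardless of the order in which the two words appear.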
Unconstrained boolean search represents a document as a set of keywords; sequence information is ignored. Proximity search paradigms modify this representation by placing non-boolean nearness constraints on otherwise standard boolean queries. A proximity operator is introduced that demands that two given search terms occur within some given distance (expressed as a number of characters or a number of words) in order for the basic conjunction to be satisfied. For example, the sample query above may be narrowed by requesting that the two search terms appear within one word of each other, in any order or, alternatively, in the given order.
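By way of illustration only, a proximity operator of this kind can be sketched as follows. The sketch assumes that distance is measured as the difference between word positions, so that adjacent words are at distance one, and that words are separated by whitespace. The names `word_positions` and `near` are hypothetical.

```python
from collections import defaultdict


def word_positions(text):
    """Map each lowercase word to the list of positions at which it occurs."""
    positions = defaultdict(list)
    for i, word in enumerate(text.lower().split()):
        positions[word].append(i)
    return positions


def near(text, first, second, max_distance=1, ordered=False):
    """Return True if the two terms occur within max_distance words of each other.

    With ordered=True, `first` must also precede `second`.
    """
    positions = word_positions(text)
    for i in positions.get(first.lower(), []):
        for j in positions.get(second.lower(), []):
            gap = j - i
            if ordered:
                if 0 < gap <= max_distance:
                    return True
            elif 0 < abs(gap) <= max_distance:
                return True
    return False
```

Under these assumptions, "dog" and "ankle" satisfy a one-word constraint in "the dog ankle was sore" but not in "the ankle of a dog". Relaxing the distance to three words admits the second text in any order, while the ordered form still rejects it because "ankle" precedes "dog".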
Proximity search enables the user to form phrase-like queries; that is, a combination of terms is treated as a search unit. This assumes considerable importance when one recalls that a query is a representation of an "information need". Often the concepts inherent in this information need are not expressible as single words. Instead, phrases and even complete sentences must be employed to fully disambiguate the thought. Conjunctive boolean queries allow for the expression of these sorts of combination, but they also clearly over-generate. Higher precision is achievable by making use of nearness constraints. At the other extreme, complete specification of term order may also be detrimental, since it is a property of most natural languages (including English) that phrasal units may be rewritten in multiple ways without a change in meaning. For example, "dog's ankle" and "ankle of a dog" express the same concept. Hence, the application of proximity constraints must be strong enough to filter out disconnected occurrences, yet flexible enough to account for trivial language variations.
Text retrieval may be thought of as iterative query refinement. Each stage involves a query specification followed by a resolution. If the results of the query satisfy the information need of the user, the process ends. Otherwise, a new, modified, and presumably more appropriate query must be formulated, and the process iterated. The results of the previous steps inform the query reformulation at each stage.
Traditional applications of boolean or proximity search provide little support for this reformulation process. These traditional applications usually only generate a candidate set of documents which satisfy the search criterion. The user must then judge the effectiveness of the query by perusing these documents, a potentially time-consuming operation, especially if document titles are insufficient to disambiguate relevant from non-relevant hits. In fact, there is empirical evidence that boolean searches resolve into two classes, those whose result sets contain only a very few hits (narrow query), and those that result in a great many hits (broad query). In the case of only a few hits, the user is left with the uncomfortable feeling that something may have been missed, which leads to a desire to broaden the existing query. Arriving at an appropriate broadening can be elusive since no particular alternative is suggested by the search results themselves. If the query is broadened too far, the user is presented with far too many hits, and the task of separating out the relevant documents from the mass becomes daunting.
The problem here is that the user is provided with little or no assistance in query reformulation. H. P. Frei et al, in "Caliban: Its user-interface and retrieval algorithm", Technical Report 62, Institut fur Informatik, ETH, Zurich, April 1985, the disclosure of which is incorporated herein by reference, disclose a dictionary of available search terms which can aid the search for alternative terminology, as well as an online domain-specific thesaurus. Often, however, the help of a highly trained intermediary, such as a research librarian, is required to derive a desirable reformulation.
One solution is to provide enough information about each hit that the user can rapidly determine the contextual usage, and hence arrive at a relevance judgment and, possibly, a reformulation, without necessarily scanning the entire text. Paper-based keyword-in-context indices (sometimes known as permuted indices) offer a solution for the case of single term queries. The user enters the index with a single term, the word of interest, and finds, arranged alphabetically, single lines of context for each instance of that term, with the search term columnated at the center of each line. See H. P. Luhn, "Keyword-in-Context for Technical Literature", ASDD Report RC-127, IBM Corporation, Yorktown Heights, N.Y., August, 1959. The choice of an alphabetic sort key for the lines of context can be less than optimal if one is searching for related phrases containing the search key, since the words determining the phrase will typically be placed around the key, rather than heading the line. An alternative sort key that captures this intuition has been employed successfully in the generation of an index to titles in Statistics and Probability. See I. C. Ross and J. W. Tukey, "Index to Statistics and Probability Permuted Titles", Volumes 3 and 4 of the Information Access Series, R & D Press, Los Altos, Calif., 1975; also available from the American Mathematical Society, Providence, R.I. Computerized versions of these sorts of indices exist in a variety of different forms, yet few, if any, elaborate on the basic query and display strategy.
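By way of illustration only, a computerized keyword-in-context listing can be sketched as follows. The sketch sorts the lines of context by the words that follow the keyword; this is one possible sort key that keeps related phrases together, chosen here as an assumption, and is not necessarily the key used in the index cited above. The names `kwic_entries` and `format_kwic` and the default context width are hypothetical.

```python
def kwic_entries(documents, keyword, width=30):
    """Collect (left context, keyword, right context, document id) tuples.

    Entries are sorted by the text following the keyword, then by the
    text preceding it, ignoring case.
    """
    keyword = keyword.lower()
    entries = []
    for doc_id, text in enumerate(documents):
        words = text.split()
        for i, word in enumerate(words):
            if word.lower() == keyword:
                # Keep at most `width` characters of context on each side.
                left = " ".join(words[:i])[-width:]
                right = " ".join(words[i + 1:])[:width]
                entries.append((left, word, right, doc_id))
    entries.sort(key=lambda e: (e[2].lower(), e[0].lower()))
    return entries


def format_kwic(entries, width=30):
    """Render one line per entry with the keyword starting in the same column."""
    return [
        f"{left:>{width}} {word} {right}".rstrip()
        for left, word, right, _doc_id in entries
    ]
```

Applied to the titles "information retrieval systems", "boolean retrieval of text", and "retrieval by proximity" with the keyword "retrieval", the entries are ordered by the continuations "by proximity", "of text", and "systems", and every formatted line places the keyword in the same column so that the surrounding phrases can be scanned at a glance.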
Accordingly, a need exists for a search tool which will assist an operator in formulating a search query, particularly when the operator has little information about the corpus of documents which the operator is searching.
U.S. Pat. No. 4,823,306 to Barbic et al discloses a method for retrieving, from a library of documents, documents that match the content of a sequence of query words, and for assigning a relevance factor to each retrieved document. The method comprises the steps of: defining a set of equivalent words for each query word and assigning to each equivalent word a corresponding word equivalence value; locating target sequences of words in a library document that match the sequence of query words in accordance with a set of matching criteria; evaluating a similarity value for each of the target sequences of words as a function of the corresponding equivalence values of words included therein; and obtaining a relevance factor for the library document based upon the similarity values of its target sequences.
U.S. Pat. No. 4,972,349 to Kleinberger discloses an interactive, iterative information retrieval and analysis system wherein a "table of contents", organized as a standard outline or some similarly graphic format, is dynamically generated in response to specific search requests. Documents satisfying the search request are categorized based upon the existence of predefined key words therein. The table of contents is organized into key word categories, sub-categories, sub-sub-categories, etc. The analysis process can be repeated for a specific category or sub-category of the table of contents to derive a new table of contents which is more focused and limited.