This invention relates to information retrieval, and particularly to document retrieval from a computer database. More particularly, the invention concerns a method and apparatus for creating a search query in natural language for use in an inference network for document identification and retrieval purposes.
Presently, document retrieval is most commonly performed through use of Boolean search queries to search the texts of documents in the database. These retrieval systems specify strategies for evaluating documents with respect to a given query by logically comparing search queries to document texts. One of the problems associated with text searching is that, for a single natural language description of an information need, different researchers will formulate different Boolean queries to represent that need. Because the queries are different, document ranking will be different for each search, thereby resulting in different documents being retrieved.
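The divergence between Boolean formulations of one information need can be illustrated with a minimal sketch. The three document texts, the `matches` and `search` helpers, and both queries below are hypothetical and are not drawn from any particular retrieval system.

```python
# Toy collection; the document texts are invented for illustration.
DOCS = {
    1: "landlord liable for injury to tenant on leased premises",
    2: "tenant injured by defective stairs sues property owner",
    3: "landlord and tenant dispute over unpaid rent",
}

def matches(text, all_of=(), any_of=()):
    """True if text has every term in all_of and, if any_of is given, at least one of its terms."""
    words = set(text.split())
    has_all = all(term in words for term in all_of)
    has_any = not any_of or any(term in words for term in any_of)
    return has_all and has_any

def search(all_of=(), any_of=()):
    """Return the ids of the documents that satisfy the Boolean query."""
    return sorted(d for d, text in DOCS.items() if matches(text, all_of, any_of))

# Two plausible renderings of "landlord liability for tenant injuries".
query_a = search(all_of=("landlord", "tenant", "injury"))          # landlord AND tenant AND injury
query_b = search(all_of=("tenant",), any_of=("injury", "injured"))  # tenant AND (injury OR injured)
```

In this toy collection the first query retrieves only document 1, while the second also retrieves document 2, although both were written to express the same need.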
More recently, hypertext databases have been developed which emphasize flexible organizations of multimedia "nodes" through connections made with user-specified links and interfaces which facilitate browsing in the network. Early networks employed query-based retrieval strategies to form a ranked list of candidate "starting points" for hypertext browsing. Some systems employed feedback during browsing to modify the initial query and to locate additional starting points. Network structures employing hypertext databases have used automatically and manually generated links between documents and the concepts or terms that are used to represent their content. For example, "document clustering" employs links between documents that are automatically generated by comparing similarities of content. Another technique is "citations" wherein documents are linked by comparing similar citations in them. "Term clustering" and "manually-generated thesauri" provide links between terms, but these have not been altogether suitable for document searching on a reliable basis.
Deductive databases have been developed employing facts about the nodes, and current links between the nodes. A simple query in a deductive database, where N is the only free variable in formula W, is of the form {N | W(N)}, which is read as "Retrieve all nodes N such that W(N) can be shown to be true in the current database." However, deductive databases have not been successful in information retrieval. Particularly, uncertainty associated with natural language affects the deductive database, including the facts, the rules, and the query. For example, a specific concept may not be an accurate description of a particular node; some rules may be more certain than others; and some parts of a query may be more important than others. For a more complete description of deductive databases, see Croft et al. "A Retrieval Model for Incorporating Hypertext Links", Hypertext '89 Proceedings, pp 213-224, November 1989 (Association for Computing Machinery), incorporated herein by reference.
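A deductive query of this form can be sketched as a filter over the nodes of the database. The facts in `ABOUT`, the single link in `LINKS`, and the rule encoded in `w` are assumptions chosen for illustration; they are not taken from the Croft et al. article.

```python
# Facts: the concepts asserted about each node, and directed links between nodes.
ABOUT = {"n1": {"negligence"}, "n2": {"contract"}, "n3": set()}
LINKS = {("n3", "n1")}  # n3 links to n1

def w(node, concept):
    """W(N): N is about the concept, or N links to a node that is about it."""
    if concept in ABOUT[node]:
        return True
    return any(src == node and concept in ABOUT[dst] for src, dst in LINKS)

def retrieve(concept):
    """Evaluate the query: every node N for which W(N) can be shown true."""
    return sorted(n for n in ABOUT if w(n, concept))
```

Here `retrieve("negligence")` returns both `n1` and `n3`. The answer is all-or-nothing: the sketch has no means of recording that the link-based rule is less certain than the direct fact, which is the limitation noted above.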
A Bayesian network is one which employs nodes to represent the document and the query. If a proposition represented by a parent node directly implies the proposition represented by a child node, an implication line is drawn between the two nodes. If-then rules of Bayesian networks are interpreted as conditional probabilities. Thus, a rule A→B is interpreted as a probability P(B|A), and the line connecting A with B is logically labeled with a matrix that specifies P(B|A) for all possible combinations of values of the two nodes. The set of matrices pointing to a node characterizes the dependence relationship between that node and the nodes representing propositions naming it as a consequence. For a given set of prior probabilities for roots of the network, the compiled network is used to compute the probability or degree of belief associated with the remaining nodes.
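For a single rule with one parent and one child, the belief computation reduces to summing the link matrix over the parent's values, weighted by the parent's prior. The following minimal sketch assumes binary nodes; the probabilities in `LINK` are invented for illustration.

```python
# Link matrix for the rule A -> B: LINK[a][b] = P(B=b | A=a).
LINK = {
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.2, False: 0.8},
}

def belief_in_child(prior_a_true):
    """P(B=true) = sum over a of P(B=true | A=a) * P(A=a)."""
    prior = {True: prior_a_true, False: 1.0 - prior_a_true}
    return sum(LINK[a][True] * prior[a] for a in (True, False))
```

With a prior of 0.5 on the root A, the belief in B is 0.9 × 0.5 + 0.2 × 0.5 = 0.55. A node with several parents would need a matrix over every combination of parent values, which this sketch does not attempt.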
An inference network is one which is based on plausible, or non-deductive, inference. One such network employs a Bayesian network, described by Turtle et al. in "Inference Networks for Document Retrieval", SIGIR 90, pp 1-24, September 1990 (Association for Computing Machinery), incorporated herein by reference. The Bayesian inference network described in the Turtle et al. article comprises a document network and a query network. The document network represents the document collection and employs document nodes, text representation nodes and content representation nodes. A document node corresponds to an abstract document rather than to any specific representation of it, whereas a text representation node corresponds to a specific text representation of the document. A set of content representation nodes corresponds to a single representation technique which has been applied to the documents of the database.
The query network of the Bayesian inference network described in the Turtle et al. article employs an information node identifying the information need, and a plurality of concept nodes corresponding to the concepts that express that information need. A plurality of intermediate query nodes may also be employed where multiple queries are used to express the information requirement.
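The node types named above can be sketched as a small directed graph. The `Node` class, the `link` and `ancestors` helpers, and the way the query network is attached to the document network (a concept node taking a content representation node as its parent) are all assumptions made for illustration; the sketch models only the structure and computes no beliefs.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    name: str
    kind: str  # "document", "text", "content", "concept", "query" or "information_need"
    parents: list = field(default_factory=list)

def link(parent, child):
    """Draw an implication line from parent to child."""
    child.parents.append(parent)

def ancestors(node):
    """All nodes reachable by following parent links upward from node."""
    seen, stack = [], list(node.parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(current.parents)
    return seen

# Document network: document -> text representation -> content representation.
doc = Node("d1", "document")
text = Node("t1", "text")
content = Node("r1", "content")
link(doc, text)
link(text, content)

# Query network: concept -> intermediate query -> information need.
concept = Node("c1", "concept")
query = Node("q1", "query")
info = Node("I", "information_need")
link(content, concept)  # assumed attachment point between the two networks
link(concept, query)
link(query, info)
```

Walking upward from the information node with `ancestors(info)` reaches the query, concept, content representation, text representation and document nodes in turn, which shows why the concept nodes of the query network must correspond to nodes of the document network.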
The Bayesian inference network described in the Turtle et al. article has been quite successful for general purpose databases. However, it has been difficult to formulate the query network to develop nodes which conform to the document network nodes. More particularly, the inference network described in the Turtle et al. article did not use domain-specific knowledge bases to recognize phrases, such as the specialized terms and jargon traditionally associated with particular professions, for example law or medicine.
For a more general discussion concerning inference networks, reference may be made to Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference by J. Pearl, published by Morgan Kaufmann Publishers, Inc., San Mateo, Calif., 1988, and to Probabilistic Reasoning in Expert Systems by R. E. Neapolitan, John Wiley & Sons, New York, N.Y., 1990.
Prior techniques for recognizing phrases in an input query employed syntactic and statistical analysis and manual selection techniques. The present invention employs an automated domain-specific knowledge based system to recognize phrases.