This invention relates to information retrieval, and particularly to document retrieval from a computer database using probability techniques. More particularly, the invention concerns a method and apparatus for establishing probability thresholds in probabilistic information retrieval systems and for estimating representation frequencies in document databases for representations having no pre-computed frequency.
There are, in theory, two categories of information retrieval systems: algebraic systems and probabilistic systems. Algebraic systems logically match terms and their positions in stored information (such as a document) to terms in a query; Boolean systems are examples of algebraic systems. Probabilistic systems match representations (concepts) in stored information to concepts in a query, retrieving information based on probabilities rather than algebraic or Boolean logic.
Presently, document retrieval is most commonly performed through use of Boolean search queries to search the texts of documents in the database. These retrieval systems specify strategies for evaluating documents with respect to a given query by logically comparing search queries to document texts. One of the problems associated with text searching is that for a single natural language description of an information need, different Boolean researchers will formulate different Boolean queries to represent that need. Because the queries are different, different documents will be retrieved for each search.
Another difficulty with Boolean systems is that all documents meeting the query are retrieved, regardless of number. If an unmanageable number of documents are retrieved, the searcher must reformulate the search query to more narrowly define the information need, thereby narrowing the retrieved documents to a more manageable number. However, in narrowing the search, the researcher risks missing relevant documents partially meeting the information need. Moreover, Boolean systems will not retrieve documents only partially meeting the query, which themselves are often important secondary documents to the query.
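By way of illustration only, the all-or-nothing behavior of a Boolean conjunction may be contrasted with a simple partial-match ranking. The following is a minimal sketch, not a description of any particular commercial system; the sample documents, the function names `boolean_and` and `partial_match_rank`, and the scoring rule (fraction of query terms present) are assumptions chosen for the example.

```python
# Hypothetical three-document collection used only for this illustration.
docs = {
    "d1": "independent contractor liability for negligence",
    "d2": "contractor negligence",
    "d3": "employee liability",
}

def tokens(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())

def boolean_and(query_terms, docs):
    """Return ids of documents that contain every query term."""
    required = set(query_terms)
    return sorted(d for d, text in docs.items() if required <= tokens(text))

def partial_match_rank(query_terms, docs):
    """Rank documents by the fraction of query terms they contain.

    Documents matching no query term are dropped. This scoring rule is an
    assumption for the sketch, not a probabilistic retrieval model.
    """
    required = set(query_terms)
    scored = []
    for d, text in docs.items():
        score = len(required & tokens(text)) / len(required)
        if score > 0:
            scored.append((d, score))
    # Highest score first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

query = ["contractor", "negligence", "liability"]
boolean_hits = boolean_and(query, docs)
ranked = partial_match_rank(query, docs)
```

In this sketch the Boolean conjunction retrieves only the document containing all three terms, whereas the ranking also surfaces the documents that satisfy the query in part, ordered by how much of the query they match.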
More recently, probabilistic systems employing hypertext databases have been developed which emphasize flexible organizations of multimedia "nodes" through connections made with user-specified links and interfaces which facilitate browsing in the network. Early networks employed query-based retrieval strategies to form a ranked list of candidate "starting points" for hypertext browsing. Some systems employed feedback during browsing to modify the initial query and to locate additional starting points. Network structures employing hypertext databases have used automatically and manually generated links between documents and the concepts or terms that are used to represent their content. For example, "document clustering" employs links between documents that are automatically generated by comparing similarities of content. Another technique is "citations" wherein documents are linked by comparing similar citations in them. "Term clustering" and "manually-generated thesauri" provide links between terms, but these have not been altogether suitable for document searching on a reliable basis.
Deductive databases have been developed employing facts about the nodes, and current links between the nodes. A simple query in a deductive database, where N is the only free variable in formula W, is of the form {N | W(N)}, which is read as "Retrieve all nodes N such that W(N) can be shown to be true in the current database." However, deductive databases have not been successful in information retrieval. Particularly, uncertainty associated with natural language affects the deductive database, including the facts, the rules, and the query. For example, a specific concept may not be an accurate description of a particular node; some rules may be more certain than others; and some parts of a query may be more important than others. For a more complete description of deductive databases, see Croft et al., "A Retrieval Model for Incorporating Hypertext Links", Hypertext '89 Proceedings, pp. 213-224, November 1989 (Association for Computing Machinery), incorporated herein by reference.
A Bayesian network is a probabilistic network which employs nodes to represent the document and the query. If a proposition represented by a parent node directly implies the proposition represented by a child node, an implication line is drawn between the two nodes. If-then rules of Bayesian networks are interpreted as conditional probabilities. Thus, a rule A → B is interpreted as a probability P(B|A), and the line connecting A with B is logically labeled with a matrix that specifies P(B|A) for all possible combinations of values of the two nodes. The set of matrices pointing to a node characterizes the dependence relationship between that node and the nodes representing propositions naming it as a consequence. For a given set of prior probabilities for roots of the network, the compiled network is used to compute the probability or degree of belief associated with the remaining nodes.
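By way of illustration only, the role of the matrix labeling a single link may be sketched for two binary nodes A (parent) and B (child). Given a prior for A and a matrix specifying P(B|A) for every combination of values, the belief in B follows by summing over the values of A. The numerical values, the dictionary layout, and the function name `child_belief` are assumptions made for this example.

```python
def child_belief(prior_a_true, link_matrix):
    """Compute the belief in child node B from one parent node A.

    link_matrix[a][b] holds P(B=b | A=a) for a, b in {False, True}.
    Returns a dict mapping each value b to P(B=b), obtained by
    marginalizing over A: P(B=b) = sum over a of P(B=b | A=a) * P(A=a).
    """
    prior = {True: prior_a_true, False: 1.0 - prior_a_true}
    return {
        b: sum(link_matrix[a][b] * prior[a] for a in (False, True))
        for b in (False, True)
    }

# Hypothetical link matrix: each row (fixed value of A) sums to 1.
link = {
    True:  {True: 0.9, False: 0.1},   # P(B | A=True)
    False: {True: 0.2, False: 0.8},   # P(B | A=False)
}

belief_b = child_belief(0.3, link)
```

With a prior P(A=True) of 0.3, the belief in B being true is 0.9 × 0.3 + 0.2 × 0.7 = 0.41. A node with several parents would require a matrix over all combinations of parent values; this sketch covers only the single-parent case.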
An inference network is one which is based on a plausible or non-deductive inference. One such network employs a Bayesian network, described by Turtle et al. in "Inference Networks for Document Retrieval" SIGIR 90, pp. 1-24 Sep. 1990 (Association for Computing Machinery), incorporated herein by reference. The Bayesian inference network described in the Turtle et al. article comprises a document network and a query network. The document network represents the document collection and employs document nodes, text representation nodes and content representation nodes. A document node corresponds to abstract documents rather than their specific representations, whereas a text representation node corresponds to a specific text representation of the document. A set of content representation nodes corresponds to a single representation technique which has been applied to the documents of the database.
The query network of the Bayesian inference network described in the Turtle et al. article employs an information node identifying the information need, and a plurality of concept nodes corresponding to the concepts that express that information need. A plurality of intermediate query nodes may also be employed where multiple queries are used to express the information requirement.
The Bayesian inference network described in the Turtle et al. article has been quite successful for small, general purpose databases. However, it has been difficult to formulate the query network to develop nodes which conform to the document network nodes. More particularly, the inference network described in the Turtle et al. article did not use domain-specific knowledge bases to recognize phrases, such as the specialized jargon traditionally associated with specific professions like law or medicine.
One important aspect of probabilistic retrieval networks, such as a Bayesian inference network, is the identification of the frequency of occurrence of a representation in each document and in the entire document collection. A representation that occurs frequently in a document is more likely to be a good descriptor of that document's content. A representation that occurs infrequently in the collection is more likely to be a good discriminator than one that occurs in many documents. Consequently, when creating a database for a probabilistic network, care is taken to identify the representations (content concepts) in the documents, as well as their frequencies. However, it is not always possible to identify certain representations (such as phrases, proximities and thesaurus or synonym classes) or their frequencies when creating the database. More particularly, phrases are usually composed of multiple words which themselves are individual concepts or representations. The concept or representation of a phrase might be different from the concepts or representations of the individual words forming the phrase. For example, the phrase "independent contractor" is a different concept from either of the constituent words "independent" and "contractor". Since it is not always possible to identify all possible phrases, or their frequencies of occurrence, during creation of the database, the use of phrases as matching terms in probabilistic networks has not been altogether successful. Proximities (such as citations) and thesaurus and synonym classes have likewise not been successful identifiers because of the inability to identify all synonyms, proximities and thesaurus classes during creation of the database or to pre-assign their frequencies.
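By way of illustration only, the distinction between the frequency of a phrase and the frequencies of its constituent words may be sketched by direct counting over a small collection. The sample documents, the function names, and the weighting log(N/df) (a common textbook form of inverse document frequency) are assumptions for this example; exhaustive scanning of this kind is shown only to make the quantities concrete and is not presented as a practical technique for large collections.

```python
import math

# Hypothetical collection used only for this illustration.
docs = {
    "d1": "the independent contractor hired another independent contractor",
    "d2": "an independent review of the contractor",
    "d3": "the employee filed a claim",
}

def term_freq(term, text):
    """Number of times a single word occurs in a document."""
    return text.lower().split().count(term.lower())

def phrase_freq(phrase, text):
    """Number of times the words of a phrase occur adjacently and in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

def doc_freq(count_fn, representation, docs):
    """Number of documents in which the representation occurs at least once."""
    return sum(1 for text in docs.values() if count_fn(representation, text) > 0)

def idf(df, n_docs):
    """Assumed weighting: rarer representations receive larger values."""
    return math.log(n_docs / df) if df else 0.0

df_word = doc_freq(term_freq, "independent", docs)
df_phrase = doc_freq(phrase_freq, "independent contractor", docs)
idf_word = idf(df_word, len(docs))
idf_phrase = idf(df_phrase, len(docs))
```

In this sketch the word "independent" appears in two documents while the phrase "independent contractor" appears in only one, so the phrase is the better discriminator even though both of its constituent words occur in the second document.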
Techniques have been developed to identify phrases, synonyms, proximities and thesaurus classes as concepts in the query, and to find phrases, synonyms, proximities and thesaurus classes as representations in the documents. However, no satisfactory technique exists for identifying the frequencies of occurrence of representations in the documents and in the collection when the document collection is large and the frequencies of occurrence are not included in the database.
Another difficulty with probabilistic networks is that for large databases, for example databases containing about one-half million documents or more, the processing resources required to evaluate a query have been too great to be commercially feasible. More particularly, probabilistic networks have required that all representations for all documents in the collection containing at least one query term be examined against all of the concepts in the query. Hence, probabilistic networks have required extensive computing resources. While such computing resources might be reasonable for small collections of documents, they have not been reasonable for large databases. There is, accordingly, a need to improve the processing of probabilistic networks to more efficiently employ the processing resources.
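By way of illustration only, the growth in evaluation work may be sketched with a small inverted index. Every document containing at least one query term becomes a candidate, and each candidate is examined against every query concept. The sample documents, the function names, and the use of the product of candidates and concepts as a rough count of work are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical collection used only for this illustration.
docs = {
    "d1": "contractor negligence",
    "d2": "employee negligence",
    "d3": "tax law",
    "d4": "contractor tax",
}

def build_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def candidates(index, query_terms):
    """Documents containing at least one query term (union of postings)."""
    result = set()
    for term in query_terms:
        result |= index.get(term, set())
    return result

index = build_index(docs)
query = ["contractor", "negligence"]
cands = candidates(index, query)

# Rough count of work: every candidate is examined against every concept.
evaluations = len(cands) * len(query)
```

Here a two-term query already makes three of the four documents candidates. Because the candidate set is a union over the query terms, a single common term is enough to pull a large share of a collection into the evaluation.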
For a more general discussion concerning inference networks, reference may be made to Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference by J. Pearl, published by Morgan Kaufmann Publishers, Inc., San Mateo, Calif., 1988, and to Probabilistic Reasoning in Expert Systems by R. E. Neapolitan, John Wiley & Sons, New York, N.Y., 1990.