The present invention is directed to a system for determining a relationship (such as similarity in meaning) between two or more textual inputs. More specifically, the present invention is directed to a system which performs improved information retrieval-type tasks by identifying relations of constituents of documents being searched.
The present invention is useful in a wide variety of applications, such as many aspects of information retrieval including indexing, pre-query and post-query processing, document similarity/clustering, document summarization, natural language understanding, etc. However, the present invention will be described primarily in the context of information retrieval, for illustrative purposes only.
Generally, information retrieval is a process by which a user finds and retrieves, from a large store of information, information that is relevant to the user. In performing information retrieval, it is important to retrieve all of the information a user needs (i.e., it is important to be complete) and, at the same time, to limit the irrelevant information that is retrieved for the user (i.e., it is important to be selective). These dimensions are often referred to in terms of recall (completeness) and precision (selectivity). In many information retrieval systems, it is important to achieve good performance across both the recall and precision dimensions.
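For concreteness, both measures can be computed from the set of documents a system retrieves and the set of documents actually relevant to the user, as in the following illustrative sketch (the document identifiers are hypothetical and serve only to demonstrate the arithmetic):

def recall_and_precision(retrieved, relevant):
    """Compute recall (completeness) and precision (selectivity)
    from the set of retrieved documents and the set of relevant ones."""
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical example: ten relevant documents, eight retrieved, six in common.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
relevant = {"d1", "d2", "d3", "d4", "d5", "d6", "d9", "d10", "d11", "d12"}
print(recall_and_precision(retrieved, relevant))  # (0.6, 0.75)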
In some current retrieval systems, the amount of information that can be queried and searched is very large. For example, some information retrieval systems are set up to search information on a global computer network (such as the Internet), digital video discs, and other computer databases in general. Such information retrieval systems are typically embodied as, for example, Internet search engines and library catalog search engines. Further, even within the operating system of a conventional desktop computer, certain types of information retrieval mechanisms are provided. For example, some operating systems provide a tool by which a user can search all files on a given database, or on a computer system generally, based upon certain terms input by the user.
Many information retrieval techniques are known. In such techniques, a query is typically presented either as an explicit user-generated query or as an implicit query, such as when a user requests documents which are similar to a set of existing documents. Typical information retrieval systems search documents in a larger data store at either a single word level or at a term level. Each of the documents is assigned a relevance (or similarity) score, and the information retrieval system presents a certain subset of the documents searched to the user, typically the subset whose relevance scores exceed a given threshold.
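The word-level scoring and thresholding just described can be sketched as follows. The scoring function used here (the fraction of query terms a document contains) is a deliberately simple stand-in for the statistical measures such systems actually employ:

def score(query_terms, document):
    """Score a document by the fraction of query terms it contains
    (a deliberately simple word-level relevance measure)."""
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms)

def retrieve(query, documents, threshold=0.5):
    """Return the subset of documents whose relevance score exceeds the threshold."""
    query_terms = set(query.lower().split())
    return [d for d in documents if score(query_terms, d) > threshold]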
The rather poor precision of conventional statistical search engines stems from their assumption that words are independent variables (i.e., that words in any textual passage occur independently of each other). Independence in this context means that the conditional probability of any one word appearing in a document, given the presence of another word therein, is simply equal to its unconditional probability; the presence of one word reveals nothing about the presence of any other (i.e., a document simply contains an unstructured collection of words or, simply put, a “bag of words”).
As one can readily appreciate, this assumption, with respect to any language, is grossly erroneous. Words that appear in a textual passage are simply not independent of each other. Rather, they are highly inter-dependent.
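The “bag of words” abstraction can be made concrete with a short sketch: under it, two passages with quite different meanings, but the same words, receive identical representations (the example sentences are, of course, merely illustrative):

from collections import Counter

def bag_of_words(text):
    """Represent a passage as an unordered multiset of its words,
    discarding all word order and structure."""
    return Counter(text.lower().split())

# Two sentences with opposite meanings yield the same bag of words.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # True: the model cannot tell them apart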
Keyword-based search engines totally ignore this fine-grained linguistic structure. For example, consider an illustrative query expressed in natural language: “How many hearts does an octopus have?” A statistical search engine, operating on the content words “hearts” and “octopus”, or morphological stems thereof, might well return or direct a user to a stored document that contains a recipe having as its ingredients, and hence its content words: “artichoke hearts, squid, onion and octopus”. This engine, given matches on the two content words, may determine, based on statistical measures, that this document is an excellent match. In reality, the document is quite irrelevant to the query.
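This failure mode can be reproduced with a naive content-word matcher like the one sketched below; because the query’s content words are fully contained in the recipe’s content words, the recipe scores as a perfect match:

def content_word_match(query_words, doc_words):
    """Fraction of query content words found in the document."""
    return len(query_words & doc_words) / len(query_words)

query = {"hearts", "octopus"}  # content words of the query
recipe = {"artichoke", "hearts", "squid", "onion", "octopus"}
print(content_word_match(query, recipe))  # 1.0 -- a "perfect" score for an irrelevant document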
The art also teaches various approaches for extracting elements of syntactic phrases which are indexed as terms in a conventional statistical vector-space model. One example of such an approach is taught in J. L. Fagan, “Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods”, Ph.D. Thesis, Cornell University, 1988, pp. 1-261. Another such syntactic-based approach is described, in the context of using natural language processing for selecting appropriate terms for inclusion within search queries, in T. Strzalkowski, “Natural Language Information Retrieval: Tipster-2 Final Report”, Proceedings of Advances in Text Processing: Tipster Program Phase 2, DARPA, 6-8 May 1996, Tysons Corner, Va., pp. 143-148; and T. Strzalkowski, “Natural Language Information Retrieval”, Information Processing and Management, Vol. 31, No. 3, 1995, pp. 397-417. A further syntactic-based approach of this sort is described in B. Katz, “Annotating the World Wide Web Using Natural Language”, Conference Proceedings of RIAO 97, Computer-Assisted Information Search on Internet, McGill University, Quebec, Canada, 25-27 Jun. 1997, Vol. 1, pp. 135-155.
These syntactic approaches yielded lackluster improvements, or were not feasible to implement with the natural language processing systems available at the time. The field has therefore moved away from attempting to improve the precision and recall of query results directly, and toward improvements in the user interface.
Another problem is also prevalent in some information retrieval systems. For example, where documents are indexed, such as in a typical statistical search engine, the index can be very large, depending upon the content set and the number of documents to be indexed. Large indices not only present storage capacity problems, but can also increase the amount of time required to execute a query against the index.
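Indexing of this sort is commonly implemented as an inverted index mapping each term to the documents containing it, and the sketch below suggests why the structure grows with vocabulary and collection size (the two documents are hypothetical):

from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document ids containing it.
    For a large content set this mapping itself becomes large,
    and query time grows with the size of the posting sets."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

docs = {"d1": "octopus hearts", "d2": "artichoke hearts and onion"}
index = build_inverted_index(docs)
print(sorted(index["hearts"]))  # ['d1', 'd2']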
The term “grammatical relations” is used to denote subject, object, and other constituents that can be identified on the basis of a syntactic analysis. Linguists recognize that grammatical relations are not all of equal status. For example, Keenan and Comrie have developed a summary of how different languages mark the positions, within a domain of relativization, that are assumed by noun phrases. The summary is referred to as the Accessibility Hierarchy (or hierarchy of accessibility) and is described as follows.
Topic (optional) > subject > direct object > indirect object > object of preposition or postposition > genitive (possessor) > object of comparison.
The hierarchy of accessibility illustrates the generalization that the lower a noun phrase’s position is on the hierarchy, the less likely it is that the noun phrase will be expressed by a relative pronoun.
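Since the hierarchy is simply an ordered ranking, it can be encoded directly as an ordered list, as in the following sketch (offered only as an illustration of the ranking, not as a component of the invention):

# The Keenan-Comrie Accessibility Hierarchy as an ordered list;
# a lower index means a higher (more accessible) position.
ACCESSIBILITY_HIERARCHY = [
    "topic",                 # optional
    "subject",
    "direct object",
    "indirect object",
    "object of adposition",  # preposition or postposition
    "genitive",              # possessor
    "object of comparison",
]

def rank(relation):
    """Return the position of a grammatical relation on the hierarchy
    (0 is the highest, most accessible position)."""
    return ACCESSIBILITY_HIERARCHY.index(relation)

print(rank("subject") < rank("genitive"))  # True: subject outranks genitive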
The accessibility hierarchy, and how it is obtained, is described in greater detail in “Language Typology and Syntactic Description, Complex Constructions”, Chapter 3, written by Edward L. Keenan, edited by Timothy Shopen, 1985; and Keenan, E. L. and B. Comrie, “Noun Phrase Accessibility and Universal Grammar”, Linguistic Inquiry 8: 63-99 (1977).
It is also worth noting that some languages make extensive use of what linguists broadly refer to as “cases”. The English language still contains vestiges of an earlier case system. For example, in the pronominal system, the English language distinguishes subject versus object versus genitive with terms such as he, him and his. While linguists have devoted a great deal of time and effort to attempting to distinguish case from thematic role from other kinds of marking, the term case, as discussed herein, is used in the following two senses:
1. To describe morphological inflection, which typically involves changing the endings of words. German, Russian and Latin are examples of languages which exhibit morphological case.
2. To describe the use of adpositions (prepositions and postpositions) or particles to indicate the grammatical role of a noun phrase. Japanese and Indonesian are examples of languages which exhibit case information of this type. A discussion of grammatical relations and surface case is set out in Shibatani, “Grammatical Relations and Surface Cases”, Language, Volume 53, Number 4 (1977), pp. 789-809. Also, a discussion of grammatical function and morphological case is set out in Maling, “Of Nominative and Accusative: The Hierarchical Assignment of Grammatical Case in Finnish”, published in A. Holmberg and U. Nikanne, Case and Other Topics in Finnish Syntax, Studies in Generative Grammar, Foris (1992), pp. 51-76. In this patent, the term “relations” will be used to refer to both cases and grammatical relations.
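The second, adpositional sense of case can be illustrated with Japanese particles, which signal the grammatical role of the noun phrase they follow. The mapping below is a deliberately coarse, hypothetical sketch (actual particle usage is considerably more nuanced):

# Hypothetical, simplified mapping from Japanese case particles to relation labels.
JAPANESE_PARTICLES = {
    "ga": "subject",          # ga typically marks the subject
    "o": "direct object",     # o marks the direct object
    "ni": "indirect object",  # ni often marks the indirect object
    "no": "genitive",         # no marks the possessor
}

def relation_of(particle):
    """Map a surface case particle to the grammatical relation it signals."""
    return JAPANESE_PARTICLES.get(particle, "unknown")

print(relation_of("ga"))  # 'subject'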