The present invention is directed to a system for determining a relationship (such as similarity in meaning) between two or more textual inputs. More specifically, the present invention is directed to a system which performs improved information retrieval-type tasks by identifying clauses in documents being searched having certain predetermined characteristics.
The present invention is useful in a wide variety of applications, such as many aspects of information retrieval, including indexing, pre-query and post-query processing, document similarity/clustering, document summarization, and natural language understanding. However, the present invention will be described primarily in the context of information retrieval, for illustrative purposes only.
Generally, information retrieval is a process by which a user finds and retrieves information, relevant to the user, from a large store of information. In performing information retrieval, it is important to retrieve all of the information a user needs (i.e., it is important to be complete) and at the same time it is important to limit the irrelevant information that is retrieved for the user (i.e., it is important to be selective). These dimensions are often referred to in terms of recall (completeness) and precision (selectivity). In many information retrieval systems, it is important to achieve good performance across both the recall and precision dimensions.
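By way of illustration only, the two dimensions can be computed for a single query as sketched below. The function name `precision_recall` and the document identifiers are hypothetical and are used here only to make the definitions concrete: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents the system returned.
    relevant:  identifiers of the documents the user actually needs.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned

    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents returned, three truly relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
```

In this example two of the four returned documents are relevant (precision of one half), and two of the three relevant documents were returned (recall of two thirds), which shows how a system can score differently along each dimension.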
In some current retrieval systems, the amount of information that can be queried and searched is very large. For example, some information retrieval systems are set up to search information on the Internet, digital video discs, and other computer databases in general. Such information retrieval systems are typically embodied as, for example, Internet search engines and library catalog search engines. Further, even within the operating system of a conventional desktop computer, certain types of information retrieval mechanisms are provided. For example, some operating systems provide a tool by which a user can search all files on a given database or on a computer system based upon certain terms input by the user.
Many information retrieval techniques are known. A user-input query in such techniques is typically presented as either an explicit user-generated query, or an implicit query, such as when a user requests documents which are similar to a set of existing documents. Typical information retrieval systems search documents in a larger data store at either a single word level, or at a term level. Each of the documents is assigned a relevance (or similarity) score, and the information retrieval system presents a certain subset of the documents searched to the user (typically that subset which has a relevance score which exceeds a given threshold).
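The thresholding step described above might be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: the function name `select_results`, the scores, and the threshold value are all hypothetical, and the scores are taken as already computed by some unspecified relevance measure.

```python
def select_results(scores, threshold):
    """Keep documents whose relevance score exceeds the threshold,
    ordered from most to least relevant."""
    kept = [(doc, score) for doc, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical relevance scores for three documents.
scores = {"doc_a": 0.82, "doc_b": 0.15, "doc_c": 0.47}
results = select_results(scores, threshold=0.4)
```

Here only `doc_a` and `doc_c` would be presented to the user, in that order; `doc_b` falls below the assumed threshold.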
The rather poor precision of conventional statistical search engines stems from their assumption that words are independent variables, i.e., that words in any textual passage occur independently of each other. Independence in this context means that the conditional probability of any one word appearing in a document, given the presence of another word therein, is the same as the probability of that word appearing in the absence of the other word; i.e., a document is treated as an unstructured collection of words or, simply put, a "bag of words".
As one can readily appreciate, this assumption, with respect to any language, is grossly erroneous. Words that appear in a textual passage are simply not independent of each other. Rather, they are highly inter-dependent.
Keyword-based search engines totally ignore this fine-grained linguistic structure. For example, consider an illustrative query expressed in natural language: "How many hearts does an octopus have?" A statistical search engine, operating on the content words "hearts" and "octopus", or morphological stems thereof, might return or direct a user to a stored document that contains a recipe that has as its ingredients, and hence its content words: "artichoke hearts, squid, onion and octopus". This engine, given matches in the two content words, may determine, based on statistical measures, that this document is an excellent match. In reality, the document is quite irrelevant to the query.
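The failure described above can be reproduced with a deliberately simple bag-of-words matcher. The sketch below is illustrative only: the stopword list, the helper names `content_words` and `overlap_score`, and the scoring rule (fraction of query content words found in the document) are assumptions chosen for brevity and do not represent any specific search engine.

```python
import re

# Small, hypothetical stopword list; real systems use far larger ones.
STOPWORDS = {"how", "many", "does", "an", "a", "the", "have", "and"}


def content_words(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


def overlap_score(query, document):
    """Fraction of the query's content words that occur in the document.
    Word order and syntactic relations are ignored entirely."""
    query_words = content_words(query)
    if not query_words:
        return 0.0
    return len(query_words & content_words(document)) / len(query_words)


query = "How many hearts does an octopus have?"
recipe = "Ingredients: artichoke hearts, squid, onion and octopus"
score = overlap_score(query, recipe)
```

Because both content words of the query occur in the recipe, this matcher assigns the recipe the maximum score, even though the recipe says nothing about how many hearts an octopus has.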
The art also teaches various approaches for extracting elements of syntactic phrases which are indexed as terms in a conventional statistical vector-space model. One example of such an approach is taught in J. L. Fagan, "Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods", Ph.D. Thesis, Cornell University, 1988, pp. 1-261. Another such syntactic-based approach is described, in the context of using natural language processing for selecting appropriate terms for inclusion within search queries, in T. Strzalkowski, "Natural Language Information Retrieval: Tipster-2 Final Report", Proceedings of Advances in Text Processing: Tipster Program Phase 2, DARPA, May 6-8 1996, Tysons Corners, Va., pp. 143-148; and T. Strzalkowski, "Natural Language Information Retrieval", Information Processing and Management, Vol. 31, No. 3, 1995, pp. 397-417. A further syntactic-based approach of this sort is described in B. Katz, "Annotating the World Wide Web Using Natural Language", Conference Proceedings of R.I.A.O. 97, Computer-Assisted Information Search on Internet, McGill University, Quebec, Canada, Jun. 25-27 1997, Vol. 1, pp. 135-155.
These syntactic approaches have yielded lackluster improvements, or have not been feasible to implement in natural language processing systems available at the time. Therefore, the field has moved away from attempting to directly improve the precision and recall associated with the results of a query, and toward improvements in the user interface.
Another problem is prevalent in some information retrieval systems. For example, where documents are indexed, such as in a typical statistical search engine, the index can be very large, depending upon the content set and the number of documents to be indexed. Large indices not only present storage capacity problems, but can also increase the amount of time required to execute a query against the index.
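To make the index-size concern concrete, the sketch below builds a simple inverted index, in which each term maps to the set of documents containing it. It is only one common way of organizing such an index and is assumed here for illustration; the names `build_index` and `posting_count` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each whitespace-delimited term to the set of document ids
    in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def posting_count(index):
    """Total number of (term, document) entries stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


# Hypothetical three-document collection.
docs = {
    1: "octopus hearts",
    2: "artichoke hearts recipe",
    3: "octopus recipe",
}
index = build_index(docs)
```

Even this tiny collection yields four distinct terms and seven term-document entries. Since every distinct word in every document contributes an entry, both the storage required and the number of entries examined for a query tend to grow with the size of the content set.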