Text is the most significant repository of human knowledge, and many algorithms have been proposed to mine textual data. These include algorithms that focus on document clustering (Larsen, B. and Aone, C. 1999. Fast and effective text mining using linear-time document clustering. In Proceedings of KDD-99. pp. 16–22. San Diego, Calif.), identifying prototypical documents (Rajman, M. and Besançon, R. 1997. Text Mining: Natural Language Techniques and Text Mining Applications. In Proceedings of the seventh IFIP 2.6 Working Conference on Database Semantics (DS-7).), or finding term associations (Lin, S. H., Shih, C. S., Chen, M. C., et al. 1998. Extracting Classification Knowledge of Internet Documents with Mining Term Associations: A Semantic Approach. In Proceedings of SIGIR-98. Melbourne, Australia.) and hyponym relationships (Hearst, M. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of ACL-92. Nantes, France.). This invention addresses an unsupervised method for discovering inference rules, such as “X is author of Y≈X wrote Y”, “X solved Y≈X found a solution to Y”, and “X caused Y≈Y is triggered by X”. Inference rules are extremely important in many fields such as natural language processing, information retrieval, and artificial intelligence in general.
For example, consider the query to an information retrieval system: “Who is the author of the ‘Star Spangled Banner’?” Unless the system recognizes the relationship between “X wrote Y” and “X is the author of Y”, it would not necessarily rank the sentence
    . . . Francis Scott Key wrote the “Star Spangled Banner” in 1814.
higher than the sentence
    . . . comedian-actress Roseanne Barr sang her famous shrieking rendition of the “Star Spangled Banner” before a San Diego Padres-Cincinnati Reds game.
The statement “X wrote Y≈X is the author of Y” is referred to in this patent document as an inference rule. In previous work, such relationships have been referred to as paraphrases or variants (Sparck Jones, K. and Tait, J. I. 1984. Automatic Search Term Variant Generation. Journal of Documentation, 40(1):50–66.; Fabre, C. and Jacquemin, C. 2000. Boosting Variant Recognition with Light Semantics. In Proceedings of COLING-2000. Saarbrücken, Germany.). In this patent document, inference rules include relationships that are not exactly paraphrases, but are nonetheless related and are potentially useful to information retrieval systems. For example, “X caused Y≈Y is blamed on X” is an inference rule even though the two phrases do not mean exactly the same thing.
Traditionally, knowledge bases containing such inference rules are created manually. This knowledge engineering task is extremely laborious. More importantly, building such a knowledge base is inherently difficult since humans are not good at generating a complete list of rules. For example, while it is quite trivial to come up with the rule “X wrote Y≈X is the author of Y”, it seems hard to dream up a rule like “X manufactures Y≈X's Y factory”, which can be used to infer that “Chrétien visited Peugeot's newly renovated car factory in the afternoon” contains an answer to the query “What does Peugeot manufacture?”
It is known in the art of information retrieval to identify phrasal terms from queries and generate variants for query expansion. Query expansion is known to promote effective retrieval of information, and a variety of query expansion algorithms have been proposed. U.S. Pat. No. 6,098,033 issued Aug. 1, 2000, disclosed a method for computing word similarity according to chains of semantic relationships between words. Examples of such semantic relationships include Cause, Domain, Hypernym, Location, Manner, Material, Means, Modifier, Part, and Possessor. While the method disclosed in U.S. Pat. No. 6,098,033 uses chains of relationships as features to compute similarities between words, the invention disclosed in this patent document uses words as features to compute the similarity between chains of relationships.
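The contrast drawn above, in which words serve as features for comparing chains of relationships, may be illustrated by the following minimal sketch. It is offered by way of example only and is not the similarity measure of the invention. The triples, the function names `slot_fillers`, `jaccard`, and `path_similarity`, and the use of Jaccard overlap combined by a geometric mean are all assumptions introduced for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (word filling X, chain of relationships, word filling Y).
triples = [
    ("Key", "X wrote Y", "Star Spangled Banner"),
    ("Key", "X is the author of Y", "Star Spangled Banner"),
    ("Tolstoy", "X wrote Y", "War and Peace"),
    ("Tolstoy", "X is the author of Y", "War and Peace"),
    ("Barr", "X sang Y", "Star Spangled Banner"),
]


def slot_fillers(triples):
    """Collect, for each chain, the sets of words observed in its X and Y slots."""
    fillers = defaultdict(lambda: {"X": set(), "Y": set()})
    for x, path, y in triples:
        fillers[path]["X"].add(x)
        fillers[path]["Y"].add(y)
    return dict(fillers)


def jaccard(a, b):
    """Set overlap in [0, 1]; an illustrative stand-in for a real measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def path_similarity(fillers, p1, p2):
    """Combine the X-slot and Y-slot overlaps by a geometric mean (assumed)."""
    sim_x = jaccard(fillers[p1]["X"], fillers[p2]["X"])
    sim_y = jaccard(fillers[p1]["Y"], fillers[p2]["Y"])
    return (sim_x * sim_y) ** 0.5


fillers = slot_fillers(triples)
wrote_vs_author = path_similarity(fillers, "X wrote Y", "X is the author of Y")
wrote_vs_sang = path_similarity(fillers, "X wrote Y", "X sang Y")
```

In this toy data, “X wrote Y” and “X is the author of Y” are filled by the same words in both slots and therefore score higher than “X wrote Y” and “X sang Y”, which share a Y-slot word but no X-slot word.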
U.S. Pat. No. 6,076,088 issued Jun. 13, 2000, disclosed a method to extract Concept-Relation-Concept (CRC) triples from a text collection and to use the extracted CRC triples to answer queries about the text collection. The invention disclosed in this patent document differs from U.S. Pat. No. 6,076,088. In particular, the CRC triples represent information explicitly stated in the text collection, whereas operation of the present invention results in inference rules that are typically not stated explicitly in a text collection.