This specification relates to the extraction of information from unstructured text.
As web-based computer technology has developed and grown in popularity, more and more people use search engines to locate information on the World Wide Web. Commonly, information is presented in web pages that include links, or pointers, to other web pages.
Extraction systems have been developed to gather information from documents and to use that information to answer questions directly. For example, systems have been described that receive a set of seed facts and generate patterns by applying those facts to a collection of sentences.
As described, each seed fact is a pair of phrases that relates a subject phrase to an information phrase. Each pattern consists of the words of a sentence broken into three parts: a prefix portion, an infix portion, and a postfix portion. The phrases of a fact are used to separate a sentence into these three parts. The three-part patterns are used to extract additional facts, and the newly extracted facts are used to generate additional patterns. Over hundreds of thousands of iterations, this unsupervised iterative process of fact finding and pattern generation can continue to build up a collection of facts for question answering.
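The iterative process described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular system. The names `Fact`, `Pattern`, `make_pattern`, `apply_pattern`, and `bootstrap` are hypothetical, and the sketch assumes exact string matching, that the subject phrase precedes the information phrase in the sentence, and that a pattern must match a whole sentence.

```python
import re
from typing import Iterable, NamedTuple, Optional, Set, Tuple


class Fact(NamedTuple):
    subject: str  # subject phrase
    info: str     # information phrase


class Pattern(NamedTuple):
    prefix: str   # text before the subject phrase
    infix: str    # text between the two phrases
    postfix: str  # text after the information phrase


def make_pattern(fact: Fact, sentence: str) -> Optional[Pattern]:
    """Split a sentence into three parts around the phrases of a fact.

    Assumption: the subject phrase occurs before the information phrase.
    """
    start = sentence.find(fact.subject)
    if start < 0:
        return None
    subject_end = start + len(fact.subject)
    info_start = sentence.find(fact.info, subject_end)
    if info_start < 0:
        return None
    return Pattern(
        prefix=sentence[:start],
        infix=sentence[subject_end:info_start],
        postfix=sentence[info_start + len(fact.info):],
    )


def apply_pattern(pattern: Pattern, sentence: str) -> Optional[Fact]:
    """Extract a fact from a sentence that fits the three-part pattern."""
    regex = (
        re.escape(pattern.prefix) + r"(.+?)"
        + re.escape(pattern.infix) + r"(.+?)"
        + re.escape(pattern.postfix)
    )
    match = re.fullmatch(regex, sentence)
    return Fact(match.group(1), match.group(2)) if match else None


def bootstrap(
    seeds: Iterable[Fact], sentences: Iterable[str], rounds: int = 2
) -> Tuple[Set[Fact], Set[Pattern]]:
    """Alternate between pattern generation and fact extraction."""
    sentences = list(sentences)
    facts: Set[Fact] = set(seeds)
    patterns: Set[Pattern] = set()
    for _ in range(rounds):
        # Facts -> patterns: every known fact is applied to every sentence.
        for fact in list(facts):
            for sentence in sentences:
                pattern = make_pattern(fact, sentence)
                if pattern is not None:
                    patterns.add(pattern)
        # Patterns -> facts: every pattern is matched against every sentence.
        for pattern in patterns:
            for sentence in sentences:
                fact = apply_pattern(pattern, sentence)
                if fact is not None:
                    facts.add(fact)
    return facts, patterns
```

For example, given the seed fact `Fact("Paris", "France")` and the two sentences "Paris is the capital of France." and "Rome is the capital of Italy.", the sketch derives a pattern with an empty prefix, the infix " is the capital of ", and the postfix ".", and then uses that pattern to extract `Fact("Rome", "Italy")` from the second sentence. A practical system would need looser matching and some way to filter out unreliable patterns, both of which are omitted here.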