As computers and networks gain popularity, web-based computer documents (“documents”) become a vast source of factual information. Users may look to these documents to get answers to factual questions, such as “what is the capital of Poland” or “what is the birth date of George Washington.” The factual information included in these documents may be extracted and stored in a fact database.
Documents are often generated based on a template. For example, titles of the documents in the wikipedia.org website often follow a pattern of “[SUBJECT]—Wikipedia, the free encyclopedia,” where the section in square brackets is substituted with the subject of the page. These documents also often represent facts in a structured format. For example, documents in the wikipedia.org website frequently list facts in a table format.
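As an illustration of how such a title template might be exploited, the following minimal sketch recovers the subject from a title that follows the pattern described above. The function name `extract_subject`, the constant `TITLE_PATTERN`, and the choice to accept a hyphen, en dash, or em dash as the separator are assumptions made for this example and are not taken from any particular system.

```python
import re

# Hypothetical template: "[SUBJECT]<dash>Wikipedia, the free encyclopedia".
# Assumption: the separator may be a hyphen, an en dash, or an em dash,
# with optional whitespace on either side.
TITLE_PATTERN = re.compile(
    r"^(?P<subject>.+?)\s*[-\u2013\u2014]\s*Wikipedia, the free encyclopedia$"
)


def extract_subject(title):
    """Return the subject of a templated title, or None if it does not match."""
    match = TITLE_PATTERN.match(title.strip())
    return match.group("subject") if match else None


if __name__ == "__main__":
    print(extract_subject("George Washington - Wikipedia, the free encyclopedia"))
    print(extract_subject("An unrelated page title"))
```

A title that does not follow the template yields `None`, so a caller can treat the page as not generated from this template rather than guessing at a subject.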
Conventionally, objects (or entities) and related facts described in documents are identified and extracted (or learned) by human editors. This approach is insufficient for mass fact extraction because the vast volume of documents and the rapid increase in the number of available documents make it impractical for human editors to perform the task on any meaningful scale.
Based on the above, there is a need for a way to automatically identify objects and facts in documents.