When data are gathered from diverse sources, it is often difficult to determine whether data from different sources pertain to the same entity. For example, consider the problem of extracting data from web pages and other electronic documents on the Internet in order to build a repository of objects containing facts about entities. Generally, it is possible to analyze a web page and identify the name of the entity that the page describes. For example, one can determine that a web page describes the entity named “George Bush.” Therefore, one technique for building the fact repository is to create an object for each (name, web page) tuple and associate all of the facts on the given web page with that object.
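As a minimal sketch of the technique just described, the following Python fragment creates one object per (name, web page) tuple and attaches that page's facts to it. The class name `ExtractedObject`, the function `build_repository`, the example URLs, and the fact keys are all hypothetical and chosen only for illustration. Name identification and fact extraction are assumed to have already taken place.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedObject:
    """One object per (name, web page) tuple; hypothetical structure."""
    name: str
    source_url: str
    facts: dict = field(default_factory=dict)


def build_repository(pages):
    """Build a repository keyed by (name, url).

    `pages` is an iterable of (name, url, facts) triples, where `facts`
    is a dict of attribute -> value pairs extracted from that page.
    """
    repository = {}
    for name, url, facts in pages:
        key = (name, url)
        # Each distinct (name, url) tuple is treated as a unique object.
        obj = repository.setdefault(key, ExtractedObject(name, url))
        obj.facts.update(facts)
    return repository


# Illustrative input only: two hypothetical pages that both name "George Bush".
pages = [
    ("George Bush", "http://example.com/a", {"office": "President of the United States"}),
    ("George Bush", "http://example.org/b", {"nationality": "American"}),
]
repository = build_repository(pages)
```

Because the key includes the web page, the two pages above yield two separate objects even though they share a name.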
Since the technique described above treats each object formed from a (name, web page) tuple as unique, it can result in many different objects associated with the same entity. There might be 7,000,000 web page references for “George Bush,” 5,000,000 references for “Bill Clinton,” and an additional 500,000 references for “William Jefferson Clinton,” and each web page results in a separate object. However, some objects with the same name might be associated with different entities. For example, two objects named “George Bush” can be associated with different entities if one object references the 41st President of the United States while the other references the 43rd President. Likewise, two objects named “Bill Clinton” can be associated with different entities if one object describes the 42nd President while the other describes a book about the Clinton presidency. Two objects with different names might also describe the same entity. Additional complications arise because even objects about the same entity are likely to contain different subsets of facts about the entity, and objects will sometimes contain erroneous facts due to errors in the source documents.
Ideally, the fact repository should contain exactly one extracted object for each unique entity. However, the large number of web pages and resulting extracted objects makes it impractical for human users to review and analyze the objects in the repository.