There are over three hundred million people living in the United States, and for each given individual, there are generally several entity records. Additionally, there are millions of organizations established in the United States, each of which is associated with several entity records. Examples of entity records may include real estate recordations, SEC filings, birth certificates, death certificates, marriage licenses, hunting and fishing licenses, motor vehicle licenses, litigation documents, news articles, financial documents, medical records, any structured record extracted from electronic text and the like. Creating a profile based on available data for any given entity (i.e., organization and/or individual) would therefore require searching several individual databases (one database for each type of entity record). This process of manually searching and collecting data throughout various databases is time-consuming and potentially expensive. The problem is further compounded by the added effort necessary to ensure that records from various databases actually resolve to the correct, given entity.
Entity resolution (ER) is a well-studied problem in natural language processing that involves identifying and linking data to a real world entity. Each real world entity has a corresponding authority record, which is usually maintained in a master record database (MRD). An authority record is a comprehensive record for a given entity. For example, an authority record may contain all the known names, addresses and phone numbers for a given entity. Entity records are linked to a given authority record based on some matching evidence. ER typically requires matching several different attributes of the entity using various similarity metrics and reaching a match decision by combining the individual similarity values. The individual attribute matching is often more complex than simple string matching, as a single entity attribute may be expressed in more than one way and mean the same thing. For instance, when trying to determine if the phrases “Barack Obama” and “President Obama” refer to the same individual, a simple text string match would determine that these phrases are not the same because the text strings are not identical. However, individual attribute matching may associate an attribute with the phrase “Barack Obama” indicating that this phrase refers to the President of the United States of America, and may likewise associate an attribute with the phrase “President Obama” indicating that this phrase refers to the President of the United States of America. Individual attribute matching then compares the attributes of the two phrases and determines that both phrases refer to the President of the United States of America. Therefore, these phrases should be connected in some way. Additionally, in particular when the entities are persons, a match on only the name is not dispositive of identity because many names are shared by different individuals.
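By way of illustration only, the combination of individual similarity values into a match decision might be sketched as follows. The attribute names, the weights, the threshold and the use of a character-based similarity ratio are all assumptions made for this sketch and are not taken from any particular known system; in practice, each attribute would likely use its own similarity metric.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights and cut-off, chosen only for illustration.
WEIGHTS = {"name": 0.5, "title": 0.3, "location": 0.2}
MATCH_THRESHOLD = 0.75


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive character similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(entity_record: dict, authority_record: dict) -> float:
    """Weighted average of per-attribute similarities.

    Only attributes present in both records contribute, so a missing
    attribute neither helps nor hurts the score.
    """
    total = 0.0
    weight_sum = 0.0
    for attr, weight in WEIGHTS.items():
        if attr in entity_record and attr in authority_record:
            similarity = string_similarity(entity_record[attr], authority_record[attr])
            total += weight * similarity
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def is_match(entity_record: dict, authority_record: dict) -> bool:
    """Match decision: combined score at or above the assumed threshold."""
    return match_score(entity_record, authority_record) >= MATCH_THRESHOLD


# Same name, different title and location: the name alone does not decide.
mention = {"name": "Michael Jordan", "title": "Professor", "location": "Berkeley"}
athlete = {"name": "Michael Jordan", "title": "Basketball player", "location": "Chicago"}
print(is_match(mention, athlete))
```

In this sketch, an exact name match contributes only part of the combined score, which reflects the observation above that a match on the name alone is not dispositive of identity.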
Successful resolution of a name in text from an entity record to a given authority record often depends on using information beyond the name. For successful disambiguation, a rich authority record that contains diverse, high quality information relevant to the entity is desired. For example, additional information relevant to an entity “Michael Jordan” would help to disambiguate and resolve a person entity to “Michael I. Jordan,” the University of California, Berkeley professor, instead of resolving it to the famous American basketball player “Michael Jordan.”
Entity resolution techniques have used Wikipedia® as a resource for resolving named entities. Wikipedia® articles are often very helpful for resolution because of their semi-structured content, which contains relevant information about named entities. However, in a situation where a large population of person names must be resolved, Wikipedia® is not a comprehensive source of information. Wikipedia® generally focuses on creating articles and links for movie stars, professional sports players, politicians, writers, other celebrities and the like. Therefore, at least for individual people, this approach remains highly inadequate because the majority of people to be resolved do not have any Wikipedia® entry. For such people, news and public information are significant sources of additional information about the person.
Numerous known approaches have addressed the problem of entity resolution or entity matching. Known approaches include 1) an approach for resolving author entities by referring to the source of the entity (see W. Shen, P. DeRose, L. Vu, A. Doan, and R. Ramakrishnan, “Source-aware entity matching: A compositional approach”); 2) an approach for disambiguating named entities based on popularity and context similarity (see J. Hoffart, M. A. Yosef, I. Bordino, H. Furstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum, “Robust disambiguation of named entities in text”); and 3) an approach for analyzing a link structure of a social network to disambiguate the mentions represented by the network nodes (see D. Balasuriya, N. Ringland, J. Nothman, T. Murphy, and J. R. Curran, “Named entity recognition in Wikipedia”). However, the first known approach does not necessarily work on generic person named entities retrieved from unstructured text, the second known approach does not actually enhance the authority record to improve the resolution accuracy, and the third known approach presumes that the entity in question belongs to a social network that may be accessed and analyzed.
One particular known approach to the problem is described in the parent patent application, U.S. patent application Ser. No. 12/341,913, entitled “Systems, Methods, and Software for Entity Relationship Database Resolution.” Generally, entity records are resolved to a structured authority record stored in a master record database. For example, entity records are resolved using structured resolution requests that consist of the named entity string extracted from the entity record with a named entity tagger, as well as information that co-occurs with the tagged name. Co-occurring information includes such elements as words occurring in close proximity to the reference, job title information, location names, organization names, and other person names. The source information in the authority record is built, for example, from professional directory databases containing limited curriculum vitae data such as name, company affiliation, job title, and work address. Supporting information such as current work activity, current clients, current associates, and other identifying information is not initially present in the authority records. For example, consider the authority record for ex-Illinois governor “Rod Blagojevich,” prosecuted in 2011 on fraud and corruption charges. An exemplary authority record for him might contain his name (Rod Blagojevich), job title (Governor), location (Chicago, Ill.), and political affiliation (Democrat), with no reference to his involvement in corruption and misconduct in office, because the original authority record was created well before the corruption story.
Consequently, an entity record corresponding to the following news story has no matching evidence other than the name with which to match it to his associated authority record: “The Rod Blagojevich who once challenged a prosecutor to face him like a man, the glad-handing politician who took to celebrity TV shows to profess his innocence, was nowhere to be found Wednesday as he was sentenced to 14 years in prison for corruption.” However, if the authority record could be enhanced to include the term “corruption” based on several entity records in which his name co-appeared with the term, it would be easier to resolve the entity record to the correct authority record. That being said, manually updating the authority record with timely information is a daunting and extremely resource-intensive task. Therefore, a better way is needed to enhance authority records.
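By way of illustration only, the kind of enhancement contemplated above, in which a term is added to an authority record once it has co-appeared with the name in several entity records, might be sketched as follows. The field name context_terms, the stopword list, the min_records cut-off and the sample texts are all hypothetical and serve only to make the sketch self-contained; this is not a description of any existing system.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "was", "were", "for", "in", "on", "of", "to", "against"}


def cooccurring_terms(name: str, texts: list, min_records: int = 2) -> set:
    """Terms that appear alongside `name` in at least `min_records` texts."""
    name_tokens = set(re.findall(r"[a-z]+", name.lower()))
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        if name.lower() not in lowered:
            continue  # ignore texts that do not mention the entity
        tokens = set(re.findall(r"[a-z]+", lowered))
        # Count each term at most once per text.
        counts.update(tokens - name_tokens - STOPWORDS)
    return {term for term, n in counts.items() if n >= min_records}


def enhance_authority_record(record: dict, texts: list, min_records: int = 2) -> dict:
    """Return a copy of `record` with frequently co-occurring terms added."""
    enhanced = dict(record)
    existing = set(record.get("context_terms", []))
    new_terms = cooccurring_terms(record["name"], texts, min_records)
    enhanced["context_terms"] = sorted(existing | new_terms)
    return enhanced


# Hypothetical authority record and entity record texts.
authority = {"name": "Rod Blagojevich", "title": "Governor", "location": "Chicago, Ill."}
texts = [
    "Rod Blagojevich was sentenced for corruption.",
    "Corruption charges against Rod Blagojevich were upheld.",
    "The weather in Chicago was mild.",
]
print(enhance_authority_record(authority, texts)["context_terms"])
```

Requiring a term to appear in more than one entity record is a simple guard against adding incidental words; the appropriate cut-off would depend on the volume and quality of the available entity records.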
Accordingly, the present inventors identified a need to improve entity resolution for sparsely populated entity records and to improve the addition of information to an authority record.