Embodiments of the invention relate to identifying non-distinct names in a set of names.
Entity resolution (or identity disambiguation) techniques may be used to determine when two or more entities (e.g., people, buildings, places, organizations, documents, cars, things, other objects, etc.) represent the same physical entity despite having been described differently. Sometimes these techniques are called de-duplication, match/merge, identity resolution, semantic reconciliation, or have other names. For example, a first record containing CustID#1 [Bob Jones at 123 Main Street with a Date of Birth (DOB) of Jun. 21, 1945] is likely to represent the same entity as a second record containing CustID#2 [Bob K Jones at 123 S. Main Street with a DOB of Jun. 21, 1945]. Entity resolution can be used within a single data source to find duplicates, across data sources to determine how disparate transactions relate to one entity, or used both within and across a plurality of data sources at the same time.
Entity resolution products may be provided with data sets that contain an array of identity data. However, there are many data sets whose chief identifying attribute is a name. For any entity, there may be multiple names that represent that entity, some of which are less distinct than others. A name that is a distinct representation of an entity is one that increases understanding or provides greater context of the identity. A name may include one or more of: a surname or initial, a middle name or initial, a given name or initial, etc.
The following is an example in which Entity 1 has three names (i.e., representations of Entity 1), while Entity 2 has two names (i.e., representations of Entity 2).
ENTITY 1:            ENTITY 2:
JOHN B. SMITH        JOHN DAVID SMITH
JOHN BRIAN SMITH     PETE THOMPSON
JOHN SMITH
Multiple names for a single entity may be known to be associated with that single entity based on various matching features (e.g., same social security number for each of the names). Thus, JOHN DAVID SMITH and PETE THOMPSON are known to be associated with Entity 2, although the names appear to be different. The name JOHN SMITH in Entity 1 is an obvious, non-distinct, duplicative representation of every other name in Entity 1. The name JOHN SMITH does not add any context or understanding of the names in Entity 1. Further, the name JOHN SMITH in Entity 1 could also be a non-distinct representative of the name JOHN DAVID SMITH in Entity 2.
In attempting to determine the likeness of the names in the two entities, an entity resolution system may perform a cross-entity scoring technique, which performs a pair-wise comparison of the cross product of the names (e.g., in each pair of names compared, one name is from Entity 1 and the other name is from Entity 2) and generates a score for each pair of names, which might result in the following:
JOHN DAVID SMITH vs. JOHN B. SMITH: 80%
JOHN DAVID SMITH vs. JOHN BRIAN SMITH: 80%
JOHN DAVID SMITH vs. JOHN SMITH: 90%
PETE THOMPSON vs. JOHN B. SMITH: 2%
PETE THOMPSON vs. JOHN BRIAN SMITH: 2%
PETE THOMPSON vs. JOHN SMITH: 2%
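The pair-wise comparison of the cross product described above can be sketched as follows. This is a minimal illustration, not the scoring technique of any particular entity resolution system: the `similarity` function is a hypothetical stand-in based on Python's `difflib.SequenceMatcher`, and it is not expected to reproduce the example percentages listed above. The function and variable names are assumptions introduced for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

# Names taken from the example entities in the text.
entity_1 = ["JOHN B. SMITH", "JOHN BRIAN SMITH", "JOHN SMITH"]
entity_2 = ["JOHN DAVID SMITH", "PETE THOMPSON"]


def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in scorer: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def cross_entity_scores(names_a, names_b):
    """Score every pair in the cross product of the two name lists.

    In each pair, one name comes from names_a and the other from names_b.
    """
    return {(a, b): similarity(a, b) for a, b in product(names_a, names_b)}


scores = cross_entity_scores(entity_2, entity_1)
for (a, b), score in scores.items():
    print(f"{a} vs. {b}: {score:.0%}")
```

With two names in Entity 2 and three names in Entity 1, the cross product yields six pairs, matching the six comparisons in the example.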
The highest score in this example results from the comparison to the name that is the least distinct representation of Entity 1 (JOHN SMITH). While this may be a legitimate score, it does not accurately represent how alike or how different the names in the two entities are. Rather, the highest score suggests that the entities are very much alike when, in fact, they have a significant conflict (the middle name). Similarly, the lowest score is generated from comparing very different names (PETE THOMPSON). Again, this low score is legitimate, but it may not accurately reflect the likeness between the names in the two entities. Even if an entity resolution system used an average of the scores, rather than the highest or lowest score, to make decisions about the likeness of these names, the result would most likely be skewed higher.
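To make the three decision strategies concrete, the following sketch computes the highest, lowest, and average of the six example scores given above. The scores are copied from the example; the variable names are assumptions introduced for illustration.

```python
from statistics import mean

# Example scores from the text, expressed as fractions.
scores = {
    ("JOHN DAVID SMITH", "JOHN B. SMITH"): 0.80,
    ("JOHN DAVID SMITH", "JOHN BRIAN SMITH"): 0.80,
    ("JOHN DAVID SMITH", "JOHN SMITH"): 0.90,
    ("PETE THOMPSON", "JOHN B. SMITH"): 0.02,
    ("PETE THOMPSON", "JOHN BRIAN SMITH"): 0.02,
    ("PETE THOMPSON", "JOHN SMITH"): 0.02,
}

highest = max(scores.values())
lowest = min(scores.values())
average = mean(scores.values())

# The pair responsible for the highest score.
best_pair = max(scores, key=scores.get)

print(f"highest: {highest:.0%} from {best_pair[0]} vs. {best_pair[1]}")
print(f"lowest:  {lowest:.0%}")
print(f"average: {average:.1%}")
```

The highest score comes from the pair that includes JOHN SMITH, the least distinct name in Entity 1, and that same pair also raises the average (about 43% over the six pairs).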
Some systems may take a statistical approach, in which the cardinality of a given name is directly correlated to the number of instances in the data set. This approach may assume an unbiased data set, may assume no knowledge about the true distinctiveness of a name, and may rely solely on the name's occurrence within a given data set. Also, this approach may assume a training set consisting of the entirety of the world's names.
Some systems may take an approach of survivorship. Survivorship is the process of reducing each entity down to only the best elements. In such systems, an entity would not contain multiple names because survivorship rules would reduce a list of names to one name. Typically, survivorship rules are simple rules (e.g., longest strings or most words).
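The two simple survivorship rules mentioned above (longest string and most words) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and ties are broken by keeping the first name encountered, which is a choice made for this sketch rather than one stated in the text.

```python
def survive_longest(names):
    """Keep only the name with the most characters (first one wins ties)."""
    return max(names, key=len)


def survive_most_words(names):
    """Keep only the name with the most whitespace-separated words."""
    return max(names, key=lambda name: len(name.split()))


# Names taken from the example entities in the text.
entity_1 = ["JOHN B. SMITH", "JOHN BRIAN SMITH", "JOHN SMITH"]
entity_2 = ["JOHN DAVID SMITH", "PETE THOMPSON"]

print(survive_longest(entity_1))
print(survive_most_words(entity_1))
print(survive_longest(entity_2))
```

Under either rule, Entity 2 is reduced to JOHN DAVID SMITH and the name PETE THOMPSON is discarded, even though it is a distinct representation of that entity.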