1. Field
Embodiments of the invention relate to unguided curiosity in support of entity resolution techniques.
2. Description of the Related Art
The terms identity resolution, entity resolution, and semantic reconciliation generally refer to the same type of technique (e.g., algorithm). More specifically, such techniques frequently use probabilistic techniques, deterministic techniques, or some combination of both to determine with a degree of confidence whether entities (e.g., persons, places, or things) are the same or not. This decision is an entity resolution “assertion.”
For example, a first record containing CustID#1 [Bob Jones at 123 Main Street with a Date of Birth (DOB) of Jun. 21, 1945] is likely to represent the same entity as a second record containing CustID#2 [Bob K Jones at 123 S. Main Street with a DOB of Jun. 21, 1945]. Entity resolution can be used within a single data source to find duplicates, across data sources to determine how disparate transactions (also referred to herein as records) relate to one entity, or both within and across a plurality of data sources at the same time.
Entities have features (values that are collected or observed that can be more or less discriminating). For example, in the area of human entities, features may include, but are not limited to, one or more of: name, address, phone, DOB, Social Security Number (SSN), Driver's License (D/L), biometric features, gender, hair color, geospatial and temporal attributes, familial or other relationships, and patterns of life (such as one's movement over the course of a day). By way of example, SSNs are generally very discriminating, dates of birth are less discriminating, and gender is not particularly discriminating at all. As another example, entity resolution on objects, such as a car, may include one or more features of: license plate number, Vehicle Identification Number (VIN), make, model, year, color, owner, and so on.
Features may be used to establish confidence (a degree of certainty that two discretely described entities are the same). For the above example of CustID#1 and CustID#2, the confirming features of name, address, and DOB and the lack of conflicting features (e.g., features in disagreement, such as an SSN of 111-11-1111 versus 33-44-5555) probably result in a high enough confidence to assert that the first record and the second record represent the same entity (e.g., person), without human review.
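By way of illustration only, the following Python sketch shows one way confirming and conflicting features might be combined into a single evidence score. It is not the method of any particular entity resolution system. The point values in FEATURE_POINTS, the similarity cutoff, and the function names are assumptions introduced here for the example; fuzzy comparison uses the standard-library difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher

# Hypothetical evidence points per feature (assumed values, not from the text):
# more discriminating features carry more points.
FEATURE_POINTS = {"ssn": 9, "dob": 4, "name": 3, "address": 3, "gender": 1}

# Assumed cutoff above which two feature values are treated as agreeing.
SIMILARITY_CUTOFF = 0.85


def normalize(value):
    """Lowercase and keep only letters and digits, ignoring punctuation and spacing."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def features_agree(a, b):
    """Return True if two feature values are similar enough to count as confirming."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= SIMILARITY_CUTOFF


def evidence_score(rec_a, rec_b):
    """Add points for each confirming feature and subtract points for each conflict.

    A feature missing from either record neither confirms nor conflicts.
    """
    score = 0
    for feature, points in FEATURE_POINTS.items():
        a, b = rec_a.get(feature), rec_b.get(feature)
        if a is None or b is None:
            continue
        score += points if features_agree(a, b) else -points
    return score


cust1 = {"name": "Bob Jones", "address": "123 Main Street", "dob": "Jun. 21, 1945"}
cust2 = {"name": "Bob K Jones", "address": "123 S. Main Street", "dob": "Jun. 21, 1945"}

print(evidence_score(cust1, cust2))
```

In this sketch, the name, address, and DOB of CustID#1 and CustID#2 all count as confirming, while adding conflicting SSN values to the two records would subtract the SSN points and lower the score substantially.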
Entity resolution is sometimes referred to by other names, e.g., deduplication, match/merge, and so on. Entity resolution systems are described further in: “Entity Resolution Systems vs. Match Merge/Merge Purge/List De-duplication Systems” by Jeff Jonas, published Sep. 25, 2007. Some entity resolution assertion systems can automatically reverse earlier assertions based on new records, hence correcting earlier assertions.
Entity resolution systems can be imprecise to some degree. Sometimes entity resolution may resolve two entities into one when they are not one (called a false positive) or determine that two entities are not the same when they are the same (called a false negative). At other times, entity resolution processes may determine that two entities are quite alike, yet there simply is not enough evidence (available features) to determine with certainty that the entities are the same. This type of uncertainty might simply be referred to as a “maybe.” For example, the fact that two records share a fairly rare name and have addresses in the same city may not cause an entity resolution engine (based on its configuration) to assert that these records are for the same person; nonetheless, the pair would almost certainly qualify as a “maybe.”
Most entity resolution systems, one would expect, will assert some entities as same, other entities as not same, and then some entities will likely fall into the category of “maybe.”
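By way of illustration only, the three outcomes described above can be sketched as a pair of thresholds applied to a match score. The function name classify, the integer score scale, and the default threshold values are assumptions made for this example and are not drawn from any particular entity resolution system.

```python
def classify(score, same_at=8, not_same_at=2):
    """Map a match score to "same", "not same", or "maybe".

    same_at and not_same_at are hypothetical, configurable thresholds;
    scores that fall strictly between them are left as "maybe".
    """
    if same_at <= not_same_at:
        raise ValueError("same_at must be greater than not_same_at")
    if score >= same_at:
        return "same"
    if score <= not_same_at:
        return "not same"
    return "maybe"


for score in (10, 5, 0):
    print(score, classify(score))
```

Widening the gap between the two thresholds reduces false positives and false negatives at the cost of producing more “maybes.”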
The greater the number of data sources and the greater the number of records, the more potential “maybes” there are, and organizations ranging from banks to insurance companies find themselves overwhelmed if they have to use human capital to manually evaluate all the “maybes.” Furthermore, organizations that make critical decisions (determination of credit worthiness, law enforcement investigations, etc.) based on entity resolution often feel compelled to evaluate not only the “maybes” but also the system-generated assertions of same or not same. Unfortunately, there are typically not enough people in an organization to manually inspect and validate these computer-generated decisions either.
Thus, there is a need for an improved entity resolution system capable of automatically addressing the “maybes” and validating assertions of same and not same in a more efficient, automated manner.