Record linkage (RL) is the process of identifying records that refer to the same real-world entity. Such records can occur across different data sources (e.g., files, websites, and databases) and can appear in different formats across similar sources. A record linkage process can be performed to join or link data sets that do not share a common identifier, such as a database key or URI, and it can be a useful tool when performing data mining tasks. Record linkage analysis based on entity behavior also has many other applications, for example identifying common customers for stores that are considering a merger, tracking users who access websites from different IP addresses, and assisting in criminal investigations.
A technique that can be used to match data originating from two entities is to measure the similarity between their behaviors. However, complete knowledge of an entity's behavior is typically not available to both sources, since each source is aware only of the entity's interactions with that source. A comparison of entities' behaviors will therefore be a comparison of their partial behaviors, which can be misleading and will generally provide less useful information. Moreover, even where both sources have almost complete knowledge of the behavior of a given entity (such as when a customer did all their grocery shopping at one store for one year and then at another store for another year), a similarity strategy may not help, because many entities have very similar behaviors. Accordingly, measuring similarity can at best group entities with similar behavior together, but will not typically find their unique matches.
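The limitation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that behavior is summarized as item purchase counts and that cosine similarity with a 0.95 threshold is the similarity measure; the entity keys, item names, counts, and the helper names `cosine_similarity` and `candidate_matches` are all hypothetical and are not taken from any particular record linkage system.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * b.get(item, 0) for item, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_matches(behavior, other_source, threshold=0.95):
    """Return the keys in other_source whose behavior is at least
    `threshold` similar to the given behavior (threshold is an assumption)."""
    return sorted(
        key
        for key, other in other_source.items()
        if cosine_similarity(behavior, other) >= threshold
    )


# Hypothetical purchase counts observed by two different stores.
source_a = {
    "a1": {"milk": 10, "bread": 8, "eggs": 5},
}
source_b = {
    "b1": {"milk": 9, "bread": 8, "eggs": 6},
    "b2": {"milk": 10, "bread": 7, "eggs": 5},
    "b3": {"wine": 12, "cheese": 4},
}

# Both b1 and b2 clear the threshold, so similarity alone groups them
# with a1 but cannot say which one is the same customer.
candidates = candidate_matches(source_a["a1"], source_b)
print(candidates)
```

In this toy data the similarity score separates the grocery shoppers from the unrelated entity `b3`, but it leaves two equally plausible candidates for `a1`, which is the grouping-without-unique-matching behavior the paragraph describes.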