Entity resolution in the information processing field typically refers to determining whether multiple records, documents, web pages or other data objects represent the same real-world entity. The data objects may be from the same source or from different sources. Examples of entity resolution processes include record matching, record linkage or deduplication. The need for entity resolution often arises in information integration applications where data objects representing the same real-world entity are presented in different ways and there is a lack of a unique identifier for the real-world entity. As a more specific example, a telecommunications equipment supplier may be referred to as “Alcatel-Lucent,” “Alcatel Lucent” and “Lucent” in different records, web pages or other data objects even though these data objects all represent the same company.
A number of entity resolution approaches are known. One possible approach is to perform pairwise comparison of all data objects. However, this simple approach is inefficient, in that it requires O(n²) comparisons for a data set of n objects, and is therefore not scalable for use with very large data sets. Other approaches utilize a technique known as “blocking” in order to provide improved efficiency. Blocking eliminates the need for pairwise comparison of all data objects by assigning the data objects to blocks such that data objects from different blocks are not considered as possible matches, i.e., cannot refer to the same entity. Therefore, pairwise comparisons are only necessary for pairs of objects within the same block in order to identify whether or not they represent the same entity.
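The difference between exhaustive pairwise comparison and blocking can be illustrated with a minimal sketch. The blocking key used here (the lowercase first character of each record) and the sample records other than the “Alcatel-Lucent” variants are hypothetical choices made only for illustration; they are not taken from any particular conventional technique.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Exhaustive approach: every pair of records is a candidate match."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, block_key):
    """Blocking: only records that share a block key are candidate matches."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        # Comparisons are confined to records within the same block.
        pairs.extend(combinations(members, 2))
    return pairs


def first_letter(rec):
    # Hypothetical blocking key: lowercase first character of the record.
    return rec[:1].lower()


records = ["Alcatel-Lucent", "Alcatel Lucent", "Lucent", "Nokia", "Nokia Corp."]
naive = all_pairs(records)
blocked = blocked_pairs(records, first_letter)
print(len(naive), len(blocked))
```

For these five records the exhaustive approach produces 10 candidate pairs, while the blocked approach produces 2. Note that “Lucent” falls into a block of its own under this key and is never compared with “Alcatel-Lucent,” which shows how the choice of blocking key determines which matches can be found at all.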
Examples of conventional blocking techniques include sorted neighborhood, bigram indexing and canopy clustering. Sorted neighborhood is one of the most efficient of the conventional blocking techniques, with a computational complexity of O(n log n). Unfortunately, it fails to capture the pairwise similarities between data objects if two similar strings start with different characters, e.g., “Alcatel-Lucent” and “Lucent-Alcatel.” On the other hand, bigram indexing and canopy clustering capture pairwise similarities better than sorted neighborhood, but they are less efficient because both have computational complexities of O(n²). Thus they do not scale well with large data sets.
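The limitation of sorted neighborhood noted above can be sketched as follows. This is a simplified illustration, assuming a sort on the lowercase string and a sliding window of size 2; the function names and the records “Ericsson” and “Huawei” are hypothetical. The bigram comparison shown is a plain Jaccard similarity over character bigrams, used here only to indicate why bigram-based techniques capture such similarities; it is not a full implementation of bigram indexing.

```python
def sorted_neighborhood_pairs(records, window=2):
    """Sort the records, then pair each record with the next (window - 1) records."""
    order = sorted(range(len(records)), key=lambda i: records[i].lower())
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def bigrams(s):
    """Set of character bigrams of the lowercase string."""
    s = s.lower()
    return {s[k : k + 2] for k in range(len(s) - 1)}


def bigram_jaccard(a, b):
    """Jaccard similarity of the two strings' bigram sets."""
    set_a, set_b = bigrams(a), bigrams(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


records = ["Alcatel-Lucent", "Ericsson", "Huawei", "Lucent-Alcatel"]
candidates = sorted_neighborhood_pairs(records, window=2)
similarity = bigram_jaccard(records[0], records[3])
print((0, 3) in candidates, round(similarity, 2))
```

Because the two variants sort to opposite ends of the list, the pair is never placed in the same window and is never compared, even though the two strings share 11 of their 15 distinct bigrams. Widening the window would eventually recover the pair, but only at the cost of additional comparisons.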