A challenge for builders of databases whose information is culled from multiple sources is the detection of duplicates, where a single real-world entity gives rise to multiple records. For example, online citation indexes need to be able to navigate the different capitalization and abbreviation conventions that appear in bibliographic entries; government agencies need to know whether a record for “Robert Smith” living on “Northwest First Street” refers to the same person as one for a “Bob Smith” living on “1st St. NW”; and consumers need to know whether publicly available records correspond to the same entity or to different entities. This problem becomes more significant as the amount of readily available information continues to increase.
A standard machine learning approach to this problem is to train a model that assigns scores to pairs of records, where pairs scoring above a threshold are said to represent the same entity. Transitive closure is then performed on this same-entity relationship to find the sets of duplicate records. Comparing all pairs of records is quadratic in the number of records and is therefore intractable for large data sets. In practice, using an approach called “blocking”, only a subset of the possible pairs is referred to the machine learning component, and all other pairs are assumed to represent different entities. So a “Robert Smith”/“Bob Smith” record pair may be scored while a “Robert Smith”/“Barack Obama” pair is dismissed. This risks a false negative error if the “Robert Smith” and “Barack Obama” records do in fact refer to the same person, but in exchange the system runs faster.
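As a purely illustrative, non-limiting sketch of the thresholding and transitive closure steps described above, the following Python fragment links pairs that score above a threshold and then groups the linked records with a union-find structure. The function names, the record identifiers, the scores and the threshold value are all assumptions made for illustration; the text does not specify how the closure is computed.

```python
def find(parent, x):
    # Walk up to the root of x's group, shortening the path along the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(record_ids, scored_pairs, threshold):
    """Group records linked, directly or transitively, by above-threshold pairs."""
    parent = {r: r for r in record_ids}
    for a, b, score in scored_pairs:
        if score > threshold:
            # Merge the two groups: this is the same-entity relationship.
            parent[find(parent, a)] = find(parent, b)
    groups = {}
    for r in record_ids:
        groups.setdefault(find(parent, r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical scores, as if produced by a pairwise model.
scored = [("r1", "r2", 0.93), ("r2", "r3", 0.88), ("r3", "r4", 0.12)]
clusters = cluster(["r1", "r2", "r3", "r4"], scored, threshold=0.5)
print(clusters)
```

In this sketch "r1" and "r3" end up in the same set even though that pair was never scored, which is the effect of the transitive closure.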
The term of art for this process is blocking because it groups similar-seeming records into blocks that a pairwise decision-making component then explores exhaustively. That component may use either a machine learning technique or a deterministic technique to determine whether a pair of records should in fact be linked. Published blocking algorithms share the general strategy of quickly identifying a set of record pairs to pass along to a linkage component.
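The general strategy can be illustrated with a minimal, non-limiting Python sketch in which records sharing a blocking key are placed in the same block and every pair inside a block is passed along for scoring. The helper name block_pairs, the choice of the lower-cased last name as the key, and the sample records are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key_fn):
    """Return the candidate pairs produced by exhaustively pairing each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"first": "Robert", "last": "Smith"},
    2: {"first": "Bob", "last": "Smith"},
    3: {"first": "Barack", "last": "Obama"},
}
# Hypothetical blocking key: the lower-cased last name.
candidate_pairs = block_pairs(records, key_fn=lambda r: r["last"].lower())
print(candidate_pairs)
```

With three records there are three possible pairs, but only the pair of “Smith” records reaches the linkage component in this sketch; the pairs involving the “Obama” record are never scored.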
Previous work relevant to blocking is known. See, e.g., A. K. Elmagarmid, P. G. Ipeirotis and V. S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Transactions on Knowledge and Data Engineering, pages 1-16, 2007; A. Borthwick, A. Goldberg, P. Cheung and A. Winkel, “Batch Automated Blocking And Record Matching,” 2005, U.S. Pat. No. 7,899,796; A. McCallum, K. Nigam and L. H. Ungar, “Efficient Clustering Of High-Dimensional Data Sets With Application To Reference Matching,” Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 169-178, 2000; M. A. Hernandez and S. J. Stolfo, “Real-World Data Is Dirty: Data Cleansing And The Merge/Purge Problem,” Journal of Data Mining and Knowledge Discovery, pages 1-39, 1998. However, additional improvements are possible and desirable.
We describe herein a novel blocking technique for duplicate record detection that operates on the intuitive notion of grouping together records with similar properties and then subdividing the groups using other shared properties until they are all of tractable size. A non-limiting example implementation in the MapReduce framework provides parallel computing that may scale to inputs in the billions of records. We call our overall non-limiting technique dynamic blocking because the blocking criteria adjust in response to the composition of the data set. We want blocking to be a mechanical, automatically implemented process, not an art.
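The group-then-subdivide idea may be sketched, in a single-process and non-limiting form, as follows. This is an illustration under stated assumptions rather than the MapReduce implementation: the function name dynamic_blocks, the fixed max_block_size, the property ordering used to avoid generating the same sub-block twice, and the sample records are all hypothetical.

```python
from collections import defaultdict


def dynamic_blocks(records, properties, max_block_size):
    """Subdivide blocks on further shared properties until each is small enough.

    Returns (accepted, oversized); keys are tuples of (property, value) pairs.
    """
    accepted, oversized = {}, {}
    # Work item: (block key, record ids, index of the next property to try).
    work = [((), sorted(records), 0)]
    while work:
        key, ids, start = work.pop()
        if key and len(ids) <= max_block_size:
            accepted[key] = ids          # tractable: hand to pairwise linkage
            continue
        if start == len(properties):
            oversized[key] = ids         # nothing left to subdivide on
            continue
        for i in range(start, len(properties)):
            prop = properties[i]
            groups = defaultdict(list)
            for rid in ids:
                value = records[rid].get(prop)
                if value is not None:
                    groups[value].append(rid)
            for value, members in groups.items():
                if len(members) >= 2:    # a singleton yields no pairs
                    work.append((key + ((prop, value),), members, i + 1))
    return accepted, oversized


records = {
    1: {"last": "smith", "first": "robert", "zip": "33101"},
    2: {"last": "smith", "first": "robert", "zip": "33101"},
    3: {"last": "smith", "first": "bob", "zip": "33101"},
    4: {"last": "smith", "first": "alice", "zip": "90210"},
    5: {"last": "obama", "first": "barack", "zip": "20500"},
}
accepted, oversized = dynamic_blocks(records, ["last", "first", "zip"], 3)
for key, ids in sorted(accepted.items()):
    print(key, ids)
```

In this sketch the four-record “smith” block exceeds the limit, so it is not accepted as is; it is replaced by the smaller sub-blocks keyed on last name plus first name and on last name plus zip. Which keys are subdivided therefore depends on the data, not on a hand-written rule.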
One example non-limiting blocking strategy is used to deploy a massive database of personal information for an online people search. This database distills a heterogeneous collection of publicly available data about people into coherent searchable profiles. This distillation process can be framed as a duplicate detection task. We have developed a non-limiting novel blocking procedure that, in addition to addressing the standard performance/recall tradeoff, is tailored to 1) scale to very large data sets and 2) robustly handle novel data sources. Scaling to very large data sets is useful because we map billions of input records to hundreds of millions of people in the real world. This is possible with distributed computing, and the ability to distribute the work informs the design. Robustly handling diverse data sources is useful because we are acquiring new and diverse sources of information all the time, so hand-crafting of the blocking procedure by experts can become a bottleneck.
Additional example non-limiting dynamic blocking features and/or advantages of exemplary non-limiting implementations include:
        Extending known blocking algorithms in significant and non-obvious ways
        Adapting blocking algorithms to run in a scalable, distributed computing environment
        Adding a ramp factor for more efficient extraction of the sub-block space
        In common with many known blocking systems, using a general strategy of quickly identifying a set of record pairs to pass along to a linkage component
        Particular advantageous ways to construct sets and handle oversized sets
        Use of known strategy of allowing sets of records to overlap in a new context
        Can be applied to data sets when there is no obvious quickly-calculable metric between records and the number of records makes even a fast calculation for all pairs intractable
        Create multiple top-level blocks that can be worked on independently
        As records have several property dimensions along which they may vary, no need to try to define a single ordering that places similar records next to each other.
        We allow the maximum block size to be a function of the block key length.
        This blocking procedure incorporates innovations necessary to make it work on data sets containing billions of records.
        The entire system from start to finish including the blocking component that is the focus of this patent can be run in parallel across a distributed cluster of computers.
        The blocking algorithm can handle arbitrarily large input record blocks.
        Example non-limiting system can be implemented using the MapReduce distributed computing framework.
        Pair deduplication and text normalization (while I/O constraints of working with large data sets may prevent a straightforward deduplication of all pairs, a suitable algorithm can perform this deduplication without incurring the I/O costs)
        One non-limiting arrangement may not directly address errors in the original data, relying instead on multiple top-level keys to catch mis-blockings due to typos, but there is also a possibility to implement text normalization for our system.
        Additional record properties such as, for example, Soundex versions of text fields.
        Use of pair mechanism to discover relationships between entities.
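To illustrate the item above concerning Soundex versions of text fields, the following non-limiting Python sketch computes a standard American Soundex code and stores it as an additional record property on which blocks could be keyed. The function name soundex and the property name last_soundex are assumptions for illustration; the text does not specify how such properties would be computed or named.

```python
# Consonant groups of the standard American Soundex coding.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue                     # h and w do not separate equal codes
        code = SOUNDEX_CODES.get(c, "")  # vowels and y have no code
        if code and code != prev:
            result += code
        prev = code                      # a vowel resets the previous code
    return (result + "000")[:4]


# Hypothetical use: add the code as an extra property for blocking.
record = {"first": "Bob", "last": "Smyth"}
record["last_soundex"] = soundex(record["last"])
print(record["last_soundex"], soundex("Smith"))
```

Because “Smith” and “Smyth” receive the same code, a block keyed on this property could bring together records whose spellings differ, which complements the use of multiple top-level keys to catch mis-blockings due to typos.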