Active learning of entity resolution (ER) rules eases the user's burden in settings where interactivity is essential. Current solutions, however, do not scale well to large data sets. For data sets with millions of records, each iteration might take from several minutes to tens of minutes on a 6-node cluster.
Matching functions, which are provided by a user, are the basic units that compose ER rules. Active learning learns a composition of several matching functions together with their thresholds and generates an ER rule. Multiple iterations of the active learning process output a number of ER rules, which as a whole identify records that belong to the same real-world entity.
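To make the composition concrete, the following is a minimal sketch rather than the method described here. It assumes that a rule is a conjunction of (matching function, threshold) predicates and that a set of rules is applied as a disjunction; the text above does not fix either choice. The names `ERRule`, `same_entity`, `name_jaccard`, and `exact_zip`, and the `name` and `zip` fields, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
# A matching function scores the similarity of two records in [0, 1].
MatchingFunction = Callable[[Record, Record], float]


def name_jaccard(a: Record, b: Record) -> float:
    """Hypothetical matching function: Jaccard similarity of name tokens."""
    tokens_a = set(a["name"].lower().split())
    tokens_b = set(b["name"].lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def exact_zip(a: Record, b: Record) -> float:
    """Hypothetical matching function: 1.0 if the zip codes are equal."""
    return 1.0 if a["zip"] == b["zip"] else 0.0


@dataclass
class ERRule:
    # Each predicate pairs a matching function with its learned threshold.
    predicates: List[Tuple[MatchingFunction, float]]

    def matches(self, a: Record, b: Record) -> bool:
        # Assumption: a rule fires only if every predicate meets its threshold.
        return all(func(a, b) >= threshold for func, threshold in self.predicates)


def same_entity(rules: List[ERRule], a: Record, b: Record) -> bool:
    # Assumption: the rules as a whole act as a disjunction.
    return any(rule.matches(a, b) for rule in rules)
```

Under these assumptions, each active learning iteration would contribute one more `ERRule` to the list passed to `same_entity`, so the set of matched pairs can only grow as rules are added.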
Blocking functions are a special type of matching function incorporated into ER rules. Each ER rule should have at least one blocking function. Blocking functions reduce the number of record pairs from the two input data sets that must be compared, which reduces the computation cost.
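The saving from blocking can be illustrated with a short sketch. It assumes that a blocking function can be expressed as equality on a key derived from each record, which is one common form of blocking and not necessarily the one intended here. The names `candidate_pairs` and `zip_key` and the `zip` field are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterator, List, Tuple

Record = Dict[str, str]


def zip_key(record: Record) -> Hashable:
    """Hypothetical blocking key: records can only match within a zip code."""
    return record["zip"]


def candidate_pairs(
    left: List[Record],
    right: List[Record],
    key: Callable[[Record], Hashable],
) -> Iterator[Tuple[Record, Record]]:
    """Yield only the cross-data-set pairs whose blocking keys are equal."""
    # Index the right-hand data set by blocking key.
    blocks: Dict[Hashable, List[Record]] = defaultdict(list)
    for record in right:
        blocks[key(record)].append(record)
    # Pair each left-hand record only with records in the same block.
    for record in left:
        for other in blocks.get(key(record), []):
            yield record, other
```

With two left-hand records and three right-hand records, a full comparison would evaluate six pairs. If only two of those pairs share a blocking key, the remaining matching functions of the rule are evaluated on two pairs instead of six.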