Record linkage is the process of identifying records across two or more data sets that represent the same entity. A record linkage process that computes the similarity of every pair of records can be computationally prohibitive for large data sets, because the number of pairwise comparisons grows with the product of the data set sizes. A blocking scheme may be used to reduce the number of comparisons by dividing the records into blocks and comparing only records within the same block.
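By way of illustration only, the effect of a blocking scheme on the number of comparisons may be sketched as follows. The record fields (`surname`, `zip`), the blocking key (first letter of the surname combined with the ZIP code), and the function names are assumptions chosen for this example and are not prescribed by any particular blocking scheme.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus the ZIP code.
    return (record["surname"][:1].lower(), record["zip"])


def group_by_block(records, key_fn):
    # Map each blocking key to the indices of the records that share it.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)
    return blocks


def candidate_pairs(left, right, key_fn=blocking_key):
    # Pair a left record with a right record only when both fall in the same block.
    left_blocks = group_by_block(left, key_fn)
    right_blocks = group_by_block(right, key_fn)
    pairs = []
    for key, left_ids in left_blocks.items():
        pairs.extend(product(left_ids, right_blocks.get(key, [])))
    return pairs


# Two small illustrative data sets (made-up values).
left = [
    {"surname": "Smith", "zip": "10001"},
    {"surname": "Jones", "zip": "10001"},
    {"surname": "Smith", "zip": "94105"},
]
right = [
    {"surname": "Smyth", "zip": "10001"},
    {"surname": "Jonas", "zip": "10001"},
    {"surname": "Brown", "zip": "60601"},
    {"surname": "Smith", "zip": "94105"},
]

exhaustive = len(left) * len(right)      # every left record against every right record
blocked = candidate_pairs(left, right)   # only records that share a block

print(f"exhaustive comparisons: {exhaustive}")
print(f"comparisons after blocking: {len(blocked)}")
```

In this sketch, only the candidate pairs returned by `candidate_pairs` would be passed to a similarity computation. A key that is too coarse leaves many candidate pairs to compare, while a key that is too strict may place true matches in different blocks.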
Some blocking schemes may be created using machine learning, in which an algorithm is trained on a set of labeled data. The set of labeled data is a data sample that has been manually labeled (e.g., tagged) by humans to guide the training. However, the set of labeled data used for training is usually too small to be representative of the unlabeled data. As a result, a blocking scheme learned from it may perform poorly when processing the unlabeled data by generating too many candidate matches.