Record matching refers to identifying matching or duplicate records, that is, records that correspond to the same real-world entity. One type of record matching task is to identify bibliographic records in a first database that correspond to the same publication as records in a second database. The goal of record matching in this case is to find pairs of records that represent the same publication.
Record matching has applications in information integration, data warehousing, census data, and health-care records management. The standard approach to record matching is to measure textual similarity between records. This is typically done by computing a variety of similarity scores for a candidate pair of records. A similarity score quantifies the textual similarity between the two records on some subset of attributes, and is computed using a string similarity function such as edit distance, Jaccard similarity, or cosine similarity. These scores are then combined using some logic to generate a final similarity score, which is used to determine whether the two records are a match.
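By way of illustration only, the three string similarity functions named above can be sketched as follows. The whitespace tokenization, the attribute names `title` and `authors`, and the helper `similarity_scores` are assumptions made for this example and are not taken from any particular record matching system.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (assumed tokenization)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine similarity over token-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def similarity_scores(r1: dict, r2: dict) -> dict:
    """Score a candidate pair on hypothetical 'title' and 'authors' attributes."""
    return {
        "title_edit_distance": edit_distance(r1["title"], r2["title"]),
        "title_jaccard": jaccard(r1["title"], r2["title"]),
        "authors_cosine": cosine(r1["authors"], r2["authors"]),
    }
```

Each candidate pair thus yields a small vector of scores, and the combining logic operates on that vector rather than on the raw text.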
Manually generating logic for combining similarity scores, however, can be difficult. This is why many record matching techniques use a learning-based approach. In the learning-based approach, record matching is viewed as a classification problem, where each pair has to be classified as a match or a non-match, and a suitable classifier is learned using labeled examples of matching and non-matching pairs.
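The classification view can be illustrated with a deliberately small sketch. It assumes that each pair has already been reduced to a single similarity score, and it uses a one-dimensional threshold learner as a hypothetical stand-in for "a suitable classifier"; practical systems typically learn over several scores at once.

```python
from typing import List, Tuple


def learn_threshold(labeled: List[Tuple[float, bool]]) -> float:
    """Pick the threshold t that minimizes training errors for the
    rule 'score >= t means match'. `labeled` holds (score, is_match)."""
    # Baseline: an infinite threshold labels every pair a non-match,
    # so its error count equals the number of true matches.
    best_t = float("inf")
    best_err = sum(1 for _, is_match in labeled if is_match)
    for t in sorted({score for score, _ in labeled}):
        err = sum(1 for score, is_match in labeled if (score >= t) != is_match)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def classify(score: float, threshold: float) -> bool:
    """Classify a pair as a match (True) or a non-match (False)."""
    return score >= threshold


# Labeled examples of matching and non-matching pairs (made-up scores).
examples = [(0.9, True), (0.8, True), (0.4, False), (0.2, False)]
threshold = learn_threshold(examples)
```

With these made-up examples the learned rule separates the two labeled groups, which is the role the manually written combining logic would otherwise have to play.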
One issue, however, is how to select the labeled examples. One type of learning-based approach uses passive learning. In the passive learning approach, a user manually selects a set of examples to be labeled. Another type of learning-based approach uses active learning. Active learning is a form of machine learning in which the learning algorithm itself selects the set of examples to be labeled. Active learning is important in record matching because manually identifying a suitable set of examples to label can be difficult.
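To make the contrast with passive learning concrete, the following is a minimal sketch of one common active learning strategy, uncertainty sampling, in which the algorithm asks for the label of the pair it is least sure about. It is not presented as the method of any specific existing technique. The names `fit_threshold`, `active_learn`, and `oracle` (the user who supplies labels) are hypothetical, and the learner is a simple one-dimensional threshold.

```python
from typing import Callable, Dict, Tuple


def fit_threshold(scores: Dict[str, float], labeled: Dict[str, bool]) -> float:
    """Midpoint between the lowest labeled match and the highest labeled non-match."""
    matches = [scores[p] for p, y in labeled.items() if y]
    non_matches = [scores[p] for p, y in labeled.items() if not y]
    if not matches or not non_matches:
        return 0.5  # assumed fallback when only one class has been seen
    return (min(matches) + max(non_matches)) / 2


def active_learn(scores: Dict[str, float],
                 oracle: Callable[[str], bool],
                 budget: int) -> Tuple[float, Dict[str, bool]]:
    """Request up to `budget` labels, always choosing the unlabeled pair
    whose score is closest to the current threshold."""
    ordered = sorted(scores, key=scores.get)
    # Seed with the lowest- and highest-scoring pairs.
    labeled = {ordered[0]: oracle(ordered[0]), ordered[-1]: oracle(ordered[-1])}
    for _ in range(budget):
        unlabeled = [p for p in scores if p not in labeled]
        if not unlabeled:
            break
        t = fit_threshold(scores, labeled)
        query = min(unlabeled, key=lambda p: abs(scores[p] - t))
        labeled[query] = oracle(query)  # the algorithm, not the user, chose this pair
    return fit_threshold(scores, labeled), labeled


# Made-up pair scores; the simulated user labels a pair a match when its score >= 0.5.
pair_scores = {"a": 0.1, "b": 0.3, "c": 0.45, "d": 0.55, "e": 0.7, "f": 0.95}
final_t, labels = active_learn(pair_scores, lambda p: pair_scores[p] >= 0.5, budget=3)
```

In this toy run the label requests concentrate on pairs near the decision boundary, so the clearly matching pair `e` is never sent to the user.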
One limitation, however, of existing active learning record matching techniques is that they do not allow a user to control the quality of the learned classifier. Stated in informal terms, the quality of a classifier in record matching is measured using its precision and recall. The recall of a classifier is the number of pairs that it classifies as a match, and the precision is the fraction of these pairs that are true matches. But current active learning record matching techniques lack a systematic way of using the learning algorithm to ensure that the learned classifier has precision above some threshold. Moreover, the behavior of these algorithms can be unpredictable, and both the precision and the recall of the learned classifier can decrease when more labeled examples are provided. This unpredictability makes it difficult to use these algorithms in record matching settings with specific quality requirements.
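The informal quality measures described above can be computed as in the following sketch, which assumes that the set of true matches is known for evaluation purposes. The function name and the pair identifiers are illustrative. For comparison, the sketch also computes recall in its more conventional form, the fraction of true matches that the classifier finds.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def quality(predicted: Set[Pair], true_matches: Set[Pair]) -> Dict[str, float]:
    """Precision and recall of a classifier's predicted matches."""
    correct = len(predicted & true_matches)
    return {
        # Informal recall, as defined in the text: pairs classified as matches.
        "recall_count": len(predicted),
        # Fraction of predicted matches that are true matches.
        # Returning 1.0 for an empty prediction set is an assumed convention.
        "precision": correct / len(predicted) if predicted else 1.0,
        # Conventional recall: fraction of true matches that were found.
        "recall_fraction": correct / len(true_matches) if true_matches else 1.0,
    }


predicted_pairs = {(1, 2), (3, 4), (5, 6)}
true_pairs = {(1, 2), (3, 4), (7, 8), (9, 10)}
q = quality(predicted_pairs, true_pairs)
meets_requirement = q["precision"] >= 0.9  # example precision threshold
```

Here two of the three predicted pairs are true matches, so the classifier would fail a precision requirement of 0.9.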
Another limitation of these existing active learning record matching techniques is that they do not scale to large inputs. For each requested label, these algorithms iterate over all record pairs, and the number of such pairs is quadratic in the input size. This limits the size of the inputs that these techniques can handle.
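The quadratic growth can be seen with a short calculation. The sketch below assumes two databases whose records are all compared against each other, with no blocking or filtering; the function names are illustrative.

```python
def candidate_pair_count(n_first: int, n_second: int) -> int:
    """Number of record pairs when every record in the first database
    is paired with every record in the second."""
    return n_first * n_second


def pairs_scanned(n_first: int, n_second: int, labels_requested: int) -> int:
    """Total pairs visited if every label request iterates over all pairs."""
    return candidate_pair_count(n_first, n_second) * labels_requested


small = candidate_pair_count(1_000, 1_000)        # one million pairs
large = candidate_pair_count(100_000, 100_000)    # ten billion pairs
work = pairs_scanned(100_000, 100_000, labels_requested=100)
```

Growing each database by a factor of 100 multiplies the number of pairs by 10,000, and that cost is paid again for every label requested.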