Record linkage is used to find common entities (e.g., persons, households, or businesses) between pairs of data records in disparate data files. Once these links are found, an improved data set can be obtained by merging the matched entity data. The resulting improved data set can then be used for the appropriate business purpose or further examined by “data mining”. If, however, the record linkage is done poorly, the “improved” data set might actually be worse than before. Therefore, being able to test or verify record linkage systems is important to ensure quality and to allow improvements.
Testing record linkage systems that operate on large data sets (“big data”) is difficult in practice. Testing them well, for example by producing quantitative metrics such as the numbers of false positive, false negative, true positive, and true negative matches, is harder still.
Known methods for testing record linkage systems usually rely on ground-truth match data, when such data are available. Ways to obtain ground-truth match data include reusing data from a previous matching test, laboriously creating the data manually, or creating synthetic data.
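As an illustration of how ground-truth match data yields such quantitative metrics, the following is a minimal sketch in Python. It is not taken from any particular record linkage system: the function name `confusion_counts`, the representation of matches as `(id_a, id_b)` pairs, and the choice to count true negatives over every possible cross-file pair are all assumptions made for this example.

```python
def confusion_counts(predicted, truth, ids_a, ids_b):
    """Count match outcomes for one linkage run against ground truth.

    predicted, truth: iterables of (id_a, id_b) pairs declared as matches
    (hypothetical representation). ids_a, ids_b: record identifiers of the
    two files; every cross-file pair is treated as a candidate (assumption).
    """
    predicted = set(predicted)
    truth = set(truth)
    tp = len(predicted & truth)   # links found that are real matches
    fp = len(predicted - truth)   # links found that are not real matches
    fn = len(truth - predicted)   # real matches the system missed
    total_pairs = len(ids_a) * len(ids_b)
    tn = total_pairs - tp - fp - fn  # non-matches correctly left unlinked
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Tiny hand-made example with three records per file.
ids_a = ["a1", "a2", "a3"]
ids_b = ["b1", "b2", "b3"]
truth = {("a1", "b1"), ("a2", "b2")}
predicted = {("a1", "b1"), ("a2", "b3")}

counts = confusion_counts(predicted, truth, ids_a, ids_b)
print(counts)  # {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 6}
```

With two files of millions of records each, the number of candidate pairs is the product of the file sizes, so the true negative count dwarfs the other three. This is one reason producing such metrics for “big data” is hard in practice.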