The present invention relates generally to the field of data processing, and more particularly to data de-duplication.
Data de-duplication is an important operation when cleansing a data source. Data de-duplication is typically achieved by searching a database for duplicate records that represent the same entity and merging those records so that a single record remains for the entity. In searching for duplicate records, de-duplication techniques typically distribute the records into multiple groups such that similar records are likely to fall into the same group, conduct a column-by-column comparison of two selected records within the same group, and compute a match score indicating the likelihood that the selected records represent the same entity. Record pairs that have a sufficiently high score are considered duplicates and may be merged to create a single ‘golden’ record.
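By way of illustration only, the grouping, comparison, scoring, and merging steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular system: the grouping rule (first letter of a `last_name` column), the use of string similarity from Python's standard `difflib` module as the per-column comparison, the unweighted average as the match score, the 0.85 threshold, and the "keep the longer value" merge rule are all hypothetical choices made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Assumed grouping rule: first letter of the surname, lowercased.
    return record["last_name"][:1].lower()


def match_score(a, b, columns):
    # Column-by-column comparison: average string similarity in [0, 1].
    sims = [
        SequenceMatcher(None, str(a[c]).lower(), str(b[c]).lower()).ratio()
        for c in columns
    ]
    return sum(sims) / len(sims)


def find_duplicates(records, columns, threshold=0.85):
    # Distribute record indices into groups so that only records
    # sharing a group are compared with each other.
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[block_key(rec)].append(idx)

    # Score every pair within each group; keep pairs at or above the threshold.
    pairs = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            score = match_score(records[i], records[j], columns)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


def merge(a, b):
    # Assumed merge rule: for each column, keep the longer value.
    return {c: a[c] if len(str(a[c])) >= len(str(b[c])) else b[c] for c in a}
```

Under these assumptions, two records such as `{"first_name": "Jon", "last_name": "Smith", "city": "Boston"}` and `{"first_name": "John", "last_name": "Smith", "city": "Boston"}` fall into the same group, score above the threshold, and merge into a single record. A practical system would typically use more selective grouping keys, column-specific comparison functions, and weighted scores.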