The present invention relates to the field of digital computer systems, and more specifically, to a mechanism for deduplicating a set of records.
Deduplicating data records that represent the same entity is a challenging task. In particular, recognizing which records with potentially slightly different content represent the same entity, and merging those duplicate records into a single golden record, is a critical data quality task.
The deduplication process is made up of two phases: the match phase and the merge phase. During the match phase, the system scans all records of a table and tries to recognize groups of records that represent the same entity. During the merge phase, the system tries to build a single golden record by merging information from the duplicate records of each group. Since the records to merge may contain different and even conflicting information, the merging process is challenging.
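By way of illustration only, the two phases described above may be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the field names (name, birth_date), the matching rule (an exact match on a normalized name plus birth date) and the conflict rule (the most frequent non-empty value wins) are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def match_key(record):
    # Hypothetical matching rule: case- and whitespace-insensitive name
    # combined with the birth date. Real matchers are usually fuzzier.
    name = " ".join(record.get("name", "").lower().split())
    return (name, record.get("birth_date", ""))


def match_phase(records):
    # Scan all records and group those that appear to represent the same entity.
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return list(groups.values())


def merge_phase(group):
    # Build one golden record from a group of duplicates.
    # Assumed conflict rule: keep the most frequent non-empty value per field;
    # ties go to the value encountered first.
    fields = []
    for record in group:
        for field in record:
            if field not in fields:
                fields.append(field)
    golden = {}
    for field in fields:
        values = [r[field] for r in group if r.get(field)]
        golden[field] = Counter(values).most_common(1)[0][0] if values else ""
    return golden


def deduplicate(records):
    # Match phase followed by merge phase, one golden record per group.
    return [merge_phase(group) for group in match_phase(records)]
```

In this sketch, conflicting values (for example two different cities for the same person) are resolved by simple majority, which shows why the merge phase is the harder of the two: any such rule can discard the correct value when the majority of duplicates is wrong.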