1. Technical Field
The embodiments herein generally relate to data management, and, more particularly, to detection and removal of duplicate records.
2. Description of the Related Art
A database comprises records, each of which is a collection of values for multiple fields. A purchase database containing the transaction details of customers is a typical example. Any such database accumulates duplicate records over time, for reasons ranging from error-prone data entry to the merging of multiple databases. Maintaining and processing duplicate records incurs unnecessary cost.
The brute-force approach to de-duping is to compare each record with every other record in the database, which is computationally intensive because the number of comparisons grows quadratically with the number of records. One way to find duplicates with less computation is to generate a checksum for each record. Such a checksum acts as a key for the record and may be formed by combining one or more fields. These keys are then used to find duplicates. For example, a key could be formed from the first four characters of the field “Last Name” and the first five characters of the field “Zip Code”. Techniques like this help in finding first-level duplicates.
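The key-based approach described above can be sketched as follows. This is a minimal illustration only, not a definitive implementation: the function names make_key and group_by_key, the dictionary representation of a record, and the sample records are hypothetical, and the whitespace trimming and upper-casing of the last name are assumptions added for the sketch.

```python
from collections import defaultdict


def make_key(record):
    """Build a key from the first four characters of "Last Name"
    and the first five characters of "Zip Code"."""
    # Assumption: trim whitespace and ignore letter case in the name.
    last_name = record.get("Last Name", "").strip().upper()
    zip_code = record.get("Zip Code", "").strip()
    return last_name[:4] + zip_code[:5]


def group_by_key(records):
    """Return {key: [record indices]} for keys shared by two or more records."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[make_key(record)].append(index)
    # Only keys with more than one record are candidate duplicates.
    return {key: idxs for key, idxs in groups.items() if len(idxs) > 1}


# Hypothetical sample data.
records = [
    {"Last Name": "Johnson", "Zip Code": "94105"},
    {"Last Name": "Johnsen", "Zip Code": "94105-1234"},
    {"Last Name": "Smith", "Zip Code": "10001"},
]
candidates = group_by_key(records)
```

Each record is visited once to build its key, and only records that share a key need to be compared further, instead of comparing every record with every other record.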
However, these techniques treat the fields as independent of one another, which is not true in most instances. Fields such as state, zip code, and the area code of the phone field depend on each other. In many cases, the user identification part of an email ID depends on the name field of the record, and the domain part of the email ID may depend on the company field.
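The dependence between an email ID and the name field can be illustrated with a small sketch. The function names email_parts and email_depends_on_name, the field names "First Name", "Last Name" and "Email", and the substring test itself are hypothetical choices made for illustration; they are one simple way such a dependence might be checked, not a method taken from the text above.

```python
def email_parts(email):
    """Split an email ID into its user identification part and domain part."""
    local, _, domain = email.strip().lower().partition("@")
    return local, domain


def email_depends_on_name(record):
    """Return True if the user identification part of the email ID
    contains the first or last name of the record (assumed heuristic)."""
    local, _ = email_parts(record.get("Email", ""))
    first = record.get("First Name", "").strip().lower()
    last = record.get("Last Name", "").strip().lower()
    if not local or not last:
        return False
    # Keep only letters so that "jane.doe" and "jane_doe" both become "janedoe".
    letters = "".join(ch for ch in local if ch.isalpha())
    return last in letters or (first != "" and first in letters)
```

For a record with the name Jane Doe, an email ID such as jane.doe@example.com would be reported as dependent on the name, whereas sales@example.com would not.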