In today's complex systems, data is entered by different sources at different levels. This can produce multiple representations of the same logical real-world entity (e.g., contact information) because of data entry errors, varying conventions, and a variety of other reasons.
Such duplicated information may cause significant problems for users of the data. For example, it may increase direct mailing costs for businesses because some customers are sent the same mailing several times. Duplicates may also cause incorrect results in analytic queries (e.g., the number of customers in a particular location) and lead to erroneous data mining models.
Since data typically grows rapidly over time, the problem of duplicate records worsens with time. Hence, significant time and money are spent on detecting, eliminating, and handling duplicate records in a system.