Managing and storing data is one of the most important needs of every type of organization (e.g., large or small companies, commercial entities, non-profit organizations, government entities, or any similar entities). There can be numerous types of data, including, but not limited to, customer data, vendor data, employee data, and administrative data. The data is gathered from a variety of data sources and is stored electronically, in various formats, as records in databases. Examples of data sources may include, but are not limited to, employee databases, sales databases, contact center databases, offline records, customer escalation records, records of a company's social media followers, customer query records, and mailing list records.
Each record from these data sources may contain different information about customers, and the information may be in different formats. For example, data gathered from mailing lists may include the email addresses of customers along with their names. Similarly, data gathered from social media profiles may include customer names and information related to social media. The formats differ because each data source is pre-customized to receive information in its own format. Moreover, a customer may provide different information about themselves to different sources. For example, the name of a customer on a social media profile could differ slightly from the name on the mailing list.
There may be instances in which different records within a company correspond to the same customer or entity, so that the database holds multiple records for that customer. This issue is aggravated when records are entered in the database in a free text format. A free text format allows any information to be entered in any field of a record without limitations or checks. Thus, the records may include redundant data, incorrect values, or inconsistent values. Other issues encountered when entering free text data may include missing data, incorrect spellings, use of abbreviations and short forms, formatting issues, use of synonyms, and the like. Because of these issues and errors, multiple records may inadvertently be created for the same entity or customer.
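The effect of free text entry described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not an approach taken from this document: the record values, the abbreviation table, the helper names (normalize, similarity, likely_duplicates), and the 0.85 threshold are all hypothetical. The sketch shows that exact matching treats spelling, spacing, and abbreviation variants as distinct records, while basic normalization combined with a string similarity ratio from Python's standard difflib module flags them as probable duplicates.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would need a far larger one.
ABBREVIATIONS = {
    "corp": "corporation",
    "inc": "incorporated",
    "ltd": "limited",
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace,
    and expand the abbreviations listed in ABBREVIATIONS."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between the normalized forms of a and b."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_duplicates(records, threshold=0.85):
    """Return every pair of records whose similarity meets the threshold.

    The threshold is an assumed value chosen for this illustration.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((records[i], records[j]))
    return pairs


# Hypothetical free-text entries: three variants of one customer, one unrelated.
records = ["Acme Corp.", "ACME  Corporation", "Acme Corportion", "Globex Ltd"]

distinct_by_exact_match = len(set(records))  # exact matching sees no duplicates
pairs = likely_duplicates(records)           # fuzzy matching pairs the Acme variants

print(distinct_by_exact_match)
for pair in pairs:
    print(pair)
```

Note that this sketch compares every record with every other record, so the number of comparisons grows quadratically with the number of records. It is meant only to make the problem concrete, and it would not be practical for a large database.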
Over time, as data is entered and records from different sources are merged, duplicate copies may begin to creep into the database. The occurrence of such duplicate copies is referred to as data duplication. Storing duplicate data in a database is inefficient for several reasons. For example, duplicate data could make pricing analysis almost impossible, duplicate vendor data could make vendor rationalization difficult, duplicate data may lead to memory constraints, and the like. Therefore, identifying and eliminating duplicate data is one of the major problems in the area of data cleaning and data quality. Several approaches have been implemented to counter the problem of data duplication. However, none of these approaches is effective, specifically at large scale.