A business or enterprise may store information about various items in the form of electronic records. For example, a company might have an employee database where each row in the database represents a record containing information about a particular employee (e.g., the employee's name, date of hire, and salary). Moreover, different electronic records may actually be related to a single item. For example, a human resources database and a sales representative database might both contain records about the same employee. In some cases, it may be desirable to consolidate multiple records to create a single data store that contains a single electronic record for each item represented in the database. Such a goal might be associated with, for example, a master data management program.
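By way of illustration only, the consolidation goal described above might be sketched as follows. The record fields, the `match_key` rule (a normalized name comparison), and the "first non-empty value wins" merge policy are all hypothetical assumptions made for this sketch; they are not taken from any particular master data management program.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmployeeRecord:
    # Hypothetical fields; real data stores would define their own schema.
    name: str
    date_of_hire: Optional[str] = None
    salary: Optional[int] = None
    sales_region: Optional[str] = None


def match_key(record):
    # Assumed matching rule: names compared case-insensitively with
    # whitespace collapsed. Real matching is usually far more involved.
    return " ".join(record.name.lower().split())


def consolidate(*stores):
    """Merge records from several data stores into one record per item."""
    merged = {}
    for store in stores:
        for record in store:
            key = match_key(record)
            if key not in merged:
                # First record seen for this item: copy it into the master store.
                merged[key] = EmployeeRecord(**vars(record))
                continue
            existing = merged[key]
            # Assumed merge policy: fill only the fields that are still empty.
            for field, value in vars(record).items():
                if getattr(existing, field) is None and value is not None:
                    setattr(existing, field, value)
    return list(merged.values())


# A human resources store and a sales representative store that both
# describe the same employee under slightly different spellings.
hr_store = [EmployeeRecord("Ann Lee", date_of_hire="2015-03-01", salary=70000)]
sales_store = [
    EmployeeRecord("ann  lee", sales_region="West"),
    EmployeeRecord("Bob Ray", sales_region="East"),
]
master = consolidate(hr_store, sales_store)
```

In this sketch the two records for the same employee collapse into a single master record that carries both the human resources fields and the sales region, while the unmatched record is carried over unchanged.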
Currently, the consolidation process in a master data management program is a manual, time-consuming, and error-prone operation. For example, a person might manually review records of different data stores looking for potential duplicates. When a potential duplicate is found, he or she might investigate to determine the best way for the information to be combined. Such an approach, however, may be impractical when a substantial number of records and/or data stores are involved.
Despite significant advances in enterprise data management and analytics, data consolidation remains time-consuming: a data steward must inspect and cleanse a data set that contains massive amounts of customer information and bring the data into a state that is usable for analysis. To improve data quality, data stewards must also identify and address issues such as unresolved duplicates, misspellings, missing data, data discrepancies, format inconsistencies, and violations of business rules that define quality from an organization's subjective perspective.
Extract-transform-load (ETL) processing cannot always address data quality issues automatically. Because ETL is deterministic in nature, it cannot handle unpredictable data issues; moreover, ETL is not a tool for the business data end-user. Detection and refinement of data are complementary to ETL processing and should include handling data quality issues that cannot be handled automatically. For example, data discrepancies could require visual inspection and manual correction.