In today's information age, data is the lifeblood of any company, large or small; federal, commercial, or industrial. Data is gathered from a variety of different sources in various formats, or conventions. Examples of data sources may be: customer mailing lists, call-center records, sales databases, etc. Each record from these data sources contains different pieces of information (in different formats) about the same entities (customers in the example case). Each record from these sources is either stored separately or integrated together to form a single repository (i.e., a data warehouse or a data mart). Storing this data and/or integrating it into a single source, such as a data warehouse, increases opportunities to use the burgeoning number of data-dependent tools and applications in such areas as data mining, decision support systems, enterprise resource planning (ERP), customer relationship management (CRM), etc.
The old adage “garbage in, garbage out” is directly applicable to this environment. The quality of the analysis performed by these tools suffers dramatically if the data analyzed contains redundant values, incorrect values, or inconsistent values. This “dirty” data may be the result of a number of different factors including, but certainly not limited to, the following, as illustrated in FIG. 1: spelling errors (phonetic and typographical), missing data, formatting problems (incorrect field), inconsistent field values (both sensible and non-sensible), out of range values, synonyms, and/or abbreviations. Because of these errors, multiple database records may inadvertently be created in a single data source relating to the same entity or records may be created which don't seem to relate to any entity. These problems are aggravated when the data from multiple database systems is merged, as in building data warehouses and/or data marts. Properly combining records from different formats becomes an additional issue here. Before the data can be intelligently and efficiently used, the dirty data needs to be put into “good form” by cleansing it and removing these mistakes.
Thus, data cleansing necessarily involves the identifying of similarities between fields in different records. The simplest approach for determining which records have “similar” values for a particular field would be to define only identical values to be “similar”. If two records have the same value for the field, they would be considered to have similar values. Otherwise, they would not. This approach is very prone to “noise,” or errors present in the data causing differences between records describing the same object (i.e., causes records to have different values for the field).
Usually, the cleansing application output is evaluated by a human expert tracing through the result from the execution of the application on a given sample data set (where the correct answer is known). This interactivity is infeasible for all but the smallest record sets.
Conventional systems for visualizing and interfacing with the cleansing process are limited to only displaying information and do not allow user input. In most cases, only the cleansed result and some rudimentary statistics about it are presented in a report after the cleansing application is complete. Examples of reported statistics include the number of records in the result, the average size of duplicate record sets, and number of mistaken values the application has corrected. The report may also include a listing of anomalies or “special cases” encountered during processing requiring further investigation. Online human interaction with learning systems for data cleansing is not supported by conventional application interfaces.