In today's information age, data is the lifeblood of any company, large or small; federal, commercial, or industrial. Data is gathered from a variety of different sources in various formats, or conventions. Examples of data sources may be: customer mailing lists, call-center records, sales databases, etc. Each record from these data sources contains different pieces of information (in different formats) about the same entities (customers in the example case). Each record from these sources is either stored separately or integrated together to form a single repository (i.e., a data warehouse or a data mart). Storing this data and/or integrating it into a single source, such as a data warehouse, increases opportunities to use the burgeoning number of data-dependent tools and applications in such areas as data mining, decision support systems, enterprise resource planning (ERP), customer relationship management (CRM), etc.
The old adage “garbage in, garbage out” is directly applicable to this environment. The quality of the analysis performed by these tools suffers dramatically if the data analyzed contains redundant values, incorrect values, or inconsistent values. This “dirty” data may be the result of a number of different factors including, but certainly not limited to the following: spelling errors (phonetic and typographical), missing data, formatting problems (incorrect field), inconsistent field values (both sensible and non-sensible), out of range values, synonyms, and/or abbreviations. Because of these errors, multiple database records may inadvertently be created in a single data source relating to the same entity or records may be created which don't seem to relate to any entity. These problems are aggravated when the data from multiple database systems is merged, as in building data warehouses and/or data marts. Properly combining records from different formats becomes an additional issue here. Before the data can be intelligently and efficiently used, the dirty data needs to be put into “good form” by cleansing it and removing these mistakes.
Thus, data cleansing necessarily involves the identifying of similarities between fields in different records. The simplest approach for determining which records have “similar” values for a particular field would be to define only identical values to be “similar”. If two records have the same value for the field, they would be considered to have similar values. Otherwise, they would not. This approach is very prone to “noise,” or errors present in the output data, causing differences between records describing the same object (i.e., causes records to have different values for the field). Examples of errors occurring during data entry are illustrated in FIG. 1.
A simple way for evaluating a data cleansing method is the introduction of errors to a set of cleansed records producing an artificially “dirty” set of records. The “dirty” set of records may then be cleansed by the data cleansing method. The outputs may then be compared to evaluate the efficiency of the data cleansing method.
The error introduction may comprise the random addition, deletion, or alteration of characters in fields in the set of records. Duplicates of entire existing records may also be added to the cleansed records prior to the introduction of errors.
Conventional methods only produce as output types of errors and frequency of occurrence for a wide range of data types (i.e., last names, street addresses, etc.). These methods also may produce as output breakdowns for error rates based on how the information was entered into the system (i.e., information taken over the telephone, optical character recognizer, used entered form, etc.). However, these outputs are not combined in a single model (i.e., there may exist an error model for a particular type of information such as last names, telephone information, etc., but an error model for last names taken over the telephone would not exist).