In today's information age, data is the lifeblood of any company, large or small; federal, commercial, or industrial. Data is gathered from a variety of different sources in various formats, or conventions. Examples of data sources may be: customer mailing lists, call-center records, sales databases, etc. Each record from these data sources contains different pieces of information (in different formats) about the same entities (customers in the example case). Each record from these sources is either stored separately or integrated together to form a single repository (i.e., a data warehouse or a data mart). Storing this data and/or integrating it into a single source, such as a data warehouse, increases opportunities to use the burgeoning number of data-dependent tools and applications in such areas as data mining, decision support systems, enterprise resource planning (ERP), customer relationship management (CRM), etc.
The old adage “garbage in, garbage out” is directly applicable to this environment. The quality of the analysis performed by these tools suffers dramatically if the data analyzed contains redundant values, incorrect values, or inconsistent values. This “dirty” data may be the result of a number of different factors including, but certainly not limited to, the following: spelling errors (phonetic and typographical), missing data, formatting problems (incorrect field), inconsistent-field values (both sensible and non-sensible), out of range values, synonyms, and/or abbreviations (as illustrated in FIG. 1). Because of these errors, multiple database records may inadvertently be created in a single data source relating to the same entity or records may be created which don't seem to relate to any entity. These problems are aggravated when the data from multiple database systems is merged, as in building data warehouses and/or data marts. Properly combining records from different formats becomes an additional issue here. Before the data can be intelligently and efficiently used, the dirty data needs to be put into “good form” by cleansing it and removing these mistakes.
Thus, data cleansing necessarily involves the identifying of similarities between fields in different records. The simplest approach for determining which records have “similar” values for a particular field would be to define only identical values to be “similar”. If two records have the same value for the field, they would be considered to have similar values. Otherwise, they would not. This approach is very prone to “noise,” or errors present in the data causing differences between records describing the same object (i.e., causes records to have different values for the field).
Usually, the cleansing application output is evaluated by a human expert tracing through the result from the execution of the application on a given sample data set (where the correct answer is known). This interactivity is infeasible for all but the smallest record sets.
Many methods for evaluating the correctness of the application results solve this by only considering the number of records left in the cleansed output. Applications leaving a smaller number of records in the result are generally considered better, the idea being that these applications detected a higher number of duplicate records and eliminated them. The weakness of this measurement is that all discarded records are treated equally, whether they are actually duplicates (thrown out correctly) or were unique records (thrown out incorrectly). Additionally, measuring final record size alone fails to account for duplicate records being incorrectly left in the cleansed result.
This measurement rewards an application that aggressively merges records together (or has a low threshold for two records to be considered duplicates) and removes a large number of records. Duplicates incorrectly left increase the size of the final record set, while unique records incorrectly thrown out decrease the size. Thus, an application that makes the later type of error more frequently would be considered higher quality. Depending on the user application for which the data is being cleansed, incorrectly throwing out unique records may be much worse than leaving duplicate records in the final result.
Consider the example of a customer mailing list. A customer being represented by several records means that customer receives multiple copies of a catalog, which may be annoying. Completely removing all reference to the customer means that customer receives nothing, and that business could be completely lost.
Further, conventional measurements of cleansing applications totally separate correctness measurements from resource usage measurements (e.g., space used, execution time). Constraints on resources mean users usually are willing to make a tradeoff between quality and resources used. For example, a user with a deadline may choose to accept a result with more duplicates if the solution is arrived at quickly. There is no way to express this tradeoff if correctness and resource usage measurements are kept separate.