Data warehouses, which are repositories of data collected from several data sources, form the backbone of most current CRM and decision support applications. Data entry mistakes at any of these sources can introduce errors. Because high-quality data is needed to gain the confidence of users of CRM and decision support applications built over data warehouses, ensuring data quality is critical to the success of data warehouse implementations. Therefore, significant amounts of time and money are spent on detecting and correcting errors and inconsistencies. Significantly, the types of errors and inconsistencies can be domain-specific.
The process of cleaning dirty data is often referred to as “data cleaning”. Data cleaning is an essential step in populating and maintaining data warehouses and centralized data repositories. A very important data cleaning operation is that of “joining” similar data. For example, consider a sales data warehouse. Owing to typing mistakes, differences in conventions, and similar errors, product names and customer names in sales records may not exactly match the master product catalog and the reference customer registration records, respectively. An exact-match join would therefore fail to link these records to their reference entries.
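The "joining" of similar data described above can be illustrated with a minimal sketch. It assumes Python's standard `difflib` module as a stand-in similarity measure; the catalog entries, the sales names, the helper `fuzzy_lookup`, and the cutoff of 0.8 are all hypothetical choices made for illustration and are not taken from the text.

```python
from difflib import get_close_matches

def fuzzy_lookup(name, catalog, cutoff=0.8):
    """Return the catalog entry most similar to `name`, or None if none clears `cutoff`.

    Hypothetical helper: the similarity measure is difflib's ratio, chosen
    only for illustration.
    """
    # Normalize case so that capitalization differences do not count as errors.
    normalized = {entry.lower(): entry for entry in catalog}
    matches = get_close_matches(name.lower(), list(normalized), n=1, cutoff=cutoff)
    return normalized[matches[0]] if matches else None

# Illustrative data (assumed, not from the source).
master_catalog = ["Wireless Keyboard", "Optical Mouse", "Laser Printer"]
sales_names = ["Wireles Keybord", "Garden Hose"]

# Join each sales-record name to its closest catalog entry, if any.
joined = {name: fuzzy_lookup(name, master_catalog) for name in sales_names}
```

In this sketch the misspelled name is linked to its catalog entry, while a name with no sufficiently similar entry is left unmatched rather than forced onto a wrong one. The cutoff controls that trade-off and would need tuning for a real domain.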
The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality. It is often the case that the same logical real-world entity has multiple representations in the data warehouse. For example, when a customer named Lisa purchases products from a retailer twice, her name might appear as two different records: [Lisa Doe, Seattle, Wash., USA, 98025] and [Lisa Do, Seattle, Wash., United States, 98025]. The discrepancy can be due, for example, to data entry errors and/or preferences of the salesperson who enters the data. Such duplicated information can significantly increase direct mailing costs because several customers like Lisa may receive multiple catalogs. In direct mailing campaigns with tight budget constraints, such errors can be the difference between success and failure of the campaign. Moreover, such errors can cause incorrect results in analysis queries (e.g., How many customers of the retailer are there in Seattle?) and can lead to erroneous analysis models being built.
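To make the Lisa Doe example concrete, the following sketch flags likely duplicates by averaging a per-field string similarity. It is an illustration under stated assumptions, not the method of any particular system: the similarity measure is `difflib.SequenceMatcher` from the Python standard library, the third record and the 0.8 threshold are hypothetical, and the function names are invented for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2):
    """Unweighted mean of the field-by-field similarities (an assumed scoring rule)."""
    scores = [field_similarity(x, y) for x, y in zip(r1, r2)]
    return sum(scores) / len(scores)

def find_duplicates(records, threshold=0.8):
    """Return index pairs whose record similarity meets the (hypothetical) threshold."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if record_similarity(r1, r2) >= threshold
    ]

records = [
    ("Lisa Doe", "Seattle", "Wash.", "USA", "98025"),
    ("Lisa Do", "Seattle", "Wash.", "United States", "98025"),
    ("John Smith", "Portland", "Ore.", "USA", "97201"),  # assumed extra record
]

duplicate_pairs = find_duplicates(records)
```

Here only the two Lisa records are paired. Note that the country field contributes a low score because "USA" and "United States" share few characters, which shows why differences in conventions are harder to catch with character-level similarity than typing mistakes are. Comparing every pair of records is also quadratic in the number of records, so this form is suitable only for small inputs.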