Business analytics rely on large volumes of data from a variety of sources. A common data repository called a data warehouse is often used to provide a single data source for analysis. The data warehouse is filled with data from the various sources using an extract-transform-load (ETL) process. The extract process retrieves data from one or more source. The transformation process converts the retrieved data to a common schema and performs “data cleansing” to improve the data quality. Finally, the transformed data is loaded into the data warehouse. During the load process, the data quality may be further improved by enforcing uniqueness, referential integrity, and mandatory fields. Similarly, in Master Data Management a single source is used as an authentic information source for disbursing information in enterprise. This source is populated from multiple sources for create single view of entity using an ETL process. The transform process can be used to improve data quality and the load stage may enforce uniqueness, referential integrity, and other mandatory fields.
Analytics performed on the data in the warehouse is only as reliable as the quality of the data; accordingly, “data cleansing” has been an area of intense focus. The data cleansing process is often performed as part of the data transformation process. Data cleansing can correct noisy data, duplicates, and anomalies. Noisy data results from intentional or unintentional changes to some original representation such as: spelling errors, non-standard abbreviations, unknown words, repetitions, overloading information in a single column, or semantic problems such as using the wrong name of city or adding additional information (landmarks). Data duplicates are records that refer to the same entity, but differ by small data inconsistencies. Such duplicates may be caused by noisy data. Other data quality problems may be caused by anomalies so that the data deviates significantly from what is expected; for example, a zipcode that has too many digits. Such data is often handled using a combined analysis of the complete data (e.g. using frequency graphs and histograms) and observing the deviations from expected values. One simple cleaning for anomalous data replaces the anomalous data with NULL.
As businesses demand immediate analytics, the data analyzed must be high quality and fresh. The two requirements are in tension since cleansing to improve quality requires longer ETL processing, but freshness demands shorter ETL processing. The time-consuming process for data cleansing was tolerated when ETL processing was performed in batch mode during off-hours. Fresh data requires frequent ETL processing during the peak periods. Although techniques based on additional, or faster, computer resources may provide reduced cleansing time, solutions with existing computer resources are desired.