Due to the volume of data in circulation, the heterogeneity of data sources, and imperfect data collection and extraction, data used in modern applications such as data warehousing, data analysis, and web data extraction typically contains errors and anomalies. Examples of errors that can be present in a database include duplicate records, records that violate one or more integrity constraints, records with missing values, heterogeneous data formats, and syntactical errors. A large number of known data cleaning systems address different types of errors with different quality and performance guarantees. A common goal among data cleaning systems is to provide scalable cleaning algorithms that generate high-quality data repairs.
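By way of illustration only, three of the error types listed above (duplicate records, missing values, and an integrity constraint violation) can be detected on a toy data set as sketched below. The records, field names, helper functions, and the assumed functional dependency from zip code to city are all hypothetical and are not taken from any particular data cleaning system.

```python
from collections import defaultdict

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001", "city": "New York"},
    {"id": 2, "name": "Ann Lee", "zip": "10001", "city": "New York"},  # duplicate of id 1
    {"id": 3, "name": "Bob Roy", "zip": "10001", "city": "Newark"},    # conflicts with zip -> city
    {"id": 4, "name": "Cy Diaz", "zip": None, "city": "Boston"},       # missing value
]


def find_duplicates(rows, key_fields):
    """Return lists of row ids that agree on every field in key_fields."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def find_missing(rows):
    """Return (id, field) pairs whose value is None or an empty string."""
    return [(row["id"], f) for row in rows for f, v in row.items() if v in (None, "")]


def find_fd_violations(rows, lhs, rhs):
    """Return lhs values mapped to more than one rhs value (violating lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        if row[lhs] is not None:
            seen[row[lhs]].add(row[rhs])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


print(find_duplicates(records, ["name", "zip", "city"]))
print(find_missing(records))
print(find_fd_violations(records, "zip", "city"))
```

Detection of this kind only flags candidate errors; choosing a repair (for example, which of two conflicting city values to keep) is the part for which cleaning systems differ in their quality guarantees.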
The majority of existing data cleaning systems depend on automated cleaning of the data with minimal user intervention. Where user intervention is present, it comprises, for example, deciding which cleaning algorithms to use and adjusting the parameters of those algorithms. Some systems allow a user to be more involved by providing an interactive data cleaning approach, which can potentially improve the quality of the generated data repairs. However, such data cleaning systems involve only a single user in the cleaning process, and thus do not scale well to large amounts of data.