Recent statistics reveal that “dirty” data costs businesses in the United States billions of dollars annually. It is also estimated that data cleaning, a labor-intensive and complex process, accounts for 30%-80% of the development time in a typical data warehouse project. These statistics highlight the need for data-cleaning tools to automatically detect and effectively remove inconsistencies and errors in the data.
One of the most important questions in connection with data cleaning is how to model the consistency of the data, i.e., how to specify and determine whether the data is clean. This calls for appropriate application-specific integrity constraints to model the fundamental semantics of the data. Commercially available ETL (extraction, transformation, loading) tools typically have little built-in data cleaning capability, and a significant portion of the cleaning work still has to be done manually or by low-level programs that are difficult to write and maintain. The bulk of prior research has focused on the merge-purge problem for the elimination of approximate duplicates, or on detecting domain discrepancies and structural conflicts.
There has also been recent work on constraint repair that specifies the consistency of data in terms of constraints, and detects inconsistencies in the data as violations of the constraints. However, previous work on constraint repair is mostly based on traditional dependencies (e.g., functional and full dependencies) that were developed mainly for schema design but are often insufficient to capture the semantics of the data.
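As a rough illustration of detecting inconsistencies as constraint violations, the following minimal sketch checks a traditional functional dependency over a small set of records. The function name `fd_violations`, the sample records, and the dependency zip -> city are hypothetical and chosen only for illustration; they are not taken from any particular prior system.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the LHS values whose rows disagree on the RHS attributes.

    A functional dependency lhs -> rhs holds when any two rows that agree
    on every attribute in lhs also agree on every attribute in rhs.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        seen[key].add(tuple(row[a] for a in rhs))
    # A key mapped to more than one distinct RHS value violates the dependency.
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Hypothetical customer records; the dependency zip -> city is assumed.
rows = [
    {"name": "Ann", "zip": "07974", "city": "Murray Hill"},
    {"name": "Bob", "zip": "07974", "city": "New York"},
    {"name": "Cid", "zip": "10012", "city": "New York"},
]

violations = fd_violations(rows, ["zip"], ["city"])
```

Here the first two records share a zip value but differ on city, so that zip value is reported as a violation. A check of this kind applies the same rule to every record in the relation.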
A need exists for improved methods and apparatus for detecting data inconsistencies.