1. Technical Field
Present invention embodiments relate to assessing data quality, and more specifically, to assessing the completeness and domain validity of column data.
2. Discussion of the Related Art
Assessing the quality of data is an aspect of master data management, data integration, and data migration projects. In such projects, data are typically moved from heterogeneous sources to a consolidated target. Sources may include databases having a large number of data tables, where each table has a number of data columns. Columns may include data that should be identified as having poor quality (e.g., data that do not match expectations of the target system). Existing products for assessing the quality of data can be categorized as tools for data profiling or tools for implementing data rules.
Data profiling tools are designed to provide understanding of data content and to identify potential data quality problems based on a comparison of information about the data with a user's understanding and expectation of what the data should be. Typically, these tools involve a user launching a “column analysis” that will compute statistics (e.g., cardinality, number of null values, frequency distributions, recurring formats, inferred types, etc.) for each column. The user reviews the result of the column analysis and identifies data quality problems based on the user's domain knowledge.
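By way of illustration only, the statistics that such a column analysis computes may be sketched as follows. This is a minimal sketch and does not reflect any particular profiling product; the names profile_column, infer_format, and infer_type, the format-mask convention (letters shown as A, digits as 9), and the sample zip_codes column are assumptions introduced here.

```python
import re
from collections import Counter


def infer_format(value):
    """Reduce a value to a generic format mask: letters -> 'A', digits -> '9'."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", text)


def infer_type(values):
    """Guess the narrowest type that every non-null value can be cast to."""
    if not values:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except (ValueError, TypeError):
            continue
    return "string"


def profile_column(values):
    """Compute basic column statistics; None stands for a null value."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "frequencies": Counter(non_null),
        "formats": Counter(infer_format(v) for v in non_null),
        "inferred_type": infer_type(non_null),
    }


# Hypothetical column: one null and one placeholder among postal codes.
zip_codes = ["10001", "10001", "94105", None, "N/A"]
profile = profile_column(zip_codes)
```

In this sketch, the rare format mask A/A and the inferred type of string (rather than integer) are only signals; deciding whether "N/A" is a quality problem still rests on the user's domain knowledge, as described above.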
Tools for implementing data rules allow users to define rules describing features of good or bad data. For example, a rule can be an SQL query for all records that do not fulfill an expected condition (e.g., SELECT * FROM CUSTOMER WHERE ID IS NULL). Rules can be complex (e.g., based on a relationship between data from separate columns) and portable (e.g., a regular expression). Once a user defines conditions for a rule, the rule can be applied automatically to verify the quality of data.
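The two kinds of rule mentioned above, an SQL query and a regular expression, may be sketched as follows. This is an illustrative sketch using an in-memory SQLite database; the EMAIL column, the sample rows, and the simplified EMAIL_RULE pattern are assumptions introduced here. Note that in standard SQL a missing value is tested with IS NULL, because a comparison written with the equals operator against NULL never evaluates to true.

```python
import re
import sqlite3

# Hypothetical CUSTOMER table held in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER (ID INTEGER, EMAIL TEXT)")
conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?)",
    [
        (1, "ann@example.com"),
        (None, "bob@example.com"),
        (3, "not-an-email"),
    ],
)

# Rule 1 (SQL): select records whose ID is missing.
missing_id = conn.execute(
    "SELECT * FROM CUSTOMER WHERE ID IS NULL"
).fetchall()

# Rule 2 (regular expression): a deliberately simplified e-mail pattern,
# portable because it does not depend on the database engine.
EMAIL_RULE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
bad_email = [
    row
    for row in conn.execute("SELECT ID, EMAIL FROM CUSTOMER")
    if not EMAIL_RULE.fullmatch(row[1] or "")
]
```

Once defined, both rules can be re-run against new data without further user input, which is the automatic verification described above.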