Data integration and data cleaning applications need analytical tools to understand both the characteristics of a dataset's contents and the structure among its columns. Such analysis is critical before a cleaning query or a data exchange query can be designed for the data. Indeed, many data exchange tools assume and exploit the existence of constraints that hold in the data being integrated. Unfortunately, traditional data analysis tools fall short for tasks that explore the contents and structure of the data, such as finding all columns with a large fraction of null values, or all (almost-)key columns in a given relation. In general, data quality metrics are hard to define because data quality can be domain- or application-specific.
One well-known approach to assessing data quality is data profiling. A data profile is a collection of elements, each of which is an aggregate statistic of the data: the number of rows in the Customer table, the number of distinct values in the State column, the number of NULL or missing values in the Zip column, the distribution of values in the City column, whether or not “name, address” is a key, and so on. Several tools are already available for profiling a data source, and while such tools are useful, they fall short in many ways. For example, they do not provide efficient and scalable key discovery for tables, and they cannot extract pattern profiles from the data.
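To make the notion of a profile concrete, the elements above can be sketched as simple aggregates over a relation. The following is a minimal illustration, not a description of any particular profiling tool; the table contents, column names, and helper functions are all hypothetical.

```python
# Hypothetical relation: a few Customer rows with a nullable Zip column.
rows = [
    {"name": "Ann",  "state": "NY", "zip": "10001"},
    {"name": "Bob",  "state": "NY", "zip": None},
    {"name": "Cara", "state": "CA", "zip": "94105"},
    {"name": "Cara", "state": "CA", "zip": None},
]

def null_fraction(rows, col):
    """Fraction of rows whose value in `col` is missing."""
    return sum(r[col] is None for r in rows) / len(rows)

def distinct_count(rows, col):
    """Number of distinct values appearing in `col`."""
    return len({r[col] for r in rows})

def is_key(rows, cols):
    """A column set is a key iff no two rows agree on all its values."""
    return len({tuple(r[c] for c in cols) for r in rows}) == len(rows)

# A tiny profile: each entry is one aggregate statistic of the data.
profile = {
    "row_count": len(rows),                       # 4
    "null_fraction(zip)": null_fraction(rows, "zip"),   # 0.5
    "distinct(state)": distinct_count(rows, "state"),   # 2
    "key(name)": is_key(rows, ["name"]),                # False: "Cara" repeats
    "key(name,zip)": is_key(rows, ["name", "zip"]),     # True
}
```

Even this sketch hints at why naive profiling does not scale: exact key discovery over all column subsets is exponential in the number of columns, which is one reason efficient key discovery is singled out above as a shortcoming of existing tools.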