Data profiling is the initial analysis of existing data held on multiple source systems. The data are extracted from the source systems using ETL (extract, transform, and load) processes and are usually presented in a table format. Data profiling analyzes the data to retrieve information about each analyzed column, such as its inferred type, general statistics about the values it contains, common formats, and value distributions. With this information, the user can define the valid range of values for each column and measure the number of records that fall outside this range. Data profiling can also include a cross-domain analysis function that examines content and relationships across tables to identify overlaps in values between columns and any redundancy of data within or between tables. The cross-domain analysis can be used to identify primary/foreign key (PK/FK) relationships between tables. In addition, such a process can include monitoring of data quality, which is done by regularly evaluating a defined set of metrics or rules on the data.
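The column analysis, range check, and cross-domain overlap described above can be illustrated with a minimal Python sketch. It is not the method of any particular profiling tool. The function names (`infer_type`, `profile_column`, `count_out_of_range`, `overlap_ratio`), the sample tables, and the choice to treat empty or non-numeric values as range violations are all assumptions made for illustration.

```python
from collections import Counter


def infer_type(values):
    """Guess 'int', 'float', or 'str' for a list of raw string values."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    return "str"


def profile_column(values):
    """Collect simple per-column statistics (hypothetical profile layout)."""
    non_null = [v for v in values if v not in ("", None)]
    counts = Counter(non_null)
    return {
        "inferred_type": infer_type(non_null) if non_null else "unknown",
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(3),
    }


def count_out_of_range(values, low, high):
    """Count records outside [low, high].

    Assumption: empty or non-numeric values also count as violations.
    """
    bad = 0
    for v in values:
        try:
            if not (low <= float(v) <= high):
                bad += 1
        except (TypeError, ValueError):
            bad += 1
    return bad


def overlap_ratio(col_a, col_b):
    """Fraction of distinct values in col_a that also occur in col_b."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a) if a else 0.0


# Hypothetical sample tables, as they might look after extraction.
orders = {
    "customer_id": ["1", "2", "2", "4"],
    "amount": ["10.5", "-3", "250", ""],
}
customers = {"id": ["1", "2", "3"]}

amount_profile = profile_column(orders["amount"])
violations = count_out_of_range(orders["amount"], 0, 100)
fk_overlap = overlap_ratio(orders["customer_id"], customers["id"])
```

In this sketch, an overlap ratio close to 1.0 between `orders.customer_id` and `customers.id` would mark the pair as a candidate PK/FK relationship. Comparing every pair of columns across tables this way grows quadratically with the number of columns, which hints at why cross-domain analysis is costly on large data sets.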
The data profiling process is computationally intensive and requires sufficiently powerful systems to complete the task within an acceptable time. Moreover, processing large volumes of data can be prohibitive for some analytical processes, such as cross-domain analysis, which constrains the user to limit the analysis to a small set of data in order to complete it in a reasonable time.