Data profiling is the analysis of a set of input data entries with respect to the statistical properties of the data distribution, the quality of the data, and so on. It is an essential first step in the data integration process, where it helps in understanding new data sources during data integration and data cleansing. Data profiling can provide detailed information, such as reports on the number of valid addresses and the number of fields with missing information. Data profiling reports can be used to identify problems, such as bad files, and to identify new data values that need to be researched further and possibly accommodated.
Data profiling is usually a labor-intensive, resource-intensive and error-prone process. In recent years, data profiling systems have been developed that can dramatically reduce the time required for data profiling from months to weeks or even days. These data profiling systems provide good support for new enterprise applications, data warehouse projects, etc.
Existing data profiling methods include: pattern analysis, for determining whether or not data values in a field or fields match the expected pattern or structure; column analysis, for identifying statistical properties of data records, such as the number of null values contained in the data, the maximum and minimum values of the data, mean values, standard deviation, etc.; and domain analysis, for determining whether or not specific data values are acceptable or fall within an acceptable range of values. For example, where data concerning “gender” must be either “male” or “female”, any other data value is unacceptable.
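The three kinds of analysis can be illustrated with a minimal Python sketch. It is an illustration only, not the implementation of any particular profiling system; the function names, the phone-number pattern and the sample values are all hypothetical.

```python
import re
import statistics


def pattern_analysis(values, pattern):
    """Return the non-null values that do not fully match the expected pattern."""
    regex = re.compile(pattern)
    return [v for v in values if v is not None and not regex.fullmatch(v)]


def column_analysis(values):
    """Return simple statistical properties of a numeric column."""
    present = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(present),
        "min": min(present, default=None),
        "max": max(present, default=None),
        "mean": statistics.mean(present) if present else None,
        # Population standard deviation of the non-null values.
        "stdev": statistics.pstdev(present) if present else None,
    }


def domain_analysis(values, allowed):
    """Return the values that fall outside the set of acceptable values."""
    return [v for v in values if v not in allowed]


if __name__ == "__main__":
    # Hypothetical sample data, chosen only to exercise each function.
    print(pattern_analysis(["555-0100", "5550101", None], r"\d{3}-\d{4}"))
    print(column_analysis([10, None, 30, 20]))
    print(domain_analysis(["male", "female", "unknown"], {"male", "female"}))
```

Each function inspects only the surface form of the values: characters, numbers and set membership.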
However, the existing data profiling systems described above provide only shallow, symbol-level data analysis. For example, they analyze the characters, words and digits in the address of an input data entry, but they do not know the meaning, namely the semantics, of the analyzed characters, phrases and digits. In practice, various free-text data, e.g. organization names, customer addresses, etc., also need data profiling in many data integration and data cleansing applications. In particular, a plurality of free-text data entries may comprise a mixture of various data types, e.g. addresses, organization names, person names, phone numbers, etc.