A challenge in data integration projects is always the quantity of data sources that need to be understood. Enterprise information technology (IT) applications usually involve the use of several thousands of tables, each of these tables containing up to several hundreds of columns. The meaning and semantic of each of these columns is often not documented and it is a tedious but indispensable task to understand the data before the integration work can begin.
To facilitate this task, data understanding tools, such as IBM® InfoSphere® Information Analyzer have been invented to help the user understand and document the data sources by looking directly at the data contained in these sources. (IBM and InfoSphere are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide.)
Common features of such tools are to compute general statistics for each analyzed column, such as cardinality of the column, minimum/maximum values, most frequent value, etc., value distributions showing the user what different values are contained in a single column and how frequent the different values are, format analysis showing the user what different formats are used by the values of the columns, for instance a ZIP code being represented in 60% of the cases as a five digits value and in 40% of the cases as a five digits value followed by a four digits value, inferred types showing the user what type is really used by the values of the columns, independently from the data type of the column as defined in the metadata of the database, for instance a character based column may contain only numeric values coded as strings, and basic data classification, like for example the column contains free text, numbers or codes.
Advanced data profiling tools, such as the above mentioned Information Analyzer product additionally allow the use of specialized algorithms to detect specific domains from the data of a column. For instance, by using regular expressions or a dictionary of values, such algorithms may be able to detect that a column contains a ZIP code, a US address, a person name, an email address, a phone number, etc.
This ability to classify columns based on the values they contain is important since it documents a cryptic physical source with semantic information and facilitates its mapping to a business term or a logical model, which is well understood by the user. However in the state of the art, the algorithms used to classify the data are specific to the domain. For classical domains, such as US addresses, person names, etc. specialized algorithms are delivered out of the box. However for domains which are specific to a company like product references, enterprise codes, ticket reports, etc. or are not common like a postal code in Belgium, for example, new algorithms have to be written for the particular domain and plugged in the tool. Since the development of such algorithms requires special knowledge, the users usually have to classify a large part of their data manually, by looking at the data and using expert knowledge to recognize the domain. This is a time consuming operation, since the data of each individual column must be evaluated manually, even if the semantic of the column is used by different columns.