The present invention relates in general to the field of data management, and in particular to a method and a system for creating and processing a data rule. Still more particularly, the present invention relates to a data processing program and a computer program product for creating and processing a data rule.
With the increasing complexity of modern IT infrastructure, it becomes more and more important for enterprises to control the quality of their data. Bad data quality, such as data inconsistencies, lack of standardization, data redundancy, duplication of records, or misuse of data fields, can lead to a serious business impact. For this reason the market for tools specialized in data quality assessment and monitoring keeps expanding.
Such tools usually allow users to profile their data by analyzing the value distribution and the typical formats of each column, and to specify the valid range or set of values and formats. Some of these tools also provide functions to analyze the functional dependencies between multiple columns.
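As an illustration only, column profiling of the kind described above might be sketched as follows. The function names, the sample values and the convention of mapping digits to "9" and letters to "A" are assumptions made for this example and are not taken from any particular tool.

```python
from collections import Counter


def format_pattern(value: str) -> str:
    """Map each digit to '9' and each letter to 'A'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Return simple profiling statistics for one column of string values."""
    values = list(values)
    return {
        "count": len(values),
        "distinct": len(set(values)),
        # How often each literal value occurs.
        "value_distribution": Counter(values),
        # How often each generalized format occurs.
        "format_distribution": Counter(format_pattern(v) for v in values),
    }


# Hypothetical sample column that is supposed to hold phone numbers.
sample = ["555-0101", "555-0102", "555-0101", "bob@example.com"]
profile = profile_column(sample)
```

In this sketch, the format distribution makes the stray e-mail address stand out because its pattern differs from the dominant "999-9999" pattern, while comparing "count" with "distinct" reveals the duplicated value.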
From the result of these analyses the user can identify typical data quality problems, such as uniqueness constraints not being respected, or columns containing unexpected values or non-standardized values. Once these problems are discovered, the user can write data quality rules whose role is to monitor a potential data quality problem.
For example, if the data analysis shows that a column which is expected to contain unique values has some duplicate values, a rule may be defined to check the cardinality of that column. If duplicate values are later added to the column, the rule will indicate which records violate the uniqueness constraint of the column. Similarly, if the data analysis shows that a column which is supposed to contain phone numbers also contains other values, for instance e-mail addresses, a further rule may be defined to verify that the values in the column always match a regular expression representing the possible formats of a phone number.
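The two example rules can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the column names "id" and "phone", and the phone number format "ddd-dddd" are hypothetical and chosen only to keep the example short.

```python
import re
from collections import Counter

# Assumed phone number format for this sketch: three digits, a dash, four digits.
PHONE_RE = re.compile(r"\d{3}-\d{4}")


def uniqueness_violations(records, column):
    """Return the indices of records whose value in `column` is not unique."""
    counts = Counter(r[column] for r in records)
    return [i for i, r in enumerate(records) if counts[r[column]] > 1]


def format_violations(records, column, pattern):
    """Return the indices of records whose value does not fully match `pattern`."""
    return [i for i, r in enumerate(records) if not pattern.fullmatch(r[column])]


# Hypothetical records: "id" should be unique, "phone" should hold phone numbers.
records = [
    {"id": "1", "phone": "555-0101"},
    {"id": "2", "phone": "bob@example.com"},
    {"id": "2", "phone": "555-0102"},
]

duplicate_ids = uniqueness_violations(records, "id")
bad_phones = format_violations(records, "phone", PHONE_RE)
```

Each function returns the positions of the violating records rather than a single pass/fail flag, which corresponds to a rule indicating which records violate the constraint.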
Once the initial data analysis has been done to understand the semantics of the data, and existing data quality problems have been identified, the data quality is monitored over time by defining a collection of rules that verify all the constraints which correct data should satisfy. In this way, new data quality problems that appear over time through the addition of new data or the modification of existing data can be recognized early enough to be corrected.
On complex IT systems the manual definition of such rules is difficult and time-consuming because of the high number of tables and columns and the lack of metadata and documentation. Models may contain about 70,000 tables, most tables having 50-100 columns, and it is not clear in such systems what the semantics of each table are. The names of the tables and columns are quite often cryptic and give no indication of the real semantics of the data they contain.
Quite often the same type of rule can be applied to different columns from different data sources. For instance, a system may store phone numbers in many different tables for its different modules, and a rule checking the validity of phone numbers should ideally be defined for all of the different columns that contain them.
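One way to picture this reuse is to define a rule once and bind it to several columns. The following is a hedged sketch only: the table names, column names, phone number pattern and the idea of an explicit list of bindings are assumptions introduced for illustration and do not describe any specific system.

```python
import re


def make_format_rule(pattern):
    """Build a reusable rule that checks a value against a regular expression."""
    compiled = re.compile(pattern)

    def rule(value):
        return compiled.fullmatch(value) is not None

    return rule


def apply_rule(rule, tables, bindings):
    """Apply one rule to every bound (table, column) pair and collect violations."""
    violations = []
    for table, column in bindings:
        for row_index, row in enumerate(tables[table]):
            if not rule(row[column]):
                violations.append((table, column, row_index))
    return violations


# The rule is defined once; the pattern is an assumed phone number format.
phone_rule = make_format_rule(r"\d{3}-\d{4}")

# Hypothetical tables from different modules, each storing phone numbers
# under a different column name.
tables = {
    "customers": [{"phone": "555-0101"}, {"phone": "n/a"}],
    "suppliers": [{"contact_tel": "555-0199"}],
}
bindings = [("customers", "phone"), ("suppliers", "contact_tel")]

result = apply_rule(phone_rule, tables, bindings)
```

Separating the rule definition from the list of columns it is bound to means that covering a further phone number column requires only an additional binding, not a new rule.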