1. Field
The present invention relates to a method, system, and article of manufacture for using a data mining algorithm to generate format rules used to validate data.
2. Description of the Related Art
Data records in a database may be processed by a rule evaluation engine that applies data rules to identify data records having column or field values that deviate from the values expected by the rules. In the current art, the user manually codes data rules by first analyzing the data visually or using a profiling tool to obtain an understanding of the pattern of a well-formed record. Next, the user builds logical expressions that define a set of rules to describe the normal characteristics of records in the set. These rules are then repeatedly executed against data sets to flag records that fail the conditions specified by the data rules and to report on trends in failure rates over time.
A user may use a rule editor user interface to create new data rules or modify existing rules. Rules may be expressed in a rule language, such as BASIC, Structured Query Language (SQL), Prolog, etc. The user may then save rules in a rule repository in the rule language or in a common rule format. The user may then select rules from the rule repository and a data set of records to provide to the rule evaluation engine to execute the selected rules against the selected data records to validate the data, capture the results and display the results to the user.
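The workflow described above can be illustrated with a minimal sketch, offered only as an aid to understanding and not as the implementation of any particular rule evaluation engine. The names rule_repository, evaluate and failure_rate, the two example rules, and the sample records are all hypothetical and are assumed here for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Callable[[Record], bool]

# Hypothetical rule repository: each rule is a named predicate that
# returns True when a record satisfies the rule.
rule_repository: Dict[str, Rule] = {
    "zip_is_five_digits": lambda r: r["zip"].isdigit() and len(r["zip"]) == 5,
    "name_not_empty": lambda r: r["name"].strip() != "",
}

def evaluate(rule_names: List[str], records: List[Record]) -> List[Tuple[int, str]]:
    """Run the selected rules against the records; return (row, rule) failures."""
    failures = []
    for row, record in enumerate(records):
        for name in rule_names:
            if not rule_repository[name](record):
                failures.append((row, name))
    return failures

def failure_rate(failures: List[Tuple[int, str]], records: List[Record]) -> float:
    """Fraction of records that failed at least one rule."""
    failed_rows = {row for row, _ in failures}
    return len(failed_rows) / len(records) if records else 0.0

# Assumed sample data, for illustration only.
records = [
    {"name": "Ada", "zip": "12345"},
    {"name": "", "zip": "1234"},
    {"name": "Bob", "zip": "ABCDE"},
]
failures = evaluate(["zip_is_five_digits", "name_not_empty"], records)
```

Recording the value returned by failure_rate for each run would give the kind of trend in failure rates over time mentioned above.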
Developing data rules can require a significant amount of user time, effort and skill to analyze patterns in data, especially for large data sets having millions of records with hundreds of columns. Further, rules to validate the format of data in data columns may be even more difficult to create because many different formats may be used to record the data, such as different formats for phone numbers. Data quality tools may be used to report the existence and frequency of multiple formats for a given column. However, such tools provide little help in understanding why several formats exist and, for a given row in the data set, which format is the correct one. The data analyst must use the report to decide which formats should be allowed and create the corresponding data rules by hand. Since there may be numerous acceptable formats for data, the resulting validation rule may be too general (e.g. phone matches(999-9999 or 999-999-9999 or 99-99-99-99)), or too restrictive (e.g. phone matches(999-999-9999)), or too complex to build, understand and maintain (e.g. if country=(‘USA’ or ‘US’ or ‘United States’) then phone matches(999-9999 or 999-999-9999 or 9-999-999-9999)).
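The three example rules above can be sketched in code to make the trade-off concrete. The sketch assumes that the character 9 in a format mask stands for any single digit and that every other character is a literal; the helper names mask_to_regex and matches, and the record layout with "country" and "phone" fields, are hypothetical.

```python
import re

def mask_to_regex(mask: str) -> str:
    # Assumption: '9' means "any single digit"; other characters are literals.
    return "".join(r"\d" if ch == "9" else re.escape(ch) for ch in mask)

def matches(value: str, masks: list) -> bool:
    """True when the whole value conforms to at least one format mask."""
    return any(re.fullmatch(mask_to_regex(m), value) for m in masks)

def too_general(rec: dict) -> bool:
    # Accepts any of the listed formats regardless of country.
    return matches(rec["phone"], ["999-9999", "999-999-9999", "99-99-99-99"])

def too_restrictive(rec: dict) -> bool:
    # Accepts a single format only.
    return matches(rec["phone"], ["999-999-9999"])

def country_conditional(rec: dict) -> bool:
    # Applies US formats only to records whose country denotes the US.
    if rec["country"] in ("USA", "US", "United States"):
        return matches(rec["phone"], ["999-9999", "999-999-9999", "9-999-999-9999"])
    return True  # the rule places no constraint on other countries

# Assumed sample record: a US row holding a phone number in a non-US format.
us_record = {"country": "USA", "phone": "01-23-45-67"}
```

For the assumed us_record, too_general accepts the value even though its format is not a US format, too_restrictive would reject a valid seven-digit US number such as 555-1234, and country_conditional rejects the value but must be extended by hand for every additional country and format.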
There is a need in the art to provide improved techniques for generating and using format rules to validate the format of data.