1. Field
The present invention relates to a method, system, and article of manufacture for using a data mining algorithm to generate rules used to validate a selected region of a predicted column.
2. Description of the Related Art
Data records in a database may be processed by a rule evaluation engine that applies data rules to identify data records having column or field values that deviate from the values expected by the rules. In the current art, the user manually codes data rules by first analyzing the data visually or using a profiling tool to obtain an understanding of the pattern of a well-formed record. Next, the user builds logical expressions that define a set of rules describing the normal characteristics of records in the set. These rules are then repeatedly executed against data sets to flag records that fail the conditions specified by the data rules and to report on trends in failure rates over time.
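The manual approach described above can be sketched as a small rule evaluation loop. This is a minimal illustration only: the names `DataRule`, `evaluate_rules`, and `failure_rates`, the two example rules, and the sample records are all hypothetical and do not come from any particular rule engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class DataRule:
    """A hand-coded data rule (hypothetical structure)."""
    name: str
    predicate: Callable[[Record], bool]  # True when the record conforms


def evaluate_rules(records: List[Record], rules: List[DataRule]) -> Dict[str, List[int]]:
    """Return, for each rule name, the indices of the records that fail it."""
    failures: Dict[str, List[int]] = {rule.name: [] for rule in rules}
    for index, record in enumerate(records):
        for rule in rules:
            if not rule.predicate(record):
                failures[rule.name].append(index)
    return failures


def failure_rates(failures: Dict[str, List[int]], total: int) -> Dict[str, float]:
    """Fraction of records failing each rule, usable for trend reporting."""
    return {name: (len(idx) / total if total else 0.0) for name, idx in failures.items()}


# Illustrative rules a user might write after profiling the data.
rules = [
    DataRule("age_in_range", lambda r: 0 <= r["age"] <= 120),
    DataRule("minor_not_employed",
             lambda r: r["age"] >= 16 or r["employment"] != "employed"),
]

# Made-up sample records.
records = [
    {"age": 34, "employment": "employed"},
    {"age": 7, "employment": "employed"},
    {"age": 150, "employment": "retired"},
]

failures = evaluate_rules(records, rules)
rates = failure_rates(failures, len(records))
```

Running the same rules against successive data sets and storing `rates` for each run would give the failure-rate trend mentioned above.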
A user may use a rule editor user interface to create new data rules or modify existing rules. Rules may be expressed in a rule language, such as BASIC, the Structured Query Language (SQL), Prolog, etc. The user may then save rules in a rule repository in the rule language or in a common rule format. The user may then select rules from the rule repository, together with a data set of records, and provide them to the rule evaluation engine, which executes the selected rules against the selected data records to validate the data, captures the results, and displays the results to the user.
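As a rough sketch of this workflow, the example below stores rules as SQL text in a dictionary standing in for the rule repository, then executes a user-selected subset against an in-memory SQLite table. The repository layout, the `customers` table, the rule names, and the sample rows are assumptions made for illustration; each stored query is written to return the identifiers of violating records.

```python
import sqlite3
from typing import Dict, List

# Hypothetical rule repository: rule name -> SQL query selecting violating rows.
RULE_REPOSITORY: Dict[str, str] = {
    "missing_state": "SELECT id FROM customers WHERE state IS NULL OR state = ''",
    "ca_zip_prefix": "SELECT id FROM customers WHERE state = 'CA' AND zip NOT LIKE '9%'",
}


def run_selected_rules(conn: sqlite3.Connection, rule_names: List[str]) -> Dict[str, List[int]]:
    """Execute the selected rules and capture the ids of failing records."""
    results: Dict[str, List[int]] = {}
    for name in rule_names:
        rows = conn.execute(RULE_REPOSITORY[name]).fetchall()
        results[name] = sorted(row[0] for row in rows)
    return results


# Made-up data set of records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "CA", "94105"), (2, "CA", "10001"), (3, "", "73301")],
)

# The user selects rules from the repository and runs them against the data set.
results = run_selected_rules(conn, ["missing_state", "ca_zip_prefix"])
for rule_name, failing_ids in results.items():
    print(f"{rule_name}: {len(failing_ids)} failing record(s) {failing_ids}")
```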
Developing data rules can require a significant amount of user time, effort and skill to analyze patterns in data, especially for large data sets having millions of records with hundreds of columns. Also, one often cannot design data rules for records that have non-repeatable values. If the values in the analyzed columns are unique (as in the case of a phone number, for instance) or have a very high cardinality (as in the case of a ZIP code, for instance), then the values of such columns cannot be predicted from values in other columns. The only ways to detect data errors involving wrong values in subparts of such values are to write all possible rules manually (which can be a very tedious task if the number of necessary rules is high), to use a more complex data flow that validates the whole value against valid values in a look-up database (this could be done to validate a ZIP code, but would be difficult for a phone number or an SSN), or, alternatively, to validate the whole value against a look-up table that records invalid values (for example, 9999999999 is known to be an invalid US phone number).
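The two look-up approaches can be sketched as follows. The function name `passes_lookup_validation` and the tiny sample sets are hypothetical; a real look-up database of valid ZIP codes would be far larger. Only the invalid phone number 9999999999 is taken from the description above.

```python
from typing import Optional, Set


def passes_lookup_validation(
    value: str,
    valid_values: Optional[Set[str]] = None,
    invalid_values: Optional[Set[str]] = None,
) -> bool:
    """Validate a whole value against look-up sets.

    A value fails if it appears in the invalid set, or if a valid set is
    supplied and the value is absent from it.
    """
    if invalid_values is not None and value in invalid_values:
        return False
    if valid_values is not None and value not in valid_values:
        return False
    return True


# Hypothetical sample of a valid-value look-up (stand-in for a ZIP code database).
KNOWN_VALID_ZIPS = {"94105", "10001", "73301"}
# Known-invalid look-up; the entry is the example given in the text.
KNOWN_INVALID_PHONES = {"9999999999"}

zip_ok = passes_lookup_validation("94105", valid_values=KNOWN_VALID_ZIPS)
zip_typo = passes_lookup_validation("94015", valid_values=KNOWN_VALID_ZIPS)
phone_placeholder = passes_lookup_validation(
    "9999999999", invalid_values=KNOWN_INVALID_PHONES
)
# A value that is merely absent from the invalid list still passes, even if
# part of it (here, an assumed-wrong leading block of digits) is erroneous.
phone_unlisted = passes_lookup_validation(
    "0005550123", invalid_values=KNOWN_INVALID_PHONES
)
```

The last call illustrates the limitation noted above: whole-value look-ups cannot detect an error confined to a subpart of a value unless every such value has been enumerated in advance.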
There is a need in the art to provide improved techniques for generating and using rules to validate data in columns.