Data mining refers to an automated process of identifying systematic, generalizable patterns within large volumes of historical data so that the patterns can be applied to new scenarios. As a first step, the business problem is identified and clearly outlined. Then, based on the class of problem defined, suitable data preprocessing steps are identified and applied. Next, appropriate predictive models are applied and data insights are retrieved. Developing data rules can require a significant amount of user time, effort, and skill to analyze patterns in the data, especially when the dataset is very large. Generally, the entire process of discovering patterns from data is cumbersome and time consuming.
A current solution implements a data processing system for directed data analysis. The system receives rules that represent relationships between several elements of the dataset. The system then displays the rules and computes business measures of quality associated with them. The user may change a rule by adding, deleting, or changing its parameters. In addition, a graphical user interface is provided to display the rules and to allow users to manipulate them and perform directed data analysis.
Another solution discloses using a data mining algorithm to generate rules that are validated on a selected region of a predicted column. Multiple rules are generated to associate conditions in the at least one predictor column with subsequences in the selected region. The process qualifies the rules based on the configured minimum support and confidence levels and ignores the rules that do not qualify. The rule repository stores the rules in a common format even though they are generated by multiple algorithms. The rule discovery user interface allows the user to specify one or more parameters to the engine in order to retrieve rules.
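The qualification step described above, filtering rules by minimum support and confidence, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the `Rule` class, the function names, and the representation of the data as sets of items are all hypothetical. Support is taken here as the fraction of records containing both sides of a rule, and confidence as the fraction of records matching the antecedent that also contain the consequent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if `antecedent` items occur, `consequent` items occur."""
    antecedent: frozenset
    consequent: frozenset


def support_and_confidence(rule, transactions):
    """Return (support, confidence) of `rule` over a list of item sets."""
    total = len(transactions)
    both_sides = rule.antecedent | rule.consequent
    n_both = sum(1 for t in transactions if both_sides <= t)
    n_antecedent = sum(1 for t in transactions if rule.antecedent <= t)
    support = n_both / total if total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def qualify_rules(rules, transactions, min_support, min_confidence):
    """Keep rules meeting both thresholds; rules that do not qualify are ignored."""
    kept = []
    for rule in rules:
        support, confidence = support_and_confidence(rule, transactions)
        if support >= min_support and confidence >= min_confidence:
            kept.append((rule, support, confidence))
    return kept


# Small made-up dataset, for illustration only.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a"}),
    frozenset({"c"}),
]
rule_ab = Rule(frozenset({"a"}), frozenset({"b"}))
rule_ca = Rule(frozenset({"c"}), frozenset({"a"}))
qualified = qualify_rules([rule_ab, rule_ca], transactions, 0.4, 0.6)
```

With these made-up records, the rule from "a" to "b" has support 2/4 and confidence 2/3 and is kept, while the rule from "c" to "a" has support 1/4 and is ignored.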
Another solution discloses generation of formatted rules that are used to validate the dataset. The data arrives with several columns, each with a different data type. Although several methods are available that identify the format, they fail to determine whether the format is valid. The solution identifies the format of each data column, marks the formats, and presents them in a user-readable form that is available for further manipulation.
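A minimal sketch of per-column format identification is given below for illustration. It assumes, hypothetically, that each column is a list of strings and that a small fixed set of regular expressions stands in for the candidate formats; the names `FORMATS`, `infer_column_format`, and `describe_formats` are not taken from the solution described above.

```python
import re

# Hypothetical candidate formats; a real system would likely support many more.
FORMATS = {
    "integer": re.compile(r"[+-]?\d+"),
    "decimal": re.compile(r"[+-]?\d+\.\d+"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def infer_column_format(values):
    """Return (format_name, share) for the format matching most non-empty values.

    Falls back to ("text", 0.0) when no candidate matches any value and to
    ("empty", 0.0) when the column has no non-empty values.
    """
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty", 0.0
    best_name, best_share = "text", 0.0
    for name, pattern in FORMATS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        share = matches / len(non_empty)
        if share > best_share:
            best_name, best_share = name, share
    return best_name, best_share


def describe_formats(table):
    """Map each column name to its inferred format and matching share."""
    return {column: infer_column_format(values) for column, values in table.items()}


# Made-up table, for illustration only.
table = {
    "age": ["31", "45", "x"],
    "joined": ["2020-01-05", "2021-11-30"],
    "name": ["Ann", "Bob"],
}
report = describe_formats(table)
```

Matching a pattern only identifies the shape of a value. A string such as "2021-13-45" matches the `iso_date` pattern without being a valid date, which is the gap between identifying a format and validating it noted above.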
Another solution discloses automatic identification of statistically significant patterns from data and initiation of analysis based on the identification. A decision tree approach of various embodiments may provide a reference for further analysis leading to pattern extraction. The current system employs an "N time" rule to cap the number of statistically significant patterns to be extracted. In the current implementation, N can be 10. The algorithm itself places no limit on N; however, the system resources and data size will influence the extraction time. A further approach provides an article of manufacture for managing validation of models and rules to apply to datasets. A schema definition that validates the structure of the data for compatibility is determined, along with the data quality model, at every stage of the data model.
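The cap on the number of extracted patterns can be illustrated with a short sketch. It assumes, hypothetically, that each candidate pattern carries a p-value, that significance means a p-value below a threshold `alpha`, and that the N most significant patterns are the ones retained; the description above does not specify how significance is measured or how the N patterns are chosen.

```python
import heapq


def top_significant(patterns, n=10, alpha=0.05):
    """Return at most `n` patterns with p-value below `alpha`, most significant first.

    `patterns` is an iterable of (name, p_value) pairs (a hypothetical layout).
    """
    significant = [p for p in patterns if p[1] < alpha]
    return heapq.nsmallest(n, significant, key=lambda p: p[1])


# Made-up candidates, for illustration only.
candidates = [("p1", 0.2), ("p2", 0.01), ("p3", 0.04), ("p4", 0.001)]
selected = top_significant(candidates, n=2)
```

Raising `n` does not change the selection logic; it only increases how many patterns are returned, consistent with the observation that the cap reflects resource and time constraints rather than a limit of the algorithm.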
As mentioned above, several approaches use rule engines that read data and generate rules using data mining algorithms; however, these approaches do not interpret and validate the data read or the rules generated. Current systems do not take into account the attribute types of the data, such as whether the data is actionable or whether it belongs to a particular group such as demographics, transactions, and so on.