Data mining is an analytic process designed to explore data, usually large amounts of it, in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new sets of data. Predictive data mining is the most common type of data mining and the one with the most direct applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment, i.e., the application of the model to new data in order to generate predictions.
The initial exploration stage usually starts with data preparation, which may involve cleaning data, transforming data, selecting subsets of records and, in the case of data sets with large numbers of variables (“fields” or “dimensions”), performing preliminary feature selection to bring the number of variables into a manageable range (depending on the statistical methods being considered). Then, depending on the nature of the analytic problem, this first stage may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods, in order to identify the most relevant variables and to determine the complexity and/or general nature of the models that can be considered in the next stage.
The second stage—model building or pattern identification with validation/verification—involves considering various models and choosing the best one based on their predictive performance, i.e., explaining the variability in question and producing stable results across samples. This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal—many of which are based on so-called “competitive evaluation of models”, that is, applying different models to the same data set and then comparing their performance to choose the best model. These techniques—which are often considered the core of predictive data mining—include: bagging (voting, averaging), boosting, stacking (stacked generalizations), and meta-learning.
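The competitive evaluation described above can be illustrated with a minimal sketch: several candidate models are fitted to the same training data, scored on the same held-out data, and the one with the lowest error is selected. The two candidate models, the toy data and all function names below are assumptions made purely for illustration.

```python
from statistics import mean

def fit_constant(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m

def fit_nearest(xs, ys):
    """Second candidate: predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mean_squared_error(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def choose_best(candidates, train, holdout):
    """Fit each candidate on `train`, score it on `holdout`, keep the lowest error."""
    scores = {
        name: mean_squared_error(fit(*train), *holdout)
        for name, fit in candidates.items()
    }
    return min(scores, key=scores.get), scores

# Illustrative data only: y is roughly 2x + 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 7.0, 8.8, 11.1, 13.0, 14.9]
train = (xs[::2], ys[::2])       # every other point is used for fitting
holdout = (xs[1::2], ys[1::2])   # the remaining points are used for scoring

best, scores = choose_best(
    {"constant": fit_constant, "nearest": fit_nearest}, train, holdout
)
print(best, scores)
```

Scoring on data that was not used for fitting is what makes the comparison a check of stability across samples rather than of fit to the training set alone.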
The third stage—deployment—involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
Well-known data mining categories include cluster analysis, regression (both linear and non-linear), classification, association rule analysis, and time series analysis.
Clustering may be defined as the task of discovering groups and structures in the data whose members are in some way or another “similar”, without using known structures in the data.
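As a minimal sketch of clustering, the following one-dimensional k-means groups values purely by their mutual similarity, without any labels. The choice of k-means, the starting centers and the sample values are illustrative assumptions; many other clustering algorithms exist.

```python
def assign(values, centers):
    """Put each value into the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups

def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values and moving centers to group means."""
    centers = list(centers)
    for _ in range(iterations):
        groups = assign(values, centers)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, assign(values, centers)

values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = k_means_1d(values, centers=[0.0, 5.0])
print(centers, groups)
```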
Classification may be defined as the task of generalizing a known structure to be applied to new data. For example, an email program may attempt to classify incoming email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbour, Naive Bayesian classification and neural networks.
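The email example can be sketched with a nearest-neighbour classifier, one of the algorithms named above. The two numeric features (here imagined as a count of links and a count of exclamation marks per message) and the training examples are hypothetical and chosen only to make the sketch runnable.

```python
from collections import Counter
from math import dist

def knn_classify(labelled, query, k=3):
    """Label `query` by majority vote among its k nearest labelled examples."""
    nearest = sorted(labelled, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: (number of links, number of exclamation marks).
training = [
    ((0, 0), "legitimate"),
    ((1, 0), "legitimate"),
    ((0, 1), "legitimate"),
    ((8, 5), "spam"),
    ((7, 6), "spam"),
    ((9, 4), "spam"),
]

print(knn_classify(training, (8, 6)))
print(knn_classify(training, (1, 1)))
```

The known structure (the labels of past emails) is generalized to a new message by looking at the labelled messages most similar to it.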
Regression analysis attempts to find a function which models the data with the least error.
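For the simplest case, a straight line fitted by ordinary least squares, "least error" can be made concrete as the smallest sum of squared differences between predicted and observed values. The sketch below assumes that error measure and uses made-up data; other function families and error measures are equally possible.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_error(intercept, slope, xs, ys):
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
intercept, slope = least_squares_line(xs, ys)
fitted_error = sum_squared_error(intercept, slope, xs, ys)

# Any other line gives a larger error on the same data.
shifted_error = sum_squared_error(intercept + 0.1, slope, xs, ys)
print(intercept, slope, fitted_error, shifted_error)
```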
Association rule learning searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
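A minimal sketch of market basket analysis is given below. It counts how often pairs of products occur together (support) and how often the second product is bought given the first (confidence). The baskets, the threshold and the restriction to pairs are illustrative assumptions; practical algorithms such as Apriori also handle larger item sets.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # One rule per direction; confidence depends on the antecedent.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Made-up purchase data.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]

for rule in pair_rules(baskets):
    print(rule)
```

In this toy data only bread and butter appear together in at least half of the baskets, so only that pair yields rules.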
Data mining is becoming increasingly popular as an information management tool, where it is expected to reveal knowledge structures that can guide decisions under conditions of limited certainty. Manual techniques cannot achieve this because of the large number of data points involved.
However, in order to use data mining techniques effectively, a comparison of data mining models may be required to obtain the best possible result from existing data.
There are different scenarios in which a comparison of data mining models may be useful. Many application scenarios involve not a single data mining model but multiple, related ones. Typical examples are data mining models derived at different points in time or from different subsets of the data, e.g., production quality data from different production sites. Another common case is representing the same data with data mining models of different types in order to capture different aspects of the data. In all these cases, not only the individual data mining models are of interest, but also the similarities and differences between them. Such differences may reveal, for instance, how production quality and dependencies develop over time, how data mining models of different types differ in their ways of representing different products produced at the same facility, or how the production facilities differ from each other.
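To make the idea of similarities and differences between models of different types concrete, the sketch below applies two structurally unrelated models to the same records and reports how often their outputs agree. This is only an illustration of the scenario and is not the comparison method discussed in this document; both models, their parameters and the production records are hypothetical.

```python
def rule_model(record):
    """Hypothetical rule-based model: parts produced too hot are defective."""
    return "defect" if record["temperature"] > 80 else "ok"

def centroid_model(record):
    """Hypothetical nearest-centroid model over (temperature, pressure)."""
    centroids = {"ok": (72.0, 1.0), "defect": (92.0, 1.6)}
    point = (record["temperature"], record["pressure"])
    return min(
        centroids,
        key=lambda label: sum((p - c) ** 2 for p, c in zip(point, centroids[label])),
    )

def compare(model_a, model_b, records):
    """Return the share of records with equal outputs and the disagreeing records."""
    disagreements = [r for r in records if model_a(r) != model_b(r)]
    agreement = (len(records) - len(disagreements)) / len(records)
    return agreement, disagreements

# Made-up production quality records.
records = [
    {"temperature": 65, "pressure": 1.0},
    {"temperature": 75, "pressure": 1.1},
    {"temperature": 81, "pressure": 1.2},
    {"temperature": 85, "pressure": 1.5},
    {"temperature": 95, "pressure": 1.7},
]

agreement, disagreements = compare(rule_model, centroid_model, records)
print(agreement, disagreements)
```

The records on which the two models disagree point to the region of the data where the models represent the production process differently.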
Comparing data mining models manually may be very costly and error-prone, and may not be feasible at all, depending on the amount of available data. Although it is extremely important, automatic comparison of data mining models has not yet been widely adopted in practice, essentially for two reasons: (a) Existing approaches allow comparing only models of the same, pre-defined pattern type and thus lack generality, making it impossible to apply them to most other pattern types. (b) They are based on the structure of the data mining models and are thus severely limited in their expressiveness, which leads to complex results that are often very hard to interpret.
Document U.S. Pat. No. 7,636,698 discloses a method of generating a data pattern (or data mining model) from a dataset based on a comparison of two classification data mining models. Disclosed is an architecture for analyzing pattern shifts in data patterns of data mining models and outputting the results. This allows comparing and describing differences between two semantically similar classification patterns (or classification mining models) and analyzing historical changes in versions of the same classification model or differences in patterns found by two or more classification algorithms applied to the same data.
Thus, there may be a need for an improved method and an engine for comparing data mining models, in particular for the case in which the data mining models do not belong to the same category of data mining models.