Data mining refers in general to data-driven approaches for extracting information from input data. Other approaches for extracting information from input data are typically hypothesis-driven: a set of hypotheses is proven true or false in view of the input data.
The amount of input data may be huge, and therefore data mining techniques typically need to consider how to process large amounts of data efficiently. Consider the manufacturing of products as an example. There, the input data may include various pieces of data relating to the origin and features of components, the processing of the components in a manufacturing plant, and how the components have been assembled together. The aim of data mining in the context of manufacturing may be to resolve problems relating to quality analysis and quality assurance. Data mining may be used, for example, for root cause analysis, for early warning systems within the manufacturing plant, and for reducing warranty claims. As a second example, consider various information technology systems. There, data mining may be used for intrusion detection, system monitoring and problem analysis. Data mining also has various other uses, for example, in retail and services, where typical customer behaviour can be analysed, and in medicine and life sciences for finding causal relations in clinical studies.
Pattern detection is a data mining discipline. The input data can consist of sets of transactions where each transaction contains a set of items. The transactions may additionally be ordered. The ordering may be based on time, but alternatively any ordering can be defined. For example, each transaction may have been given a sequence number. For transactional data, association rules are patterns describing how items occur within transactions.
Consider a set of items I={I1, I2, ..., Im}. Let D be a set of transactions, where each transaction T is a set of items belonging to I. A transaction T thus contains a set A of items in I if A⊆T. An association rule is an implication of the form A=>B, where A⊂I, B⊂I, and A∩B=∅; A is called the body and B the head of the rule. The association rule A=>B holds true in the transaction set D with a confidence c, if c % of the transactions in D that contain A also contain B. In other words, the confidence c is the conditional probability p(B|A), where p(S) is the probability of finding S as a subset of a transaction T in D. The rule A=>B has support s in the transaction set D, when s % of the transactions in D contain A∪B. In other words, the support s is the probability that all items of set A and of set B occur together in a transaction. The lift of a rule is the quotient of the rule confidence and the expected confidence. The expected confidence of a rule is the confidence under the assumption that the occurrences of the rule head and rule body items in the transactions are statistically independent of each other; it is equal to the support of the rule head. The lift expresses the degree of “attraction” between the items in the rule body and head. A lift value greater than 1 means that the items attract each other, whereas a value less than 1 is an indicator of repulsion.
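The definitions of support, confidence and lift given above can be illustrated with a minimal sketch. The transaction set and item names below are made up for illustration, and the function names are assumptions introduced here, not part of any particular data mining product. Exact fractions are used so that the values can be checked by hand.

```python
from fractions import Fraction

# Toy transaction set D; the item names are hypothetical.
D = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(body, head, transactions):
    """Estimate of p(head | body): support(body ∪ head) / support(body)."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)

def lift(body, head, transactions):
    """Confidence divided by the expected confidence (the support of the head)."""
    return confidence(body, head, transactions) / support(head, transactions)

body, head = {"bread"}, {"butter"}
s = support(body | head, D)   # 3 of 5 transactions contain bread and butter
c = confidence(body, head, D) # 3 of the 4 bread transactions also contain butter
l = lift(body, head, D)       # (3/4) / (3/5)
print(s, c, l)                # 3/5 3/4 5/4
```

In this toy example the lift of 5/4 is greater than 1, so bread and butter would be said to attract each other.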
The aim in association rule mining is to accurately find all rules meeting user-defined criteria. The user may define a minimum support or a minimum confidence for the rules, as very rare or loosely correlated events may not be of importance for some applications. The user may also be interested only in particular items and may want to search only for patterns containing at least one of these interesting items.
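As a hedged sketch of such user-defined criteria, the following filters an already computed list of rules by minimum support, minimum confidence and an optional set of interesting items. The Rule class, the select_rules function and the rule values are assumptions made for this illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """Hypothetical container for a rule body => head with its measures."""
    body: frozenset
    head: frozenset
    support: float
    confidence: float

def select_rules(rules, min_support=0.0, min_confidence=0.0, items_of_interest=None):
    """Keep the rules that meet the thresholds and, if a set of interesting
    items is given, contain at least one of them in the body or the head."""
    selected = []
    for rule in rules:
        if rule.support < min_support or rule.confidence < min_confidence:
            continue
        if items_of_interest is not None:
            if not (rule.body | rule.head) & set(items_of_interest):
                continue
        selected.append(rule)
    return selected

# Illustrative rules; the numbers are made up.
rules = [
    Rule(frozenset({"bread"}), frozenset({"butter"}), 0.6, 0.75),
    Rule(frozenset({"milk"}), frozenset({"jam"}), 0.2, 0.5),
    Rule(frozenset({"jam"}), frozenset({"bread"}), 0.4, 2 / 3),
]

strong = select_rules(rules, min_support=0.3, min_confidence=0.7)
about_jam = select_rules(rules, min_support=0.3, items_of_interest={"jam"})
```

Here strong keeps only the bread => butter rule, while about_jam keeps only the jam => bread rule, because the milk => jam rule falls below the minimum support.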
The known data mining algorithms have drawbacks in certain situations. Depending on the amount of input data, which in some circumstances ranges from hundreds of millions up to billions of records, and on the size of the candidate pattern space, a breadth-first search may be slow, since many scans of the original data source are needed and each candidate pattern needs to be evaluated against all transactions. A depth-first search, on the other hand, may run out of memory for large amounts of input data or, because of the large number of evaluations against the input data, may be slow when the input data is swapped to disk. Additionally, these data mining algorithms are based on an item hierarchy. Since such an item hierarchy is seldom available, it has to be determined first. Such determinations may be flawed and can therefore discredit the results of the algorithm.
Finding a classification model for predicting categorical “classification” values is another important data mining problem. Examples of this include predicting whether a customer will move to a competitor, i.e. “churn prediction”, whether a customer would respond to a marketing campaign, whether a product like a car will be delivered on time, too late or too early, or whether a product like a computer chip is faulty. For building such a model one starts with historical data, i.e., cases with known classification values, for instance the churn and non-churn cases of the last 12 months, the results of a test marketing campaign or production data with delivery time values. These historical data can be collected in a data table containing one row for each entity, like a customer or a product, and having one column for the classification values and further columns for other characteristics of the entities.
The task of a classification algorithm is to derive the classification value, i.e. the value of the “dependent variable”, from the values of these other columns, i.e. the “independent variables”. This process is often called the training of a classification model. For churn prediction and for predicting whether a customer responds to a marketing campaign, the historical data may include, besides demographic data about a customer, like age, marital status or domicile, information about his or her behavior as a client. For predicting product delivery delays, information about the products, like specific features, and details about the production process can be included.
Once such a classification model has been trained and its quality is good enough, which can be determined by using a subset of the historical data that has not been used for training the model, it can be used for predicting future cases. For these data only the values of the independent variables are known, but not those of the class label. The “predicted” values are determined by applying the classification model to these data. This step is also called the “scoring” of a model. For churn prediction one determines in this way the customers who are likely to churn in the near future, for a marketing campaign one determines the potential responders, and for product delivery one determines a better estimate for the delivery date.
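The sequence of training, checking quality on held-out historical data and scoring new cases can be sketched as follows. The “model” here is deliberately trivial (the majority class per value of a single independent variable) and merely stands in for a real classification algorithm; the data, the 75 % split and all function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train(rows):
    """Toy stand-in for a classifier: majority class per independent value."""
    votes = defaultdict(Counter)
    for x, label in rows:
        votes[x][label] += 1
    table = {x: counts.most_common(1)[0][0] for x, counts in votes.items()}
    # Fallback for values never seen in training: overall majority class.
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return table, default

def score(model, xs):
    """Apply the model to cases whose class label is unknown."""
    table, default = model
    return [table.get(x, default) for x in xs]

def accuracy(model, rows):
    """Share of held-out historical cases that the model predicts correctly."""
    predictions = score(model, [x for x, _ in rows])
    correct = sum(1 for p, (_, label) in zip(predictions, rows) if p == label)
    return correct / len(rows)

# Hypothetical historical data: (contract type, known classification value).
historical = [("monthly", "churn")] * 20 + [("yearly", "stay")] * 20
random.Random(0).shuffle(historical)

split = int(0.75 * len(historical))
training, holdout = historical[:split], historical[split:]

model = train(training)
holdout_accuracy = accuracy(model, holdout)   # quality check on unseen history
predicted = score(model, ["yearly", "monthly"])  # scoring of new cases
```

Because the toy data is perfectly separable, the holdout accuracy is 1.0 here; real historical data would rarely behave this way.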
Most classification algorithms require that the input table for training a model contains one row per entity. However, available data tables with historic information may contain more than one row per entity, which makes it necessary to pre-process and transform the input data to fulfill this requirement.
This is the case when a part of the information about the entities is included in transactions. Tables with transactions have at least two columns, one for the id of the entity and an “item” column with categorical values. For sales transaction data containing the information about which articles have been purchased by which customers, the customer id would correspond to the entity id and the item column would contain the ids of the purchased articles. Such a table may contain additional columns with useful information. For sales transaction data this can be the purchase date or the price and the quantity of the articles. The mapping from customer to classification value may be defined in a separate table. Additional information besides that included in the transactions, like demographic information for customers or specific features for products, may be available as well. As this, however, is not relevant for this invention, it is assumed that only a set of transactions and the mapping from entity to classification value are available.
One approach to solving this problem is to create a new table from the transactions table which contains a column for the entity id and one column for each possible categorical value of the item column. For a given entity and a categorical value, the value of the corresponding column may be 1 if the transactional data contains such a record and 0 if this is not the case. For such a table the number of columns will be 1 + the number of distinct categorical values of the item column. This approach works well for a low number of distinct categorical values. However, for domains like manufacturing with hundreds of possible product features and production steps, or retail with even thousands of different items sold in a supermarket, this approach becomes inefficient if not infeasible.
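The table transformation described above can be sketched as follows, assuming the transactions are given as (entity id, item) pairs. The function name indicator_table and the sample data are hypothetical.

```python
def indicator_table(transactions):
    """Return (header, rows): one row per entity and one 0/1 column for each
    distinct categorical value of the item column."""
    items = sorted({item for _, item in transactions})
    items_by_entity = {}
    for entity, item in transactions:
        items_by_entity.setdefault(entity, set()).add(item)
    header = ["entity_id"] + items
    rows = [
        [entity] + [1 if item in items_by_entity[entity] else 0 for item in items]
        for entity in sorted(items_by_entity)
    ]
    return header, rows

# Hypothetical sales transactions: (customer id, purchased article).
transactions = [
    ("c1", "bread"), ("c1", "milk"),
    ("c2", "bread"),
    ("c3", "jam"), ("c3", "milk"), ("c3", "milk"),
]

header, rows = indicator_table(transactions)
# header: ['entity_id', 'bread', 'jam', 'milk']
# rows:   [['c1', 1, 0, 1], ['c2', 1, 0, 0], ['c3', 0, 1, 1]]
```

The header has 1 + 3 columns for the three distinct items; with thousands of distinct items the table would grow by one column per item, which is the scaling problem noted above.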
In this situation a hierarchy or taxonomy over the categorical values of the item column can help by creating columns only for higher concepts in the taxonomy. The value of the corresponding column for an entity can be the number of associated categorical values in the item column which belong to that higher concept.
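A minimal sketch of this taxonomy-based variant follows. It assumes a single-level taxonomy given as a mapping from item to higher concept, and it interprets “number of associated categorical values” as the number of distinct items per entity; both are assumptions made for illustration, as are the names and data.

```python
from collections import Counter

def concept_count_table(transactions, taxonomy):
    """Return (header, rows): one row per entity and one column per higher
    concept, holding the number of distinct items of the entity that belong
    to that concept."""
    concepts = sorted(set(taxonomy.values()))
    items_by_entity = {}
    for entity, item in transactions:
        items_by_entity.setdefault(entity, set()).add(item)
    header = ["entity_id"] + concepts
    rows = []
    for entity in sorted(items_by_entity):
        counts = Counter(taxonomy[item] for item in items_by_entity[entity])
        rows.append([entity] + [counts[concept] for concept in concepts])
    return header, rows

# Hypothetical single-level taxonomy: item -> higher concept.
taxonomy = {
    "bread": "bakery", "rolls": "bakery",
    "milk": "dairy", "butter": "dairy",
    "jam": "spreads",
}
transactions = [
    ("c1", "bread"), ("c1", "rolls"), ("c1", "milk"),
    ("c2", "butter"), ("c2", "milk"), ("c2", "milk"),
    ("c3", "jam"),
]

header, rows = concept_count_table(transactions, taxonomy)
# header: ['entity_id', 'bakery', 'dairy', 'spreads']
# rows:   [['c1', 2, 1, 0], ['c2', 0, 2, 0], ['c3', 0, 0, 1]]
```

Five distinct items are reduced to three concept columns; any distinction between items of the same concept is lost in the resulting table.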
However, if such a hierarchy is missing or the hierarchy does not reflect the appropriate partitioning with respect to the classification problem, the result will be a classification model of poor quality. The latter may happen, for instance, if quality problems are caused by specific combinations of features that belong to different categories. It may happen as well if a marketing campaign promotes organic food products but the product hierarchy does not reflect this characterization of the products.
There is thus a need for an efficient method for determining patterns in input data that overcomes at least some of the problems mentioned above in connection with known data mining techniques. In particular, there exists a need for a classification approach that does not depend on an item hierarchy and that can be used with standard classification models. In addition, the new model should be more efficient in terms of processing speed, memory consumption and necessary computing resources.