Data mining can extract useful information from a dataset and has a wide range of applications: fraud detection, business intelligence, medical diagnosis, target marketing and the like. Classification in data mining involves the construction of a model to represent the distribution of a class attribute across transactions in a dataset given a set of predictor attributes in the transactions. For example, data mining could generate a classification model from a dataset of medical records of smokers that represents the distribution of lung diseases (e.g., asthma, emphysema, lung cancer) given certain patient traits (e.g., age, gender, weight, years spent smoking).
Decision trees are one type of classification model and are sometimes favored for use in a data mining environment because they represent information in a manner that is easy to understand. The intuitive representation of classification data offered by decision trees makes them an important analysis tool for understanding relationships between attributes.
Presently, most decision tree construction algorithms build decision trees from a dataset in a top-down manner, recursively partitioning the dataset until a high proportion of each partition's transactions belong to one value of the class attribute. These algorithms (e.g., ID3, CART, CHAID) may not generate optimally accurate decision trees because such top-down algorithms assume independence between predictor attributes.
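The recursive partitioning described above can be illustrated with a minimal Python sketch. It is not a rendering of ID3, CART or CHAID specifically. The entropy-based choice of split attribute, the `purity` threshold, the function names and the toy records are all assumptions introduced for illustration.

```python
from collections import Counter
import math


def entropy(rows, target):
    """Shannon entropy of the class attribute within a partition."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def partition(rows, attr):
    """Group transactions by the value they hold for one attribute."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return groups


def build_tree(rows, predictors, target, purity=0.9):
    """Split until `purity` of a partition shares one class value.

    The purity threshold and the entropy criterion are illustrative
    assumptions, not details taken from the text.
    """
    counts = Counter(r[target] for r in rows)
    majority, majority_count = counts.most_common(1)[0]
    if majority_count / len(rows) >= purity or not predictors:
        return majority  # leaf: predict the dominant class value

    def weighted_entropy(attr):
        groups = partition(rows, attr).values()
        return sum(len(g) / len(rows) * entropy(g, target) for g in groups)

    # Greedy step: pick the attribute whose split leaves the least entropy.
    best = min(predictors, key=weighted_entropy)
    rest = [a for a in predictors if a != best]
    return {best: {value: build_tree(group, rest, target, purity)
                   for value, group in partition(rows, best).items()}}


# Invented toy records, for illustration only.
records = [
    {"years_smoking": "long", "age": "old", "disease": "emphysema"},
    {"years_smoking": "long", "age": "young", "disease": "emphysema"},
    {"years_smoking": "short", "age": "old", "disease": "asthma"},
    {"years_smoking": "short", "age": "young", "disease": "asthma"},
]
tree = build_tree(records, ["years_smoking", "age"], "disease")
```

Each split in this sketch is chosen by scoring one attribute at a time, so combinations of attributes are never evaluated jointly.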
Another classification approach is to classify the dataset based on association rules. Association rules are statistical relationships between sets of items (the various values that attributes in a transaction may assume) in a dataset. The support of an association rule is the probability that a transaction in the dataset contains all items in the rule. An association-rule-based classification approach mines all itemsets that satisfy a minimum support and then builds a classification model from these frequent predictive itemsets.
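As a rough illustration of support and minimum-support mining, the following Python sketch counts how often an itemset occurs across transactions and enumerates itemsets by brute force. Representing an item as an (attribute, value) pair, the function names, the `max_size` cap and the toy transactions are assumptions. A practical miner would prune candidates rather than enumerate every combination.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force enumeration of itemsets meeting `min_support`.

    `max_size` is an illustrative cap that keeps the search small.
    """
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                found[frozenset(combo)] = s
    return found


# Invented toy transactions; each item is an (attribute, value) pair.
transactions = [
    frozenset({("age", "old"), ("smoker", "yes"), ("disease", "emphysema")}),
    frozenset({("age", "old"), ("smoker", "yes"), ("disease", "emphysema")}),
    frozenset({("age", "young"), ("smoker", "yes"), ("disease", "asthma")}),
    frozenset({("age", "young"), ("smoker", "no"), ("disease", "asthma")}),
]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

In this toy data the itemset {("smoker", "yes"), ("disease", "emphysema")} appears in two of four transactions, so its support is 0.5 and it meets a minimum support of 0.5.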
Association-rule-based classification approaches possess undesirable traits despite their greater accuracy, where accuracy is the ability of a classification model to correctly predict the value of a class attribute for a transaction whose class attribute value is unknown. Association-rule-based classification approaches such as CBA, CMAR, CPAR and Large Itemsets are lazy-learning classifiers; they do not generate a general classification model but instead build query-specific models at run-time. The ACME association-rule-based classification algorithm generates a classification model that is completed in the learning phase, but the model is not as easy to understand as a decision tree. In an ACME-based classification, a transaction traverses multiple paths and reaches multiple leaves. For each class attribute value, the floating-point values of the nodes visited by the transaction are multiplied together, and the class attribute value with the maximum product is chosen as the final result.
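The final scoring step described for an ACME-based classification can be sketched in Python as follows. This covers only the multiply-and-pick-maximum step. How the node values are learned and how a transaction selects its paths are not described here, so the `visited_nodes` structure, the function name and the numeric values are hypothetical.

```python
import math


def pick_class(visited_nodes, class_values):
    """Multiply per-class node values and return the class with the largest product.

    `visited_nodes` is a hypothetical representation: one dict per node
    visited by the transaction, mapping class value -> floating-point value.
    """
    products = {
        c: math.prod(node[c] for node in visited_nodes)
        for c in class_values
    }
    best = max(products, key=products.get)
    return best, products


# Invented values for three visited nodes, for illustration only.
visited = [
    {"asthma": 0.5, "emphysema": 2.0},
    {"asthma": 1.5, "emphysema": 1.0},
    {"asthma": 1.0, "emphysema": 0.5},
]
best, products = pick_class(visited, ["asthma", "emphysema"])
```

With these invented values the products are 0.75 for asthma and 1.0 for emphysema, so emphysema is chosen. The prediction depends on every visited node at once, which illustrates why such a model is harder to read than a single root-to-leaf path in a decision tree.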
Although such techniques can be useful, there remains a need for better data mining techniques.