1. Field of the Invention
This invention relates in general to computer-implemented data mining, and in particular to dimension reduction for data mining applications.
2. Description of Related Art
Data mining is the process of finding interesting patterns in data. Data mining often involves datasets with a large number of attributes. Many of the attributes in most real-world data are redundant and/or simply irrelevant to the purpose of discovering interesting patterns.
Dimension reduction selects relevant attributes in the dataset prior to performing data mining. This is important for the accuracy of further analysis as well as for performance. Because redundant and irrelevant attributes can mislead the analysis, including all of the attributes in the data mining procedures not only increases the complexity of the analysis, but also degrades the accuracy of the result. For instance, clustering techniques, which partition entities into groups with a maximum level of homogeneity within a cluster, may produce inaccurate results when applied to data in a higher dimensional space that includes irrelevant attributes, because the clusters might not be strong when the population is spread over the irrelevant dimensions. Dimension reduction improves the performance of data mining techniques by reducing dimensions so that data mining procedures process data with a reduced number of attributes. With dimension reduction, improvements of orders of magnitude are possible.
The conventional dimension reduction techniques are not easily applied to data mining applications directly (i.e., in a manner that enables automatic reduction) because they often require a priori domain knowledge and/or arcane analysis methodologies that are not well understood by end users. Typically, it is necessary to incur the expense of a domain expert with knowledge of the data in a database who determines which attributes are important for data mining. Moreover, conventional dimension reduction techniques are not designed for processing the large datasets that data mining processes.
Some statistical analysis techniques, such as correlation tests, have been applied for dimension reduction. However, these are ad hoc and assume a priori knowledge of the dataset, which cannot be assumed to always be available.
Some automatic procedures have been proposed for dimension reduction for exploratory analysis of multivariate datasets. The Principal Components Analysis technique reduces dimensions based on the proportion of total variance of each attribute. Attributes with a higher proportion of total variance are selected as principal components. Projection pursuit, one technique of automatic selection, was proposed in J. H. Friedman and J. W. Tukey, A Projection Pursuit Algorithm for Exploratory Data Analysis, IEEE Transactions on Computers, 1974, Vol. 23, pp. 881-889, which is incorporated by reference herein. Projection pursuit reveals structure in the original data by offering selected low-dimensional orthogonal projections of the data for inspection. Projection pursuit makes automatic selections by local optimization, over projection directions, of an index of interestingness.
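As an illustration of ranking by proportion of total variance, the following minimal sketch computes the share of variance carried by each principal component of a two-attribute dataset. It assumes the standard formulation, in which each component is a linear combination of the original attributes obtained from the eigenvalues of the covariance matrix. The function name `explained_variance_ratio_2d` and the sample data are hypothetical and are not taken from the cited references.

```python
import math


def explained_variance_ratio_2d(rows):
    """Return the share of total variance carried by each principal
    component of a two-attribute dataset, largest first.

    Assumes at least two rows and a nonzero total variance.
    """
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + radius, mid - radius]
    total = a + c  # trace of the covariance matrix = total variance
    return [ev / total for ev in eigenvalues]


# Hypothetical data in which the second attribute is roughly twice the first.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
ratios = explained_variance_ratio_2d(data)
```

Because the two attributes in this sample are almost perfectly correlated, nearly all of the variance falls on the first component, which is the situation in which one dimension can be dropped with little loss.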
However, these techniques are not practically applicable for data mining problems that deal with very large datasets, commonly with millions to billions of records and a large number of attributes. Instead, these techniques are mainly designed for small datasets of hundreds to thousands of records, typically having fewer than ten attributes.
Some recent work in data mining research was designed for the discovery of association rules for large datasets. A rule is a grouping of attribute-value pairs. Houtsma and Swami developed a set-oriented association rule discovery technique, SETM, which is described in M. Houtsma and A. Swami, Set-Oriented Mining for Association Rules in Relational Databases, Research Report RJ 9567, October 1993, IBM Almaden Research Center, [hereinafter "Houtsma and Swami"], which is incorporated by reference herein. Houtsma and Swami showed that association rule discovery can be carried out by a general relational query language, such as SQL. The set-oriented nature of SETM simplifies the technique and facilitates extensions such as parallelization.
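The idea that association counting can be expressed in a relational query language can be sketched as follows. This is not the SETM technique as published; it only illustrates, under assumed names, how co-occurring item pairs and their support counts can be obtained with a single self-join in SQL. The table `sales`, its columns, the sample rows, and the threshold `MIN_SUPPORT` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id INTEGER, item TEXT)")
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
    (4, "bread"), (4, "milk"),
]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

MIN_SUPPORT = 2  # hypothetical threshold: minimum number of transactions

# The self-join pairs up items from the same transaction; a.item < b.item
# keeps each unordered pair once. GROUP BY counts supporting transactions,
# and HAVING discards pairs below the support threshold.
pair_counts = conn.execute(
    """
    SELECT a.item, b.item, COUNT(*) AS support
    FROM sales AS a
    JOIN sales AS b
      ON a.trans_id = b.trans_id AND a.item < b.item
    GROUP BY a.item, b.item
    HAVING COUNT(*) >= ?
    ORDER BY COUNT(*) DESC, a.item, b.item
    """,
    (MIN_SUPPORT,),
).fetchall()
conn.close()
```

Since the whole computation is a join followed by a grouped count, the database engine is free to choose its own join and sort strategy, which is one reason a set-oriented formulation lends itself to extensions such as parallelization.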
The Apriori and AprioriTid techniques are described in R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, Chap. 12, AAAI/MIT Press, 1995, which is incorporated by reference herein. Apriori and AprioriTid offer improved performance by reducing the number of association rules with small support that are generated in each pass. Moreover, AprioriTid makes use of the rules generated in the previous pass instead of accessing the whole database again for the next pass.
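The pruning idea behind Apriori can be sketched as follows: a candidate of size k is counted only if every one of its subsets of size k-1 met the support threshold in the previous pass. This is a simplified illustration rather than the implementation in the cited reference. It uses a naive join step, covers only frequent itemset counting (not rule generation), and does not implement AprioriTid. The function name `apriori` and the sample baskets are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step (naive): union pairs of frequent (k-1)-itemsets
        # and keep the unions that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
frequent_itemsets = apriori(baskets, min_support=2)
```

In this sample the three-item candidate is never counted, because one of its two-item subsets (bread and eggs) appears in only one basket and was discarded in the previous pass.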
There is a need in the art for improved dimension reduction for use in data mining with large datasets.