1. Field of the Invention
The present invention relates to the field of computing. More particularly, the present invention relates to an approach for organizing data within a dataset for data mining.
2. Description of the Related Art
Clustering is a descriptive task associated with data mining that identifies homogeneous groups of objects in a dataset. Clustering techniques have been studied extensively in statistics, pattern recognition, and machine learning. Examples of clustering applications include customer segmentation for database marketing, identification of sub-categories of spectra from the database of infra-red sky measurements, and identification of areas of similar land use in an earth observation database.
Clustering techniques can be broadly classified into partitional techniques and hierarchical techniques. Partitional clustering partitions a set of objects into K clusters such that the objects in each cluster are more similar to each other than to objects in different clusters. For partitional clustering, the value of K can be specified by a user, and a clustering criterion must be adopted, such as a mean square error criterion, like that disclosed by P. H. Sneath et al., Numerical Taxonomy, Freeman, 1973. Popular K-means methods, such as the FastClust in SAS Manual, 1995, from the SAS Institute, iteratively determine K representatives that minimize the clustering criterion and assign each object to the cluster whose representative is closest to the object. Enhancements to the partitional clustering approach for working on large databases have been developed, such as CLARANS, as disclosed by R. T. Ng et al., Efficient and effective clustering methods for spatial data mining, Proc. of the VLDB Conference, Santiago, Chile, September 1994; Focussed CLARANS, as disclosed by M. Ester et al., A database interface for clustering in large spatial databases, Proc. of the 1st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995; and BIRCH, as disclosed by T. Zhang et al., BIRCH: An efficient data clustering method for very large databases, Proc. of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.
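The iterative determination of K representatives described above can be illustrated with a minimal sketch. It is not the FastClust procedure or any of the cited methods; it is a plain Lloyd-style iteration, and the function names, the fixed iteration count, and the random initialization are assumptions made only for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd-style K-means on tuples of floats; returns (representatives, labels)."""
    rng = random.Random(seed)
    reps = rng.sample(points, k)  # assumed initialization: k input points chosen at random
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each object joins the cluster whose representative is closest to it.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, reps[c])),
            )
        # Update step: each representative becomes the mean of the objects assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                reps[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return reps, labels

def mean_square_error(points, reps, labels):
    """Mean squared distance from each object to the representative of its cluster."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, reps[lab]))
        for p, lab in zip(points, labels)
    )
    return total / len(points)
```

A call such as kmeans(points, 2) returns the two representatives and one cluster label per object; mean_square_error evaluates the clustering criterion for that result.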
Hierarchical clustering is a nested sequence of partitions. Agglomerative hierarchical clustering starts by placing each object in its own atomic cluster and then merges the atomic clusters into larger and larger clusters until all objects are in a single cluster. Divisive hierarchical clustering reverses the process by starting with all objects in a single cluster and subdividing it into smaller pieces. For theoretical and empirical comparisons of hierarchical clustering techniques, see, for example, A. K. Jain et al., Algorithms for Clustering Data, Prentice Hall, 1988; P. Mangiameli et al., Comparison of some neural network and hierarchical clustering methods, European Journal of Operational Research, 93(2):402-417, September 1996; P. Michaud, Four clustering techniques, FGCS Journal, Special Issue on Data Mining, 1997; and M. Zait et al., A Comparative study of clustering methods, FGCS Journal, Special Issue on Data Mining, 1997.
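The agglomerative process can be sketched as follows. The single-linkage merge rule (merging the two clusters whose closest members are nearest) and the function name are assumptions chosen for brevity; the cited references compare several other merge rules.

```python
def agglomerate(points, dist):
    """Single-linkage agglomerative clustering; returns the nested sequence of partitions."""
    clusters = [[p] for p in points]  # each object starts in its own atomic cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest to each other.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(
                dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
        history.append([list(c) for c in clusters])
    return history
```

Each entry of the returned list is one partition, from the atomic clusters down to the single cluster holding all objects. A divisive technique would produce such a sequence in the opposite order.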
Emerging data mining applications place special requirements on clustering techniques, such as the ability to handle high dimensionality, assimilation of cluster descriptions by users, description minimization, and scalability and usability. Regarding high dimensionality, an object typically has dozens of attributes, and the domains of the attributes are large. Clusters formed in a high-dimensional data space are not likely to be meaningful clusters because the expected average density of points anywhere in the high-dimensional data space is low. The requirement for high dimensionality in a data mining application is conventionally addressed by requiring a user to specify the subspace for cluster analysis. For example, the IBM data mining product, Intelligent Miner, described in the IBM Intelligent Miner User's Guide, version 1 release 1, SH12-6213-00 edition, July 1996, and incorporated by reference herein, allows specification of "active" attributes for defining a subspace in which clusters are found. This approach is effective when a user can correctly identify appropriate attributes for clustering.
A variety of approaches for reducing dimensionality of a data space have been developed. Classical statistical techniques include principal component analysis and factor analysis, both of which reduce dimensionality by forming linear combinations of features. For example, see R. O. Duda et al., Pattern Classification and Scene Analysis, John Wiley and Sons, 1973, and K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990. For the principal component analysis technique, also known as Karhunen-Loeve expansion, a lower-dimensional representation is found that accounts for the variance of the attributes, whereas the factor analysis technique finds a representation that accounts for the correlations among the attributes. For an evaluation of different feature selection methods, primarily for image classification, see A. Jain et al., Algorithms for feature selection: An evaluation, Technical report, Department of Computer Science, Michigan State University, East Lansing, Mich., 1996. Unfortunately, dimensionality reductions obtained using these conventional approaches conflict with the requirements placed on the assimilation aspects of data mining.
Data mining applications often require cluster descriptions that can be assimilated and used by users because insight and explanations are the primary purpose for data mining. For example, see U. M. Fayyad et al., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996. Clusters having decision surfaces that are axis parallel and, hence, can be described as Disjunctive Normal Form (DNF) expressions, become particularly attractive for user assimilation. Nevertheless, even when a description is a DNF expression, there are clusters that are approximated poorly, such as a cigar-shaped cluster when the cluster description is restricted to be a rectangular box. On the other hand, the same criticism can also be raised against decision-tree and decision-rule classifiers, such as disclosed by S. M. Weiss et al., Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann, 1991. However, in practice, the classifiers exhibit competitive accuracies when compared to techniques, such as neural nets, that generate considerably more complex decision surfaces, as disclosed by D. Michie, Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994.
The merits of description minimization have been eloquently presented by J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific Publ. Co., 1989. The principal assertion, also known as Occam's razor, is that if two different solutions in the same representation describe the same data, the less complex solution is more accurate. Further, smaller descriptions lend themselves to user comprehension.
Lastly, a clustering technique should be fast and scale with the number of dimensions and the size of a dataset or input data. It should also be insensitive to the order in which data records are presented.
As a basis for better understanding the problem of finding clusters in subspaces, a formal model is presented. Let A={a.sub.1, a.sub.2, . . . , a.sub.m } be a set of literals, and let {A.sub.1, A.sub.2, . . . , A.sub.m } be a set of bounded domains. An element a.sub.i .epsilon.A is called an attribute and its domain is A.sub.i. Assume that (a.sub.1, a.sub.2, . . . , a.sub.m) defines a lexicographic ordering on the attributes in A. For now, only numeric attributes are assumed. An input consists of a set of m-dimensional vectors (points) V={v.sub.1, v.sub.2, . . . , v.sub.n }, where v.sub.i =<v.sub.i1, v.sub.i2, . . . , v.sub.im >. The jth component of vector v.sub.i is drawn from domain A.sub.j.
An m-dimensional data space S=A.sub.1 x A.sub.2 x . . . x A.sub.m can be viewed as being partitioned into non-overlapping rectangular units. Each unit has the form {r.sub.1, . . . , r.sub.m }, r.sub.j =<a.sub.j, l.sub.j, u.sub.j > such that l.sub.j .ltoreq.u.sub.j, where l.sub.j, u.sub.j .epsilon.A.sub.j and a.sub.j .noteq.a.sub.h for j.noteq.h. The units are obtained by partitioning each of the A.sub.i into intervals (e.g., equi-width or user-specified intervals). The partitioning scheme can be different for each A.sub.i.
A point v.sub.i =<v.sub.i1, v.sub.i2, . . . , v.sub.im > is said to be contained in a unit {r.sub.1, . . . , r.sub.m } if l.sub.j .ltoreq.v.sub.ij .ltoreq.u.sub.j for all r.sub.j. The density of a unit is defined to be the fraction of total data points contained in the unit. The average density of the data space S is the average of the densities of all the units included in S. A unit is defined to be dense if its density is greater than a .lambda. fraction of the average density of S, where .lambda. is a model parameter.
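The definitions of unit, density, and dense unit can be illustrated with a minimal sketch. It assumes an equi-width partitioning, identifies each unit by a tuple of interval indices, and places a point that lies on an interval boundary in exactly one unit; the function and parameter names (unit_of, dense_units, lows, highs, bins, lam) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import prod

def unit_of(point, lows, highs, bins):
    """Index tuple of the equi-width unit containing a point (assumed partitioning)."""
    idx = []
    for v, lo, hi, b in zip(point, lows, highs, bins):
        i = int((v - lo) / (hi - lo) * b)
        idx.append(min(max(i, 0), b - 1))  # a value on the upper bound falls in the last interval
    return tuple(idx)

def dense_units(points, lows, highs, bins, lam):
    """Units whose density exceeds lam times the average density of the data space."""
    counts = Counter(unit_of(p, lows, highs, bins) for p in points)
    n = len(points)
    # Every point lies in exactly one unit, so the unit densities sum to 1 and
    # their average over all units of the data space is 1 / (number of units).
    average_density = 1.0 / prod(bins)
    return {u for u, c in counts.items() if c / n > lam * average_density}
```

Units that contain no points have density zero and therefore never appear in the result, so only occupied units need to be counted.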
A cluster is a set of connected dense units. Two units are connected if they have a common face. Formally, two units, {r.sub.1, . . . , r.sub.m } and {r'.sub.1, . . . , r'.sub.m } are said to be connected if there are m-1 dimensions (assume 1, . . . , m-1 without loss of generality) such that r.sub.j =r'.sub.j for each of those dimensions and either u.sub.m =l'.sub.m or u'.sub.m =l.sub.m.
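Grouping dense units into clusters can be sketched as a connected-components search. The sketch assumes that units are identified by tuples of interval indices, so that two units share a common face exactly when their indices are equal in m-1 dimensions and differ by one in the remaining dimension. It also assumes that a cluster is taken to be a maximal set of dense units linked through chains of such connections. The function names are hypothetical.

```python
from collections import deque

def share_face(u, v):
    """True if two units agree in m-1 dimensions and are adjacent in the remaining one."""
    return sum(abs(a - b) for a, b in zip(u, v)) == 1

def clusters(dense):
    """Partition a set of dense units into maximal sets of connected units."""
    remaining = set(dense)
    result = []
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            # Visit the two face neighbors of u along every dimension.
            for d in range(len(u)):
                for step in (-1, 1):
                    v = u[:d] + (u[d] + step,) + u[d + 1:]
                    if v in remaining:
                        remaining.remove(v)
                        component.add(v)
                        queue.append(v)
        result.append(component)
    return result
```

Units that touch only at a corner, such as (1, 1) and (2, 2), are not connected under this definition and end up in different clusters unless a chain of face-sharing dense units links them.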
A region R in m dimensions is an axis-parallel rectangular m-dimensional set. That is, R={r.sub.1, . . . , r.sub.m }, r.sub.j =<a.sub.j, l.sub.j, u.sub.j > for l.sub.j, u.sub.j .epsilon.A.sub.j such that l.sub.j .ltoreq.u.sub.j and a.sub.j .noteq.a.sub.h for j.noteq.h. A unit is a special case of a region. Only those regions that can be expressed as unions of units are of interest; henceforth all references to a region herein mean such unions. The size of a region is the number of units contained in the region. Supersets and subsets of a region are regions in the same set of dimensions: a region R'={r'.sub.1, . . . , r'.sub.m } is a superset (respectively, subset) of a region R={r.sub.1, . . . , r.sub.m } if for each j, l'.sub.j .ltoreq.l.sub.j and u'.sub.j .gtoreq.u.sub.j (l'.sub.j .gtoreq.l.sub.j and u'.sub.j .ltoreq.u.sub.j, respectively).
A region R is said to be contained in a cluster C if R.andgate.C=R. A region can be expressed as a DNF expression on intervals of the domains A.sub.i. A region R contained in a cluster C is said to be maximal if no proper superset of R is contained in C. A minimal description of a cluster is a non-redundant covering of the cluster with maximal regions. That is, a minimal description of a cluster C is a set S of maximal regions of the cluster such that their union equals C, but the union of any proper subset of S does not equal C.
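The definitions of containment, maximal region, and minimal description can be checked mechanically, as in the following sketch. It assumes that a cluster is a set of unit index tuples and that a region is a tuple of inclusive (low, high) interval-index ranges, one per dimension; this representation and the function names are hypothetical. The sketch only verifies a proposed description and does not construct one.

```python
from itertools import product

def units_in(region):
    """All units covered by a region given as inclusive (low, high) index ranges."""
    return set(product(*(range(lo, hi + 1) for lo, hi in region)))

def contained(region, cluster):
    """A region R is contained in a cluster C when every unit of R is in C."""
    return units_in(region) <= cluster

def is_maximal(region, cluster):
    """True if the region is contained in the cluster and no proper superset is."""
    if not contained(region, cluster):
        return False
    # If any larger contained region existed, growing by one interval toward it
    # would also stay inside the cluster, so one-step growth is a sufficient check.
    for d, (lo, hi) in enumerate(region):
        for grown in ((lo - 1, hi), (lo, hi + 1)):
            if contained(region[:d] + (grown,) + region[d + 1:], cluster):
                return False
    return True

def is_minimal_description(regions, cluster):
    """True if the regions are maximal, cover the cluster, and none is redundant."""
    if not all(is_maximal(r, cluster) for r in regions):
        return False
    if set().union(*(units_in(r) for r in regions)) != cluster:
        return False
    for i in range(len(regions)):
        rest = regions[:i] + regions[i + 1:]
        if set().union(*(units_in(r) for r in rest)) == cluster:
            return False  # region i can be dropped, so the covering is redundant
    return True
```

For an S-shaped cluster made of the units (0, 0), (1, 0), (1, 1), and (2, 1), there are three maximal regions, but the middle one is covered by the other two, so only the two outer regions form a minimal description.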
Frequently, there is interest in identifying a cluster in a subset of the m dimensions in which there is at least one dense unit. Further, it is assumed that there is one global value of .lambda. for determining dense units. All foregoing definitions carry through after considering appropriate projections from the original data space to the subspace of interest. Note that if a unit is dense in a set of dimensions a.sub.1, . . . , a.sub.k, its projections in all subsets of this set of dimensions are also dense.
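The projection property noted above rests on a counting fact: every point contained in a unit is also contained in each projection of that unit, so the fraction of points in a projection is at least the fraction in the unit. The following sketch checks this fact on the unit indices of a set of points. It compares point fractions only, which corresponds to assuming that the same fixed fraction of the points is used as the density threshold in every subspace; that threshold convention and the function names are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations

def unit_fractions(unit_ids, dims):
    """Fraction of points falling in each unit of the subspace spanned by dims."""
    counts = Counter(tuple(u[d] for d in dims) for u in unit_ids)
    return {u: c / len(unit_ids) for u, c in counts.items()}

def projections_at_least_as_dense(unit_ids, dims):
    """Check that each unit's projections hold at least as large a fraction of points."""
    full = unit_fractions(unit_ids, dims)
    for k in range(1, len(dims)):
        for sub in combinations(range(len(dims)), k):
            sub_dims = tuple(dims[i] for i in sub)
            projected = unit_fractions(unit_ids, sub_dims)
            for unit, fraction in full.items():
                if projected[tuple(unit[i] for i in sub)] < fraction:
                    return False
    return True
```

Here unit_ids holds, for each point, the index tuple of the unit that contains it in the full data space, and dims lists the dimensions of the subspace being examined.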
The foregoing clustering model can be considered nonparametric in that mathematical forms are assumed neither for the data distribution nor for the clustering criteria. Instead, data points are separated according to the valleys of a density function, such as disclosed by K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990. One example of a density-based approach to clustering is DBSCAN, as disclosed by M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996. This approach defines a cluster as a maximal set of density-connected points. However, the application domain for DBSCAN is spatial databases, with interest in finding arbitrarily-shaped clusters.
Several other techniques for nonparametric clustering are based on estimating the density gradient for identifying valleys in the density function. These techniques are computationally expensive and generally result in complex cluster boundaries, although they may provide the most correct approach for certain data mining applications.
The problem of covering marked boxes in a grid with rectangles has been addressed in logic minimization by, for example, S. J. Hong, MINI: A heuristic algorithm for two-level logic minimization, Selected Papers on Logic Synthesis for Integrated Circuit Design, R. Newton, editor, IEEE Press, 1987. It is also related to the problem of finding a constructive solid geometry (CSG) formula in solid modeling, such as disclosed by D. Zhang et al., CSG set-theoretic solid modelling and NC machining of blend surfaces, Proceedings of the Second Annual ACM Symposium on Computational Geometry, pages 314-318, 1986. These techniques have also been applied for inducing decision rules from examples, such as disclosed by S. J. Hong, R-MINI: A heuristic algorithm for generating minimal rules from examples, 3rd Pacific Rim Int'l Conference on AI, August 1994. However, MINI and R-MINI are quadratic in the size of input (number of records). Computational geometry literature also contains algorithms for covering points in two or three dimensions with a minimum number of rectangles, for example, see D. S. Franzblau et al., An algorithm for constructing regions with rectangles: Independence and minimum generating sets for collections of intervals, Proc. of the 6th Annual Symp. on Theory of Computing, pages 268-276, Washington D.C., April 1984; R. A. Reckhow et al., Covering simple orthogonal polygon with a minimum number of orthogonally convex polygons, Proc. of the ACM 3rd Annual Computational Geometry Conference, pages 268-277, 1987; and V. Soltan et al., Minimum dissection of rectilinear polygon with arbitrary holes into rectangles, Proc. of the ACM 8th Annual Computational Geometry Conference, pages 296-302, Berlin, Germany, June 1992.
Some clustering algorithms used for image analysis also find rectangular dense regions, but they have been designed for low-dimensional datasets. For example, see M. Berger et al., An algorithm for point clustering and grid generation, IEEE Transactions on Systems, Man and Cybernetics, 21(5):1278-86, 1991; P. Schroeter et al., Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement, Pattern Recognition, 25(5):695-709, May 1995; and S. Wharton, A generalized histogram clustering for multidimensional image data, Pattern Recognition, 16(2):193-199, 1983.
What is needed is an approach for automatically identifying subspaces in which clusters are found in a multi-dimensional data space, that provides description assimilation for a user and description minimization, and that scales as the size of the data space increases.