The present invention concerns database analysis and more particularly concerns an apparatus and method for clustering of data into groups that capture important regularities and characteristics of the data.
Large data sets are now commonly used in most business organizations. In fact, so much data has been gathered that asking even a simple question about the data has become a challenge. The modern information revolution is creating huge data stores which, instead of offering increased productivity and new opportunities, are threatening to drown the users in a flood of information. Tapping into large databases for even simple browsing can result in an explosion of irrelevant and unimportant facts. Even people who do not 'own' large databases face the overload problem when accessing databases on the Internet. A large challenge now facing the database community is how to sift through these databases to find useful information.
Existing database management systems (DBMS) perform the steps of reliably storing data and retrieving the data using a data access language, typically SQL. One major use of database technology is to help individuals and organizations make decisions and generate reports based on the data contained in the database.
An important class of problems in the areas of decision support and reporting are clustering (segmentation) problems where one is interested in finding groupings (clusters) in the data. Clustering has been studied for decades in statistics, pattern recognition, machine learning, and many other fields of science and engineering. However, implementations and applications have historically been limited to small data sets with a small number of dimensions.
Each cluster includes records that are more similar to members of the same cluster than they are to the rest of the data. For example, in a marketing application, a company may want to decide whom to target for an ad campaign based on historical data about a set of customers and how they responded to previous campaigns. Other examples of such problems include fraud detection, credit approval, diagnosis of system problems, diagnosis of manufacturing problems, recognition of event signatures, etc. Employing analysts (statisticians) to build cluster models is expensive, and often not effective for large problems (large data sets with large numbers of fields). Even trained scientists can fail in the quest for reliable clusters when the problem is high-dimensional (i.e. the data has many fields, say more than 20).
Clustering is a necessary step in the mining of large databases as it represents a means for finding segments of the data that need to be modeled separately. This is an especially important consideration for large databases, where a global model of the entire data typically makes no sense because the data represents multiple populations that need to be modeled separately. Random sampling cannot help in deciding what the clusters are. Finally, clustering is an essential step if one needs to perform density estimation over the database (i.e. model the probability distribution governing the data source). Applications of clustering are numerous and include the following broad areas: data mining, data analysis in general, data visualization, sampling, indexing, prediction, and compression. Specific applications in data mining include marketing, fraud detection (in credit cards, banking, and telecommunications), customer retention and churn minimization (in all sorts of services including airlines, telecommunication services, internet services, and web information services in general), direct marketing on the web, and live marketing in electronic commerce.
Clustering is an important area of application for a variety of fields including data mining, statistical data analysis, compression, and vector quantization. Clustering has been formulated in various ways. The fundamental clustering problem is that of grouping together (clustering) data items that are similar to each other. The most general approach to clustering is to view it as a density estimation problem. We assume that in addition to the observed variables for each data item, there is a hidden, unobserved variable indicating the "cluster membership" of the given data item. Hence the data is assumed to arrive from a mixture model and the mixing labels (cluster identifiers) are hidden. In general, a mixture model M having K clusters Ci, i=1, . . . , K, assigns a probability to a data point x as follows:
$$\Pr(x \mid M) = \sum_{i=1}^{K} W_i \cdot \Pr(x \mid C_i, M)$$
where Wi are called the mixture weights. The problem then is estimating the parameters of the individual Ci. Usually it is assumed that the number of clusters K is known and the problem is to find the best parameterization of each cluster model. A popular technique for estimating the model parameters (including cluster parameters and mixture weights) is the EM algorithm (see P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results", in Advances in Knowledge Discovery and Data Mining, Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), pp. 153-180. MIT Press, 1996; and A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM algorithm". Journal of the Royal Statistical Society, Series B, 39(1): 1-38, 1977). There are various approaches to solving the optimization problem of determining (locally) optimal values of the parameters given the data. The iterative refinement approaches are the most effective. The basic algorithm goes as follows:
1. Initialize the model parameters, producing a current model.
2. Decide memberships of the data items to clusters, assuming that the current model is correct.
3. Re-estimate the parameters of the current model assuming that the data memberships obtained in 2 are correct, producing a new model.
4. If the current model and the new model are sufficiently close to each other, terminate, else go to 2.
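The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure claimed here: it assumes each cluster is modeled by a mean vector, that membership in step 2 is decided by the nearest mean, and that "sufficiently close" in step 4 means the largest movement of any mean falls below a tolerance. The function name `iterative_refinement` and the parameters `tol` and `max_iter` are hypothetical.

```python
import math

def iterative_refinement(points, init_means, tol=1e-6, max_iter=100):
    """Generic iterative refinement with nearest-mean membership (assumed)."""
    # Step 1: initialize the model parameters (here: one mean per cluster).
    means = [list(m) for m in init_means]
    for _ in range(max_iter):
        # Step 2: decide memberships assuming the current model is correct.
        labels = [min(range(len(means)), key=lambda k: math.dist(p, means[k]))
                  for p in points]
        # Step 3: re-estimate each mean from the points assigned to it.
        new_means = []
        for k, old in enumerate(means):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_means.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # Assumption: an empty cluster simply keeps its old mean.
                new_means.append(old)
        # Step 4: terminate when the new model is close to the current one.
        shift = max(math.dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:
            break
    return means, labels
```

For example, `iterative_refinement([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` returns the means `[[0.0, 0.5], [10.0, 10.5]]` and the labels `[0, 0, 1, 1]`. Note that every pass of the loop revisits every data item, which is the cost that becomes prohibitive for large databases.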
The most popular clustering algorithms in the pattern recognition and statistics literature belong to the above iterative refinement family: the K-Means algorithm (J. MacQueen, "Some methods for classification and analysis of multivariate observations", in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume I, Statistics, L. M. Le Cam and J. Neyman (Eds.). University of California Press, 1967) and the EM algorithm cited above. There are many variants of these that iteratively refine a model by rescanning the data many times. These algorithms have recently found many applications in industry and science. The difference between EM and K-Means is the membership decision (step 2). In K-Means, a data item belongs to a single cluster, while in EM each data item is assumed to belong to every cluster but with a different probability. This of course affects the update step (3) of the algorithm. In K-Means each cluster is updated based strictly on its membership. In EM each cluster is updated by the entire data set according to the relative probability of membership.
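The contrast between the two membership decisions can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not part of the disclosure: the EM variant assumes spherical Gaussian clusters sharing one standard deviation `sigma`, so that the Gaussian normalizing constant cancels, and it assumes strictly positive mixture weights. The function names are hypothetical.

```python
import math

def hard_membership(x, means):
    """K-Means style: x belongs entirely to the cluster with the nearest mean."""
    nearest = min(range(len(means)), key=lambda k: math.dist(x, means[k]))
    return [1.0 if k == nearest else 0.0 for k in range(len(means))]

def soft_membership(x, means, weights, sigma=1.0):
    """EM style: probability that x belongs to each cluster.

    Assumes spherical Gaussians with a shared standard deviation sigma
    and strictly positive mixture weights (both are assumptions).
    """
    # log(W_i * Pr(x | C_i)) up to a constant common to all clusters.
    logs = [math.log(w) - math.dist(x, m) ** 2 / (2 * sigma ** 2)
            for w, m in zip(weights, means)]
    top = max(logs)                      # subtract the max to avoid underflow
    scores = [math.exp(v - top) for v in logs]
    total = sum(scores)
    return [s / total for s in scores]
```

With means `[(0, 0), (2, 0)]` and equal weights, the point `(0.5, 0)` gets the hard membership `[1.0, 0.0]`, while its soft membership gives a nonzero probability to both clusters, with the larger share going to the first. In the update step, K-Means would average only the points whose hard membership selects a cluster, whereas EM would average all points weighted by these probabilities.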
The invention represents a methodology for scaling clustering algorithms to large databases. The invention enables effective and accurate clustering in one scan or less of a database. Use of the invention results in significantly better performance than prior art schemes that are based on random sampling. These results are achieved with significantly lower memory requirements than running the clustering algorithm on the entire database, and with acceptable accuracy in terms of approaching the solution that such a full run would produce.
Known methods can only address small databases (ones that fit in memory) or resort to sampling only a fraction of the data. The disclosed invention is based on the concept of retaining in memory only the data points that need to be present in memory. The majority of the data points are summarized into a condensed representation that represents their sufficient statistics. By analyzing a mixture of sufficient statistics and actual data points, the invention achieves significantly better clustering results than random sampling methods, with similarly low memory requirements. The invention can typically terminate well before scanning all the data in the database, hence gaining a major advantage over other scalable clustering methods that require at a minimum a full data scan.
The invention concerns a framework that supports a wide class of clustering algorithms. The K-Means algorithm is used as an example clustering algorithm that represents one specific embodiment of this framework. The framework is intended to support a variety of algorithms that can be characterized by iteratively scanning data and updating models. We use K-Means since it is a well-known and established clustering method, originally known as Forgy's method, and has been used extensively in pattern recognition. It is a standard technique for clustering, used in a wide array of applications and even as a way to initialize the more expensive EM clustering algorithm.
When working over a large data store, one needs to pay particular attention to certain issues of data access. A clustering session may take days or weeks, and it is often desirable to update existing models as data arrives. A list of desirable data mining characteristics follows; the invention satisfies all of them:
1. Clustering should run within one scan (or less) of the database if possible: a single data scan is considered costly, so early termination, if appropriate, is highly desirable.
2. On-line "anytime" behavior: a "best" answer is always available from the system, with status information on progress, expected remaining time, etc.
3. Suspendable, stoppable, resumable; incremental progress is saved so that a stopped job can be resumed.
4. An ability to incrementally incorporate additional data with existing models efficiently.
5. Should work within confines of a given limited RAM buffer.
6. Should utilize a variety of possible scan modes: sequential, index, and sampling scans if available.
7. Should have the ability to operate with a forward-only cursor over a view of the database. This is necessary since the database view may be a result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).
The technique embodied in the invention relies on the observation that clustering techniques do not need to rescan all the data items, as they are originally defined and as implemented in popular literature and statistical libraries and analysis packages. The disclosed process may be viewed as an intelligent sampling scheme that employs some theoretically justified criteria for deciding which data can be summarized and represented by a significantly compressed set of sufficient statistics, and which data items must be carried in computer memory, hence occupying a valuable resource. On any given iteration of the invention, we partition the existing data samples into three subsets: a discard set (DS), a compression set (CS), and a retained set (RS). For the first two sets, we discard the data but keep representative sufficient statistics that summarize the subsets. The last set, RS, is kept in memory. The DS is summarized in a single set of sufficient statistics. The compression set CS is summarized by multiple sufficient statistics representing subclusters of the CS data set.
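To illustrate how a group of discarded points might be replaced by a compact summary, the following sketch keeps a count, a per-dimension sum, and a per-dimension sum of squares. This passage does not say which statistics are stored, so that choice is an assumption made for illustration, and the class name `SufficientStats` and its methods are hypothetical. These three quantities are enough to recover the mean and variance of the summarized points without keeping the points themselves.

```python
class SufficientStats:
    """Compact summary of a set of points (assumed form: N, SUM, SUMSQ)."""

    def __init__(self, dim):
        self.n = 0                     # number of summarized points
        self.total = [0.0] * dim       # per-dimension sum
        self.total_sq = [0.0] * dim    # per-dimension sum of squares

    def add(self, point):
        """Fold one point into the summary; the point can then be discarded."""
        self.n += 1
        for j, v in enumerate(point):
            self.total[j] += v
            self.total_sq[j] += v * v

    def merge(self, other):
        """Combine two summaries, e.g. when merging subclusters."""
        self.n += other.n
        for j in range(len(self.total)):
            self.total[j] += other.total[j]
            self.total_sq[j] += other.total_sq[j]

    def mean(self):
        return [s / self.n for s in self.total]

    def variance(self):
        # Population variance per dimension: E[x^2] - (E[x])^2.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.total, self.total_sq)]
```

The memory cost of such a summary depends only on the number of dimensions, not on the number of points it represents, which is what allows the bulk of the data to be dropped from the buffer.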
The invention operates by obtaining a next available (possibly random) sample from a database to fill free space in a buffer. A current model of the clustering is then updated over the contents of the buffer. Elements of the new sample are examined to determine whether they need to be retained in the buffer (retained set RS), can be discarded with updates to the sufficient statistics (discard set DS), or can be reduced via compression and summarized as sufficient statistics (compression set CS). Once this has been done, a determination is made as to whether a stopping criterion is satisfied. If so, clustering terminates; if not, more data is sampled.
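The loop just described can be sketched as follows. This is a deliberately simplified illustration and not the disclosed method: it uses a K-Means style model, performs one refinement step per buffer fill, discards a point when it lies within a fixed `discard_radius` of its cluster mean (a stand-in for the theoretically justified criteria, which are not given in this passage), and omits the compression set entirely. All names and parameters are hypothetical.

```python
import itertools
import math

def scalable_kmeans(rows, init_means, buffer_size=100,
                    discard_radius=1.0, tol=1e-3):
    """Single-pass clustering sketch over a forward-only iterator of rows."""
    k, dim = len(init_means), len(init_means[0])
    means = [list(m) for m in init_means]
    ds_n = [0] * k                              # DS: count per cluster
    ds_sum = [[0.0] * dim for _ in range(k)]    # DS: sum per cluster
    retained = []                               # RS: points kept in memory
    rows = iter(rows)
    while True:
        free = buffer_size - len(retained)
        if free == 0:
            break  # buffer is all RS; a fuller version would compress (CS)
        fresh = list(itertools.islice(rows, free))
        if not fresh:
            break  # no more data available
        buffer = retained + fresh
        # Update the model over the buffer plus the DS summaries.
        labels = [min(range(k), key=lambda c: math.dist(p, means[c]))
                  for p in buffer]
        new_means = []
        for c in range(k):
            n, total = ds_n[c], list(ds_sum[c])
            for p, lab in zip(buffer, labels):
                if lab == c:
                    n += 1
                    total = [t + v for t, v in zip(total, p)]
            new_means.append([t / n for t in total] if n else means[c])
        shift = max(math.dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        # Partition the buffer: summarize close points (DS), keep the rest (RS).
        retained = []
        for p, lab in zip(buffer, labels):
            if math.dist(p, means[lab]) <= discard_radius:
                ds_n[lab] += 1
                ds_sum[lab] = [t + v for t, v in zip(ds_sum[lab], p)]
            else:
                retained.append(p)
        if shift < tol:
            break  # assumed stopping criterion: the model has stabilized
    return means, ds_n, retained
```

Because `rows` is consumed strictly in order and never rewound, the sketch is compatible with a forward-only cursor, and the early `break` on a stable model is what allows termination before a full scan.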
The exemplary embodiment satisfies the above-mentioned important issues faced during data mining. A clustering session on a large database can take days or even weeks. It is often desirable to update the clustering models as the data arrives and is stored. It is important in this data mining environment to be able to cluster in one scan (or less) of the database. A single scan is considered costly and clustering termination before one complete scan is highly desirable.
An exemplary embodiment of the invention includes a model optimizer. Multiple different clustering models are simultaneously generated in one scan or less of the database. The clustering analysis stops when one of the models satisfies a stopping criterion. Alternatively, the clustering can continue until all of the multiple models are complete as judged by the stopping criterion.
These and other objects, advantages and features of the invention will be better understood from a detailed description of an exemplary embodiment of the invention which is described in conjunction with the accompanying drawings.