Large data sets are now commonly used by many business organizations. In fact, so much data is contained in large database that asking even a simple question about the data has become a challenge. Large data stores can overwhelm users accessing the data with a flood of information. Tapping into large databases for even simple browsing can result in a return of irrelevant and unimportant facts. Even people who do not ‘own’ large databases face the overload problem when accessing databases on the Internet. A large challenge now facing the database community is how to sift through these databases to find useful information.
Existing database management systems (DBMS) perform the steps of reliably storing data and retrieving the data using a data access language, typically SQL. One major use of database technology is to help individuals and organizations make decisions and generate reports based on the data contained in the database.
An important class of problems in the areas of decision support and reporting are clustering (segmentation) problems where one is interested in finding groupings (clusters) in the data. Data clustering has been used in statistics, pattern recognition, machine learning, and many other fields of science and engineering. However, implementations and applications have historically been limited to small data sets with a small number of dimensions or fields.
Each data cluster includes records that are more similar to members of the same cluster than they are similar to the rest of the data. For example, in a marketing application, a company may want to decide who to target for an ad campaign based on historical data about a set of customers and how they responded to previous campaigns. Employing analysts (statisticians) to build cluster models is expensive, and often not effective for large problems (large data sets with large numbers of fields). Even trained scientists can fail in the quest for reliable clusters when the problem is high-dimensional (i.e. the data has many fields, say more than 20).
Applications of clustering are numerous and include the following broad areas: data mining, data analysis in general, data visualization, sampling, indexing, prediction, and compression. Specific applications in data mining including marketing, fraud detection (in credit cards, banking, and telecommunications), customer retention and churn minimization (in all sorts of services including airlines, telecommunication services, internet services, and web information services in general), direct marketing on the web and live marketing in Electronic Commerce.
Clustering has been formulated in various ways. The fundamental clustering problem is that of grouping together (clustering) data items that are similar to each other. The most general approach to clustering is to view it as a density estimation problem. We assume that in addition to the observed variables for each data item, there is a hidden, unobserved variable indicating the “cluster membership” of the given data item. Hence the data is assumed to arrive from a mixture model and the mixing labels (cluster identifiers) are hidden. In general, a mixture model M having K clusters Ci, i=1, . . . , K, assigns a probability to a particular data record or point x as follows:
  Pr  (            x      ⁢                      M        )              =                  ∑                  i          =          1                K            ⁢                          ⁢                        W          i                ·                  Pr          (                      x            ⁢                                                                          C                  i                                ,                M                            )                                          where Wi are called the mixture weights.The problem then is estimating the parameters of the individual Ci. Usually it is assumed that the number of clusters K is known and the problem is to find the best parameterization of each cluster model. A popular technique for estimating the model parameters (including cluster parameters and mixture weights) is the EM algorithm (see P. Cheeseman and J. Stutz, “Bayesian Classification (AutoClass): Theory and Results”, in Advances in Knowledge Discovery and Data Mining, Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy(Eds.), pp. 153-180. MIT Press, 1996; and A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM algorithm”. Journal of the Royal statistical Society, Series B, 39(1): 1-38, 1977).
There are various approaches to solving the optimization problem of determining (locally) optimal values of the parameters given the data. The iterative refinement approaches are the most effective. The basic algorithm goes as follows:
1. Initialize the model parameters, producing a current model.
2. Decide memberships of the data items to clusters, assuming that the current model is correct.
3. Re-estimate the parameters of the current model assuming that the data memberships obtained in 2 are correct, producing a new model.
4. If the current model and the new model are sufficiently close to each other, terminate, else go to 2. In this step ‘close’ is evaluated by a predefined one of multiple possible stopping criteria.
The most popular clustering algorithms in the pattern recognition and statistics literature belong to the above iterative refinement family: the K-Means algorithm. See E. Forgy, “Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications”, Biometrics 21:768. 1965 or J. MacQueen, “Some methods for classification and analysis of multivariate observations.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume I, Statistics, L. M. Le Cam and J. Neyman (Eds.). University of California Press, 1967.
There are other variants of clustering procedures that iteratively refine a model by rescanning the data many times. The difference between the EM and K-Means is the membership decision (step 2). In K-Means, a data item belongs to a single cluster, while in EM each data item is assumed to belong to every cluster but with a different probability. This of course affects the update step (3) of the algorithm. In K-Means each cluster is updated based strictly on its membership. In EM each cluster is updated by contributions from the entire data set according to the relative probability of membership of each data record in the various clusters.
Data records stored in a database typically contain a number of attributes that relate to different characteristics of a common observation or occurrence. Often the attributes for a given collection of data records are a mixture of discrete and continuous data fields. Discrete data attributes refers to instances wherein the values of a particular field in the database are finite and not ordered. For instance, color is a discrete feature having possible values {green, blue, red, white, black} and it makes no sense to impose an ordering on these values (i.e. green>blue?). Continuous attributes are fields of a data record that can be ordered, summed, averaged etc. Age is an example of such an attribute. A census database of records for individuals would typically contain a mixture of discrete and continuous data attributes.