The present invention concerns database analysis and more particularly concerns apparatus and method for choosing a cluster number for use while clustering data into data groupings that characterize the data.
Large data sets are now commonly used in most business organizations. In fact, so much data has been gathered that asking even a simple question about the data has become a challenge. The modern information revolution is creating huge data stores which, instead of offering increased productivity and new opportunities, are threatening to drown the users in a flood of information. Accessing data in large databases for even simple browsing can result in an explosion of irrelevant and unimportant facts. Users who do not ‘own’ large databases face the overload problem when accessing databases on the Internet. A large challenge now facing the database community is how to sift through these databases to find useful information.
Existing database management systems (DBMS) perform the steps of reliably storing data and retrieving the data using a data access language, typically SQL. One major use of database technology is to help individuals and organizations make decisions and generate reports based on the data contained in the database.
An important class of problems in the areas of decision support and reporting are clustering (segmentation) problems where one is interested in finding groupings (clusters) in the data. Clustering has been studied for decades in statistics, pattern recognition, machine learning, and many other fields of science and engineering. However, implementations and applications have historically been limited to small data sets with a small number of dimensions or fields.
Each cluster includes records that are more similar to members of the same cluster than they are to the rest of the data. For example, in a marketing application, a company may want to decide whom to target for an ad campaign based on historical data about a set of customers and how the customers responded to previous ad campaigns. Other examples of such problems include: fraud detection, credit approval, diagnosis of system problems, diagnosis of manufacturing problems, recognition of event signatures, etc. Employing analysts (statisticians) to build cluster models is expensive, and often not effective for large problems (large data sets with large numbers of fields). Even trained scientists can fail in the quest for reliable clusters when the problem is high-dimensional (i.e. the data has many fields, say more than 20).
A goal of automated analysis of large databases is to extract useful information such as models or predictors from the data stored in the database. One of the primary operations in data mining is clustering (also known as database segmentation). Clustering is a necessary step in the mining of large databases as it represents a means for finding segments of the data that need to be modeled separately. This is an especially important consideration for large databases where a global model of the entire data typically makes no sense as data represents multiple populations that need to be modeled separately. Random sampling cannot help in deciding what the clusters are. Finally, clustering is an essential step if one needs to perform density estimation over the database (i.e. model the probability distribution governing the data source).
Applications of clustering are numerous and include the following broad areas: data mining, data analysis in general, data visualization, sampling, indexing, prediction, and compression. Specific applications in data mining include marketing, fraud detection (in credit cards, banking, and telecommunications), customer retention and churn minimization (in all sorts of services including airlines, telecommunication services, internet services, and web information services in general), direct marketing on the web and live marketing in Electronic Commerce.
Many clustering algorithms assume that the number of clusters (usually denoted by the integer K) is known prior to performing the clustering. These prior art clustering procedures then attempt to find a best way to partition the data into the K clusters. In the case where the number of clusters is not known before clustering is started, an outer evaluation loop can be employed which produces, for each value of K, a clustering solution with K clusters or partitions. This solution is then evaluated by a clustering criterion, and the value of K producing the best results according to this criterion is chosen for the clustering model.
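The outer evaluation loop described above can be sketched as follows. This is only an illustration under stated assumptions: the one-dimensional k-means routine, the function names `kmeans_1d`, `sse`, and `choose_k_outer_loop`, and the penalized sum-of-squared-error criterion are hypothetical stand-ins, since the text does not name a particular clustering procedure or criterion.

```python
def kmeans_1d(data, k, iters=50):
    """Plain 1-D k-means with a deterministic, evenly spaced initialization."""
    ordered = sorted(data)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in ordered:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            groups[nearest].append(x)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers


def sse(data, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((x - c) ** 2 for c in centers) for x in data)


def choose_k_outer_loop(data, k_values, penalty):
    """Run a full clustering for every K and keep the best-scoring one.

    The score (SSE plus penalty * K) is an assumed criterion, not one
    taken from the text. Lower is better.
    """
    best = None
    for k in k_values:
        centers = kmeans_1d(data, k)  # one complete clustering run per K
        score = sse(data, centers) + penalty * k
        if best is None or score < best[0]:
            best = (score, k, centers)
    return best[1], best[2]
```

Each candidate K costs one complete clustering run over the data, which is the expense the following paragraph refers to.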
The computational burden of applying this approach in the clustering of large-scale databases is very high. Given a database of even modest size, the time required for prior art clustering procedures to run for a fixed number of clusters K can be many hours. Iterating over many values of K can result in days of computing to determine a best clustering model. In many real-world applications, the number of clusters residing in a given database is unknown and in many instances difficult to estimate. This is especially true if the number of fields or dimensions in the database is large.
The present invention determines a cluster number K using an incremental process that is integrated with a scalable clustering process particularly suited for clustering large databases. The invention allows for an adjustment of the cluster number K during the clustering process without rescanning data from the database. Unlike prior art looping processes, the computational complexity added by the exemplary choice of K process is not unduly burdensome.
A process performed in accordance with the invention starts with an existing cluster number K and explores the suitability of other cluster numbers differing slightly from K using a test set of data obtained from the database. A good estimate of the number of true clusters that are contained in the database is found, even though an initial choice of the cluster number K is not optimal.
A computer system operating in accordance with an exemplary embodiment of the invention computes a candidate cluster set for characterizing a database of data stored on a storage medium. The candidate cluster set includes two or more clustering models that have different numbers of clusters in their models. A data portion is obtained from the database and it is then used to determine the fit of data to each clustering model within the candidate cluster set. The clustering model best fitting the test data is chosen as the optimal clustering model from the candidate cluster set. The cluster number from the selected clustering model is used to update the clustering model output by the computer. Other data portions are obtained from the database and the process of updating the cluster model continues. This updating uses the newly sampled data from the database and sufficient statistics stored in memory derived from other data gathering steps until a specified clustering criterion has been satisfied.
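A minimal sketch of selecting among the candidate cluster set is given below. It assumes, purely for illustration, that each clustering model is a one-dimensional Gaussian mixture stored as a list of `(weight, mean, variance)` tuples and that "fit" means the log-likelihood of the test data. The text does not specify the model family or the fit measure, and the names `log_likelihood` and `select_model` are hypothetical.

```python
import math


def log_likelihood(points, model):
    """Log-likelihood of the points under a 1-D Gaussian mixture.

    `model` is a list of (weight, mean, variance) tuples, which is an
    assumed representation of a clustering model.
    """
    total = 0.0
    for x in points:
        density = sum(
            w * math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
            for w, mu, var in model
        )
        total += math.log(max(density, 1e-300))  # floor guards against log(0)
    return total


def select_model(candidates, test_data):
    """Return the candidate model that best fits the test data."""
    return max(candidates, key=lambda model: log_likelihood(test_data, model))


# Candidate cluster set with K = 1, 2, and 3 clusters (made-up parameters).
candidates = [
    [(1.0, 2.5, 7.0)],
    [(0.5, 0.0, 0.05), (0.5, 5.0, 0.05)],
    [(0.4, 0.0, 0.05), (0.4, 5.0, 0.05), (0.2, 10.0, 0.05)],
]
test_data = [0.1, -0.2, 0.0, 4.9, 5.2, 5.0]
best = select_model(candidates, test_data)
cluster_number = len(best)  # cluster number carried forward into further clustering
```

Because the test data here contains two groups, the two-cluster candidate scores highest; the third candidate is penalized for assigning weight to a cluster that no test point supports.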
The process of updating the clustering number is based on evaluating a holdout (test) data set to determine if one of the candidate set of cluster models fits the data from the holdout (test) set better than the current model. The holdout data set is read from the database. It can either be used exclusively for cluster number evaluation or it can be used to create a cluster model after it has been used to determine an appropriate model.
The sufficient statistics are maintained in a computer buffer and represent data from the database used in creating a current clustering model. In accordance with one embodiment of the invention, clusters that make up a current clustering model are evaluated as candidate clusters for removal from the current clustering model. This reduces the cluster number. The candidate clustering model (having reduced cluster number) is used to recluster the data from the sufficient statistics stored in a computer buffer. The fit of the data in the holdout (test) set to the candidate clustering model is then compared with the fit to the current clustering model. If the holdout data set fits the candidate clustering model better than the current model, the cluster number is reduced.
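The reduction step can be sketched as follows. The sketch makes several assumptions that the text does not confirm: data is one-dimensional, each sufficient-statistics entry is a triple `(n, sum, sum_of_squares)`, reclustering is a k-means-style pass over the buffered summaries, and fit is the holdout log-likelihood under a Gaussian mixture. All function names are hypothetical.

```python
import math


def merge(a, b):
    """Sufficient statistics are additive: combine two summaries."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean(s):
    return s[1] / s[0]


def variance(s):
    return max(s[2] / s[0] - mean(s) ** 2, 1e-6)  # floor avoids zero variance


def recluster(buffer, centers, iters=10):
    """Assign buffered summaries to the nearest center and refit, without raw data."""
    model = []
    for _ in range(iters):
        groups = {}
        for sub in buffer:
            j = min(range(len(centers)), key=lambda i: (mean(sub) - centers[i]) ** 2)
            groups[j] = merge(groups[j], sub) if j in groups else sub
        model = [groups[j] for j in sorted(groups)]
        centers = [mean(c) for c in model]
    return model


def holdout_fit(holdout, model):
    """Log-likelihood of the holdout points under the mixture implied by the model."""
    total_n = sum(c[0] for c in model)
    fit = 0.0
    for x in holdout:
        density = sum(
            (c[0] / total_n)
            * math.exp(-((x - mean(c)) ** 2) / (2 * variance(c)))
            / math.sqrt(2 * math.pi * variance(c))
            for c in model
        )
        fit += math.log(max(density, 1e-300))
    return fit


def try_reduce(model, buffer, holdout):
    """Try removing each cluster in turn; keep a candidate only if it fits better."""
    best_model, best_fit = model, holdout_fit(holdout, model)
    for i in range(len(model)):
        centers = [mean(c) for j, c in enumerate(model) if j != i]
        if not centers:
            continue  # never reduce below one cluster
        candidate = recluster(buffer, centers)
        fit = holdout_fit(holdout, candidate)
        if fit > best_fit:
            best_model, best_fit = candidate, fit
    return best_model


# Four buffered subclusters drawn from two true groups (near 0 and near 5).
buffer = [(2, -0.2, 0.04), (2, 0.4, 0.10), (2, 9.8, 48.04), (2, 10.4, 54.10)]
current = [buffer[0], buffer[1], merge(buffer[2], buffer[3])]  # K = 3, one group split
holdout = [0.0, 0.05, 0.1, 5.0, 5.05, 5.1]
reduced = try_reduce(current, buffer, holdout)
```

Note that `recluster` touches only the buffered summaries, which is what lets the cluster number change without rescanning the database.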
The sufficient statistics summarize a number of data subclusters. In an alternate embodiment of the invention these subclusters serve as potential additional clusters that will increase the cluster number of the current clustering model. A candidate cluster model is formed by adding one or more subclusters to the current model and reclustering the data using the sufficient statistics in the computer buffer. The resulting candidate cluster model is used to evaluate the fit of the holdout data set. If the candidate clustering model better fits the data, then the candidate clustering model (having larger cluster number) becomes the current clustering model for use in further clustering.
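The growth step can be sketched in the same spirit. As before, the one-dimensional data, the `(n, sum, sum_of_squares)` summary triple, the k-means-style reclustering, and the log-likelihood fit measure are illustrative assumptions, and the function names are hypothetical. For brevity the sketch promotes only one subcluster per attempt, whereas the text allows one or more.

```python
import math


def merge(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean(s):
    return s[1] / s[0]


def variance(s):
    return max(s[2] / s[0] - mean(s) ** 2, 1e-6)  # floor avoids zero variance


def recluster(buffer, centers, iters=10):
    """Assign buffered summaries to the nearest center and refit, without raw data."""
    model = []
    for _ in range(iters):
        groups = {}
        for sub in buffer:
            j = min(range(len(centers)), key=lambda i: (mean(sub) - centers[i]) ** 2)
            groups[j] = merge(groups[j], sub) if j in groups else sub
        model = [groups[j] for j in sorted(groups)]
        centers = [mean(c) for c in model]
    return model


def holdout_fit(holdout, model):
    """Log-likelihood of the holdout points under the mixture implied by the model."""
    total_n = sum(c[0] for c in model)
    fit = 0.0
    for x in holdout:
        density = sum(
            (c[0] / total_n)
            * math.exp(-((x - mean(c)) ** 2) / (2 * variance(c)))
            / math.sqrt(2 * math.pi * variance(c))
            for c in model
        )
        fit += math.log(max(density, 1e-300))
    return fit


def try_grow(model, buffer, holdout):
    """Promote each subcluster to a new cluster in turn; keep the best improvement."""
    best_model, best_fit = model, holdout_fit(holdout, model)
    base_centers = [mean(c) for c in model]
    for sub in buffer:
        candidate = recluster(buffer, base_centers + [mean(sub)])
        if len(candidate) != len(model) + 1:
            continue  # the added center attracted nothing or emptied another cluster
        fit = holdout_fit(holdout, candidate)
        if fit > best_fit:
            best_model, best_fit = candidate, fit
    return best_model


# One overly broad cluster covering two true groups (near 0 and near 5).
buffer = [(2, -0.2, 0.04), (2, 0.4, 0.10), (2, 9.8, 48.04), (2, 10.4, 54.10)]
current = [(8, 20.4, 102.28)]  # K = 1
holdout = [0.0, 0.1, 5.0, 5.1]
grown = try_grow(current, buffer, holdout)
```

Seeding the extra center from an existing subcluster means the candidate starts from a region of the data that is already known to be dense, rather than from an arbitrary point.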