This invention relates generally to the clustering of very large data sets, such as transactional, categorical, and binary data, and more particularly to such clustering using an iterative validation with selective iterative sampling approach based on error-tolerant frequent itemsets.
With the advent of the Internet, and especially electronic commerce ("e-commerce") over the Internet, the use of data analysis tools, such as data mining, has increased. In e-commerce and other Internet and non-Internet applications, databases are generated and maintained that have large amounts of information, so that they can be analyzed, or "mined," to learn additional information regarding customers, users, products, etc. That is, data mining tools provide for leveraging the data already contained in databases to learn new insights regarding the data by uncovering patterns, relationships, or correlations that might help a business provide improved services, target product offerings, customize web sites, or understand how people use its web site.
A common data analysis operation is data clustering, which is also known within the art as database segmentation or data segmentation. Clustering aims to identify groups of records such that the members of each cluster are more similar to one another than they are to records belonging to other clusters. Clustering determines groups of customers (or transactions, or baskets), as opposed to frequent itemset analysis, which determines groups of items within the data. Frequent itemsets are sets of items (combinations of attributes of entities or records) that occur with significant frequency in the data. Finding clusters, such as groups of users, groups of purchasers, etc., instead of focusing on just the items themselves (products or web pages), allows for new insights to be obtained from the data. For example, a purchaser who has purchased items X and Y may be predicted as likely to purchase item Z, because other purchasers who purchased items X and Y also purchased item Z. Thus, a retailer knowing this information may be prompted to advertise item Z directly to this purchaser, or perhaps to make a special offer or customized coupon for it, or even to discount one of the items and hope to make margin on the other correlated items in the cluster.
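As a minimal illustration of the frequent itemset notion described above, the following Python sketch counts how often each small combination of items occurs across a handful of baskets. The function name `frequent_itemsets`, the `min_support` threshold, and the toy baskets are assumptions made purely for illustration; the brute-force enumeration is not meant to reflect any method of the invention and would not scale to large databases.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items.

    Support is the fraction of baskets that contain every item in the
    itemset. Names and thresholds here are illustrative assumptions.
    """
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    n = len(baskets)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Toy data: three purchasers buy X and Y, and two of them also buy Z.
baskets = [{"X", "Y", "Z"}, {"X", "Y", "Z"}, {"X", "Y"}, {"W"}]
result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

In this toy data the pair {X, Y} is frequent, and Z co-occurs with it in two of the three baskets that contain the pair, which is the kind of correlation the X, Y, Z example refers to.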
Many clustering approaches, such as the Expectation Maximization (EM) approach or the K-means clustering algorithm, are known within the art. However, they typically operate on all the data within the database at a given time and require an initial specification of the full clustering model.
Once this model is specified, the approaches iteratively refine the initial model to maximize the fit of the clustering model with the data. A drawback to such approaches, however, is that the function measuring the fit of the clustering model to the data has many local solutions. The clustering approach can only guarantee convergence to a local solution, and not a globally optimal solution. Hence, many other better solutions may be missed simply because the initial model was not good enough.
Usually the number of local solutions is very large when fitting databases of even modest size to a model. Many of the local solutions are often unsatisfactory. Consider, for example, clustering high-dimensional sparse databases, that is, databases where each record (e.g. customer) specifies values for only a very small subset of all possible attributes or items (e.g. products). Of 100,000 products available in a store, each customer (record) usually purchases only 5 or 6 items. Another example is web browsing: out of millions of possible pages on the web, a typical user visits only a tiny fraction of them. In such a situation, there may exist many local clustering solutions that have empty clusters, which are clusters containing no records of data. Another property that makes the clustering problem difficult is a skewed distribution over the items (attributes): a particular item dominates most other items of the database, such that variance therefrom is infrequent. When the frequency of items drops off rapidly with rank, e.g. the frequency of the i-th most frequent item is proportional to 1/i, the data is said to obey a Zipf distribution. The Zipf distribution is a skewed distribution and is observed in web-browsing data, product-purchasing data, and text data.
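To make the skew of a Zipf distribution concrete, the following Python sketch normalizes weights proportional to 1/i over a hypothetical catalog of 1,000 items and measures how much of the total frequency the ten most frequent items account for. The catalog size and the function name `zipf_frequencies` are assumptions chosen only for illustration.

```python
def zipf_frequencies(n_items):
    """Relative frequencies where item i (1-based rank) has weight 1/i."""
    weights = [1.0 / i for i in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical catalog of 1,000 items ranked by popularity.
freqs = zipf_frequencies(1000)
top10_share = sum(freqs[:10])
print(f"share of the 10 most frequent items: {top10_share:.3f}")
print(f"ratio of rank 1 to rank 2 frequency: {freqs[0] / freqs[1]:.1f}")
```

Under these assumptions, 1% of the items carries roughly 39% of all occurrences, so most items appear in very few records.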
Given the above difficulties, a common approach to the problem of determining an initial clustering within the prior art is to search for good clustering solutions by re-running the clustering approach from many different random initial cluster models, in the hope of finding a good solution. However, there are no guarantees, nor probabilistic arguments, that a good clustering solution will be found without employing methods exponential in running time. Even re-running the clustering approach from many different initial cluster models is computationally prohibitive for large-scale databases. For example, given a database of even modest size, the time required for running a clustering approach from a single initial clustering model can take hours or even days. Applying the approach from many different randomly selected initial clustering models can thus take many days. And again, there is no guarantee that a good solution will be found.
In addition, the problem of clustering large databases is compounded since, even with a good initial cluster model, typical prior art clustering algorithms assume that the data set resides in main memory (each record can be accessed as many times as needed). Since the data of a large database usually cannot all fit into the memory of a computer at one time, constant disk accesses result, manifested by repeated scans or by paging of memory to disk (known in the art as "thrashing"), causing lengthy processing times to complete the clustering. For example, given a database of even modest size, the time required for running a clustering approach can take hours or even days. For this and other reasons, therefore, there is a need for the present invention: to efficiently determine cluster models by using error-tolerant frequent itemsets over large databases using an iterative validation procedure. These models may be used as the final clustering solution or used to provide "good" initial clustering models to a prior art clustering algorithm.
The invention relates to iterative validation clustering using error-tolerant frequent itemsets (denoted as ETF's). In one embodiment, a method first determines a sample set of ETF's within a uniform sample of data within a database. This sample set of ETF's is validated, which in one embodiment includes testing the sample set of ETF's against a validation random sample, so that, for example, spurious ETF's and spurious dimensions (attributes) within the ETF's are removed. The sample set of ETF's, as validated, is then added to the set of ETF's for the database, which is initially set to empty. This process is repeated with additional uniform samples that are mutually exclusive of the data satisfying the existing set of ETF's, to continue making new additions to the set of ETF's for the database, until no additional sample sets can be found.
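The sample, validate, add, and repeat loop summarized above might be sketched in Python as follows. This is only an illustrative sketch under several assumptions that this section does not specify: a record is treated as satisfying an itemset when it contains at least a (1 - eps) fraction of its items, a candidate is kept when at least a fraction kappa of a validation sample satisfies it, and the mining step is a hypothetical stand-in (`toy_miner`). All function and parameter names are invented for illustration, and the removal of spurious dimensions during validation is omitted.

```python
import random
from collections import Counter


def satisfies(record, etf, eps):
    """Assumed test: the record misses at most an eps fraction of the items."""
    return len(record & etf) >= (1 - eps) * len(etf)


def support(records, etf, eps):
    """Fraction of records that satisfy the itemset under the assumed test."""
    return sum(satisfies(r, etf, eps) for r in records) / len(records)


def toy_miner(sample):
    """Hypothetical stand-in for ETF mining: items in over half the sample."""
    counts = Counter(item for record in sample for item in record)
    frequent = frozenset(i for i, c in counts.items() if c > 0.5 * len(sample))
    return [frequent] if frequent else []


def iterative_etf_clustering(database, mine, sample_size, kappa, eps, rng):
    etfs = []  # the set of ETF's for the database starts out empty
    remaining = list(database)
    while True:
        # Only data that satisfies none of the existing ETF's is sampled.
        remaining = [r for r in remaining
                     if not any(satisfies(r, e, eps) for e in etfs)]
        if not remaining:
            break
        k = min(sample_size, len(remaining))
        candidates = mine(rng.sample(remaining, k))
        # Validate candidates against a separate random sample.
        validation = rng.sample(remaining, k)
        validated = [e for e in candidates
                     if e not in etfs and support(validation, e, eps) >= kappa]
        if not validated:
            break  # no additional ETF's can be found
        etfs.extend(validated)
    return etfs


# Toy database: 60 baskets of {a, b, c} and 40 baskets of {x, y}.
database = [frozenset("abc")] * 60 + [frozenset("xy")] * 40
etfs_found = iterative_etf_clustering(
    database, toy_miner, sample_size=100, kappa=0.3, eps=0.2,
    rng=random.Random(0))
print([sorted(e) for e in etfs_found])
```

In this toy run the first pass finds the dominant group, the records satisfying it are set aside, and the second pass finds the smaller group that the first sample's frequency threshold had hidden. In a realistic setting `sample_size` would be chosen so that each sample fits in memory.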
Embodiments of the invention provide for advantages not found within the prior art. For example, in one embodiment, the uniform sample of data is taken such that it can fit in the memory of the computer on which the method is implemented. For large databases especially, this greatly reduces the amount of time necessary to cluster the data, since computations at any one time are performed only on a sample of data, and thus are performed totally within memory. The multiple iterative samples over mutually exclusive data sets allow the algorithm to get a complete view of the data without ever having to load the entire data set at once. In addition, performance penalties that result from constantly retrieving data from disk (as is done by methods in the prior art) are eliminated. Finally, whereas the prior art has dealt exclusively with frequent itemsets, this invention introduces the novel definition and implementation of ETF's, a substantially more powerful generalization of frequent itemsets.
The invention includes computer-implemented methods, machine-readable media, computerized systems, and computers of varying scopes. Other aspects, embodiments and advantages of the invention, beyond those described here, will become apparent by reading the detailed description and with reference to the drawings.