This invention relates generally to the clustering of very large data sets, such as transactional, categorical, and binary data, and more particularly to such clustering using error-tolerant frequent item sets.
With the advent of the Internet, and especially electronic commerce ("e-commerce") over the Internet, the use of data analysis tools, such as data mining, has increased. In e-commerce and other Internet and non-Internet applications, databases are generated and maintained that have large amounts of information, so that they can be analyzed, or "mined," to learn additional information regarding customers, users, products, etc. That is, data analysis tools provide for leveraging the data already contained in databases to learn new insights regarding the data by uncovering patterns, relationships, or correlations.
A common data analysis operation is data clustering, which is also known within the art as database segmentation or data segmentation. Clustering seeks groups of records such that members of each cluster are more similar to records within the cluster than they are to records in other clusters. Clustering allows for frequent item sets within the data to be detected once the clusters are discovered. Frequent item sets are sets of items (combinations of attributes of entities or records) that occur with significant frequency in the data. Finding such frequent groups, such as groups of users, groups of purchasers, etc., instead of focusing on just the items themselves (products or web pages), allows for new insights to be obtained from the data. For example, a purchaser who has purchased items X and Y may be predicted as likely to purchase item Z, based on other purchasers of items X and Y also having purchased item Z. Thus, a retailer knowing this information may be prompted to advertise item Z directly to this purchaser, or perhaps to make a special offer or customized coupon for it, or even to discount one of the items and hope to make margin on the other correlated items in the cluster.
Many clustering approaches, such as the Expectation Maximization (EM) or the K-Means approaches known within the art, typically require an initial specification of the full clustering model. The initial specification may be randomly selected or generated by some other means. Once this initial model is specified, the approaches iteratively refine the initial model to maximize the fit of the clustering model to the data. A drawback to such approaches, however, is that the function measuring the fit of the clustering model to the data has many locally optimal solutions that are globally sub-optimal. The clustering approach can only guarantee convergence to a locally optimal solution, and not a globally optimal solution. Hence, many other better solutions may be missed simply because the initial model was not good enough. Thus, the initial model selection plays a large role in determining the quality of the solution obtained.
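By way of illustration only, the sensitivity to the initial model can be seen in a minimal sketch of a K-Means style refinement on one-dimensional data. The function name `kmeans_1d`, the six data points, and both sets of initial centers are assumptions chosen for this example and do not appear in the description above.

```python
def kmeans_1d(points, initial_centers, max_iter=100):
    """Lloyd-style K-Means on 1-D data; returns (centers, sum of squared errors)."""
    centers = [float(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(members) / len(members) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


# Three well-separated pairs of points (illustrative data).
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

good_centers, good_sse = kmeans_1d(points, [0.0, 10.0, 20.0])
poor_centers, poor_sse = kmeans_1d(points, [0.0, 1.0, 15.0])

print(good_centers, good_sse)  # [0.5, 10.5, 20.5] 1.5
print(poor_centers, poor_sse)  # [0.0, 1.0, 15.5] 101.0
```

Both runs converge, yet the second stops at a locally optimal solution whose error is far larger, purely because of where it started.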
Usually the number of local solutions is very large when fitting databases of even modest size to a model. Many of the local solutions are often unsatisfactory. One example arises when clustering high-dimensional sparse databases, that is, databases where each record (e.g., a customer) specifies values for only a very small subset of all possible attributes or items (e.g., products). For example, of 100,000 products available in a store, each customer (record) usually purchases only 5 or 6 items. Another example is web browsing: out of millions of possible pages on the web, a typical user visits only a tiny fraction of them. In such a situation, there may exist many local clustering solutions that have empty clusters, which are clusters containing no records of data. Another property that makes the clustering problem difficult is a skewed distribution over the items (attributes): a given item may dominate most of the other items in the database, such that variation from it is infrequent. When the frequency of items drops off sharply with rank, e.g. the frequency of item i is proportional to 1/i, the data is said to obey a Zipf distribution. The Zipf distribution is a skewed distribution and is observed in web-browsing data, product-purchasing data, text-data keyword count data, and many other sparse databases in practice.
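The sparsity and skew described above can be made concrete with a short calculation. The following sketch assumes frequencies exactly proportional to 1/i and uses the 100,000-product store from the example; the function name `zipf_frequencies` and the choice of six purchased items are illustrative assumptions.

```python
def zipf_frequencies(n_items):
    """Normalized frequencies where item i (1-based rank) is proportional to 1/i."""
    weights = [1.0 / rank for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


n_items = 100_000
freqs = zipf_frequencies(n_items)

# Skew: under this assumption the top-ranked item is ten times as
# frequent as the tenth-ranked item.
ratio_first_to_tenth = freqs[0] / freqs[9]

# Share of all occurrences accounted for by the ten most frequent items.
top_10_share = sum(freqs[:10])

# Sparsity: fraction of attributes specified by a record with 6 items.
density = 6 / n_items

print(ratio_first_to_tenth, top_10_share, density)
```

Under these assumptions, ten items out of 100,000 account for roughly a quarter of all occurrences, while a single record specifies values for only 0.006% of the attributes.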
Given the above difficulties, a common approach within the prior art is to search for good clustering solutions by re-running the clustering approach from many different random initial cluster models, in the hope of finding a good solution. However, there are no guarantees, nor probabilistic arguments, that a good clustering solution will be found without employing methods that are exponential in running time. Such methods are infeasible to apply to large databases. Even re-running the clustering approach from many different initial cluster models is computationally prohibitive for large-scale databases. For example, given a database of even modest size, running a clustering approach from a single initial clustering model can take hours or even days. Applying the approach from many different randomly selected initial clustering models can thus take many days. And again, there is no guarantee that a good solution will be found.
For this and other reasons, therefore, there is a need for the present invention.
The invention relates to data clustering using error-tolerant frequent item sets (denoted as ETFs). In one embodiment, a method first determines a plurality of weak error-tolerant frequent item sets, which are strongly tolerant of errors, and then determines a plurality of strong error-tolerant frequent item sets (ETFs) therefrom, which are less tolerant of errors. The resulting error-tolerant frequent item sets can be used as an initial model for a standard clustering approach such as the EM approach, or may themselves be used as the end clusters. Furthermore, in one embodiment, the data covered by the strong ETFs is removed from the data, and the process is repeated, until no more weak ETFs can be found therein. The data can be binary, categorical, or continuous data.
Embodiments of the invention provide for advantages not found within the prior art. Traditional frequent item sets are defined to be exact: they are supported only by the data records that have all the items appearing in the frequent item set. Efficient algorithms for discovering frequent item sets rely on this definition. The definition of frequent item sets is generalized herein to Error-Tolerant Frequent Item sets (ETFs), and an efficient algorithm is provided for computing them over very large databases. Relaxing the error tolerance of ETFs has been determined not only to lead to more general summaries of the data than traditional frequent item sets, but also to lead to initial clustering solutions that are much better than those of known methods for selecting initial clustering solutions. This in turn dramatically reduces the computational resources necessary to obtain improved clustering solutions. Furthermore, the chance of converging to a clustering solution with empty clusters, an important problem to avoid, has been found to be significantly reduced, while measures quantifying the fit of the clustering model to the data have been found to be maximized.
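The difference between exact support and error-tolerant support can be sketched as follows. This is only one plausible reading: the rule used here, under which a record supports an item set if it is missing at most a fraction `eps` of the set's items, is an assumption made for illustration and is not the definition given elsewhere in this description. The function names and the sample records are likewise hypothetical.

```python
def exact_support(records, itemset):
    """Fraction of records containing every item of the item set."""
    return sum(1 for r in records if itemset <= r) / len(records)


def tolerant_support(records, itemset, eps):
    """Fraction of records missing at most eps * |itemset| of its items.

    The per-record tolerance rule is an assumption made for this sketch.
    """
    max_missing = eps * len(itemset)
    return sum(1 for r in records if len(itemset - r) <= max_missing) / len(records)


# Illustrative transactions, each a set of purchased items.
records = [
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B", "D", "E"},
    {"A", "E"},
    {"B", "C", "D"},
]
itemset = {"A", "B", "C", "D"}

exact = exact_support(records, itemset)
tolerant = tolerant_support(records, itemset, eps=0.25)

print(exact)     # 0.2
print(tolerant)  # 0.8
```

With a tolerance of one missing item out of four, four of the five records support the item set instead of one, which illustrates how relaxing exactness yields a more general summary of sparse data.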