Clustering items according to some notion of similarity is a problem that arises in many applications. For example, clustering documents into groups of related documents is required for information retrieval applications, document analysis applications and other tasks. The items to be clustered may be documents, emails, web pages, advertisements, images, videos, or any other types of items. Clustering may also be referred to as categorizing or classifying.
Some previous approaches have involved supervised classification schemes. In these schemes, a significant portion of the items to be classified must be labeled manually in order to train a machine learning system to carry out the classification automatically. However, manually labeling a significant portion of the items is not practical for very large collections of items, such as those in web-scale applications.
Unsupervised clustering approaches are also known, whereby the clustering system is free to create whatever categories best fit the data. Examples of such approaches include k-means clustering and agglomerative clustering. However, many of these approaches do not scale well to large data sets (hundreds of thousands of items to be clustered into hundreds of clusters), in that the training times required are very long and/or the quality of the results is poor.
Another type of unsupervised clustering approach has involved forming a clustering model using a mixture of Bernoulli profiles and learning the optimal values of the model parameters using maximum likelihood methods. Such maximum likelihood methods include direct gradient ascent and expectation maximization (EM). However, such maximum likelihood methods require several passes over the data during training in order to converge, and so these approaches are not suitable for extremely large data sets. In these approaches initialization is crucial because the likelihood has multiple modes, but a good initialization is very difficult to achieve in applications involving high dimensional data.
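By way of illustration only, the following minimal Python sketch shows how expectation maximization may be applied to a mixture of Bernoulli profiles over binary feature vectors. The function name `em_bernoulli_mixture`, its parameters, the random initialization range and the small smoothing constant are assumptions made for this sketch and are not taken from any particular known system. The outer loop makes one full pass over the data per iteration, which is the cost referred to above.

```python
import math
import random


def em_bernoulli_mixture(data, n_clusters, n_passes=20, seed=0):
    """Fit a mixture of Bernoulli profiles to binary vectors using EM.

    Returns (weights, profiles, log_likelihoods), where log_likelihoods
    holds the data log-likelihood measured during each pass.
    """
    rng = random.Random(seed)
    n_features = len(data[0])
    eps = 1e-6  # hypothetical smoothing constant; keeps every log() finite

    # Random initialization. The result depends on it because the
    # likelihood has multiple modes.
    weights = [1.0 / n_clusters] * n_clusters
    profiles = [[rng.uniform(0.25, 0.75) for _ in range(n_features)]
                for _ in range(n_clusters)]
    log_likelihoods = []

    for _ in range(n_passes):  # each iteration is one full pass over the data
        resp_sums = [0.0] * n_clusters
        feature_sums = [[0.0] * n_features for _ in range(n_clusters)]
        total_ll = 0.0

        for x in data:
            # E-step: log joint probability of item x with each cluster.
            log_joint = []
            for k in range(n_clusters):
                lp = math.log(weights[k])
                for d, bit in enumerate(x):
                    p = profiles[k][d]
                    lp += math.log(p if bit else 1.0 - p)
                log_joint.append(lp)

            # Normalize with log-sum-exp to obtain responsibilities.
            m = max(log_joint)
            log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint))
            total_ll += log_norm
            for k in range(n_clusters):
                r = math.exp(log_joint[k] - log_norm)
                resp_sums[k] += r
                for d, bit in enumerate(x):
                    if bit:
                        feature_sums[k][d] += r

        # M-step: re-estimate mixing weights and Bernoulli profiles.
        for k in range(n_clusters):
            weights[k] = (resp_sums[k] + eps) / (len(data) + n_clusters * eps)
            for d in range(n_features):
                profiles[k][d] = ((feature_sums[k][d] + eps)
                                  / (resp_sums[k] + 2.0 * eps))

        log_likelihoods.append(total_ll)

    return weights, profiles, log_likelihoods
```

In this sketch every pass touches every item and every feature for every cluster, so the work per pass grows with the product of the number of items, clusters and features, and that work is repeated until convergence. Running the function with different values of `seed` may produce different final clusterings, reflecting the sensitivity to initialization noted above.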
The embodiments described herein are not limited to implementations which solve any or all of the disadvantages of known clustering systems.