The following relates to the informational, data storage, and related arts.
Clustering is a useful operation in the informational arts. A clustering operation receives a set of objects, such as documents, images, or so forth, and groups the objects according to similarity as measured by features of the objects.
A known clustering approach for objects is spectral clustering. In this approach, a similarity matrix is constructed in which matrix element (i,j) stores a similarity measure between object i and object j. The similarity matrix is symmetric, in that the value of matrix element (i,j) equals the value of matrix element (j,i). In some suitable formulations of the spectral clustering approach, the symmetric similarity matrix is decomposed using eigenvalue decomposition to generate a product of matrices A·D·Aᵀ, where D is a diagonal matrix whose diagonal elements are eigenvalues of the symmetric similarity matrix, A is a matrix whose columns are the corresponding eigenvectors, and the superscript “T” denotes a transpose operation. In the spectral clustering paradigm, the columns of the matrix A are interpreted to associate objects with clusters.
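By way of illustration only, the decomposition of a symmetric similarity matrix into the product A·D·Aᵀ can be sketched as follows. The Gaussian similarity measure, the one-dimensional sample points, and the function names (similarity_matrix, jacobi_eigh, reconstruct) are assumptions introduced for this sketch and are not part of any particular spectral clustering formulation. The cyclic Jacobi method is used here only because it needs nothing beyond the standard library; a numerical library routine would ordinarily be used instead.

```python
import math

def similarity_matrix(points, scale=1.0):
    """Gaussian similarity between 1-D points (a hypothetical similarity measure)."""
    return [[math.exp(-((x - y) ** 2) / (2.0 * scale ** 2)) for y in points]
            for x in points]

def is_symmetric(S, tol=1e-12):
    """Check that element (i,j) equals element (j,i) for all i, j."""
    n = len(S)
    return all(abs(S[i][j] - S[j][i]) <= tol for i in range(n) for j in range(n))

def jacobi_eigh(S, max_sweeps=50, tol=1e-24):
    """Eigendecomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, A) such that S is approximately A * diag(eigenvalues) * A^T.
    """
    n = len(S)
    M = [row[:] for row in S]  # working copy, driven towards diagonal form
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(M[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if M[p][q] == 0.0:
                    continue
                # Rotation angle in [-pi/4, pi/4] that zeroes element (p,q).
                diff = M[q][q] - M[p][p]
                sign = -1.0 if diff < 0 else 1.0
                theta = 0.5 * math.atan2(sign * 2.0 * M[p][q], sign * diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # M <- M * J (columns p and q)
                    mkp, mkq = M[k][p], M[k][q]
                    M[k][p] = c * mkp - s * mkq
                    M[k][q] = s * mkp + c * mkq
                for k in range(n):  # M <- J^T * M (rows p and q)
                    mpk, mqk = M[p][k], M[q][k]
                    M[p][k] = c * mpk - s * mqk
                    M[q][k] = s * mpk + c * mqk
                for k in range(n):  # A <- A * J accumulates the eigenvectors
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
    return [M[i][i] for i in range(n)], A

def reconstruct(eigenvalues, A):
    """Form the product A * D * A^T, with D = diag(eigenvalues)."""
    n = len(A)
    return [[sum(A[i][k] * eigenvalues[k] * A[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]

points = [0.0, 0.2, 0.4, 5.0, 5.3]  # two loose groups of hypothetical objects
S = similarity_matrix(points)
eigenvalues, A = jacobi_eigh(S)
R = reconstruct(eigenvalues, A)
max_error = max(abs(R[i][j] - S[i][j]) for i in range(len(S)) for j in range(len(S)))
print(is_symmetric(S), max_error < 1e-9)
```

How the columns of A are then read as cluster assignments depends on the particular formulation, and is not shown in this sketch.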
The spectral clustering approach has certain disadvantages. The requirement that the similarity matrix be symmetric may not be satisfied for certain types of similarity measures. The eigenvalue decomposition processing is also computationally expensive, especially for large-scale clustering involving objects numbering in the thousands, tens of thousands, or more. Further, spectral clustering is strongly dependent upon the number of clusters, which must be selected a priori.
Another approach related to clustering is probabilistic latent semantic analysis (PLSA), which has been applied to text documents. In PLSA, text documents are represented as histograms of word occurrences, sometimes called “bag of words” representations. The histogram representing each document can be thought of as a vector having vector elements corresponding to words and vector element values being the counts of the word occurrences. For example, if the vector element indexed i=5 corresponds to the word “cat”, and a given document has fifty-five occurrences of the word “cat”, then in the bag-of-words representation of the document the vector element i=5 has the value fifty-five. Documents relating to similar topics are expected to have similar word distributions, i.e. similar bag-of-words representations. Clusters are also represented by bag-of-words representations for which the vector element values can be thought of as being indicative of the expected counts of word occurrences for documents belonging to the cluster. The PLSA approach employs an iterative expectation-maximization (E-M) algorithm to optimize the clusters, wherein the bag-of-words representation of each cluster is computed from the counts of word occurrences belonging (in a probabilistic sense) to that cluster. PLSA and Non-Negative Matrix Factorization were discovered independently: PLSA corresponds to the Maximum Likelihood Estimation of a probabilistic model, whereas the Non-Negative Matrix Factorization algorithm has a geometric interpretation in terms of minimization of the Kullback-Leibler divergence between the word co-occurrence matrix and a low-rank non-negative matrix. Both interpretations lead to the same algorithm, and the two are treated as equivalent in the following.
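As a minimal sketch of the bag-of-words representation and of an E-M iteration of the kind described above, the following may be considered. The vocabulary, the four sample documents, the function names, and the particular parameterization (a word distribution P(w|z) per cluster and a cluster distribution P(z|d) per document) are assumptions made for illustration. The sketch further assumes that every document contains at least one vocabulary word, and it is not intended as a definitive PLSA implementation.

```python
import math
import random

def bag_of_words(text, vocabulary):
    """Count occurrences of each vocabulary word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def log_likelihood(counts, p_w_z, p_z_d):
    """Log-likelihood of the word counts under the mixture model."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                ll += n * math.log(p)
    return ll

def plsa(counts, n_clusters, n_iter=30, seed=0):
    """Fit P(w|z) and P(z|d) by E-M; returns both plus the likelihood history."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Strictly positive random initialization (an assumption of this sketch).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_clusters)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_clusters)])
             for _ in range(n_docs)]
    history = [log_likelihood(counts, p_w_z, p_z_d)]
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_clusters)]
        new_z_d = [[0.0] * n_clusters for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: probability that each cluster produced this word in this document.
                post = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_clusters)])
                # M-step accumulation: share the word count among the clusters.
                for z in range(n_clusters):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_likelihood(counts, p_w_z, p_z_d))
    return p_w_z, p_z_d, history

vocabulary = ["cat", "dog", "pet", "stock", "bond", "market"]
documents = [
    "cat cat dog pet",
    "dog cat pet pet",
    "stock bond market",
    "market stock stock bond",
]
counts = [bag_of_words(doc, vocabulary) for doc in documents]
p_w_z, p_z_d, history = plsa(counts, n_clusters=2)
for doc, dist in zip(documents, p_z_d):
    print(doc, [round(p, 3) for p in dist])
```

In this sketch each row of p_w_z plays the role of the bag-of-words representation of a cluster, normalized to a probability distribution, and the log-likelihood recorded in history should not decrease from one iteration to the next.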
The Non-Negative Matrix Factorization approach for clustering text-based documents is not readily generalized to objects of arbitrary type, since it relies upon counts of word occurrences at two levels. At the vector representation level, each document and each cluster is represented by word counts. At the functional level, the effectiveness of the clustering is reliant upon the word count histograms being “distinctive” for different clusters. Advantageously, documents relating to distinctly different subjects typically use distinctly different subject matter-related vocabularies. As a result, the bag-of-words representations are typically sparse vectors having high word counts for words of the subject matter vocabulary, and having zero or very low word counts for words that are not part of the subject matter vocabulary. Feature sets used to characterize objects of types other than textual documents sometimes do not have such a high level of sparsity.