The following relates to the information processing, clustering, density estimation, and related arts.
Two common tasks in information processing are clustering of a set of N objects into K clusters, and density estimation.
In clustering, one has a group of objects each characterized by a set of features (for example, suitably represented as a features vector), and it is desired to divide the objects into K different groups, classes, or clusters. In some approaches, the clustering problem is represented as an optimization problem, in which a log-likelihood function of the form:
$$\Theta = \sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} w_k \, p_{k,n} \right) \tag{1}$$
is maximized with respect to the weight parameters $w_k$, $k=1,\ldots,K$, subject to the limits:
$$w_k \geq 0 \quad \forall k = 1,\ldots,K \tag{2}$$
and further subject to the normalization condition:
$$\sum_{k=1}^{K} w_k = 1. \tag{3}$$
A log-likelihood function such as that of Equation (1) subject to the constraints of Equations (2) and (3) is known to be a concave function, and hence the whole optimization problem maximizing (1) under the constraints (2) and (3) is a convex optimization problem. Therefore, the solution of the problem is unique, which simplifies maximization by avoiding problems due to the presence of problematic local (that is, non-global) maxima. Moreover, some optimization problems formulated as log-likelihood function maximization can be configured to be sparse, meaning that only a small number of the $w_k$ parameters are non-zero, a condition which promotes computational efficiency.
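By way of illustrative example, the evaluation of Equation (1) together with the constraints of Equations (2) and (3) may be sketched as follows. This is a minimal sketch rather than a definitive implementation: the function name `log_likelihood`, the nested-list layout `p[k][n]`, and the toy values are assumptions introduced here for illustration only.

```python
import math

def log_likelihood(weights, p):
    """Evaluate Equation (1): sum over n of log(sum over k of w_k * p[k][n]).

    `weights` is a list of K weights; `p` is a K-by-N nested list
    (hypothetical layout chosen for this sketch).
    """
    # Enforce the constraints of Equations (2) and (3) before evaluating.
    if any(w < 0.0 for w in weights):
        raise ValueError("weights must be non-negative (Equation (2))")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1 (Equation (3))")
    num_objects = len(p[0])
    total = 0.0
    for n in range(num_objects):
        # Inner sum over the K candidates for object n.
        mixture = sum(w * p_k[n] for w, p_k in zip(weights, p))
        total += math.log(mixture)
    return total

# Hypothetical toy values: K = 2 candidates, N = 3 objects.
p = [[0.9, 0.8, 0.1],
     [0.1, 0.2, 0.7]]
uniform = log_likelihood([0.5, 0.5], p)
skewed = log_likelihood([1.0, 0.0], p)
```

In this toy example the uniform weights yield a larger value of Θ than placing all of the weight on the first candidate, since the third object is poorly explained by the first candidate alone.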
In a clustering application, the index $n=1,\ldots,N$ indexes N objects in a dataset and the index $k=1,\ldots,K$ indexes K candidate cluster centroids. The candidate cluster centroids may be a subset of the objects to be clustered (K<N), the whole set of objects to be clustered (K=N), a disjoint set of objects, or a mix of the objects to be clustered and of objects belonging to a disjoint set. The parameters $p_{k,n}$ represent the probability that the nth object has been generated by the kth cluster. For example, in one generic formulation $p_{k,n} \propto \exp(-\gamma \|o_n - c_k\|^2)$ may be suitable, where $o_n$ represents the location of the nth object in a vector space (for example, the features vector space), $c_k$ represents the location of the kth candidate cluster centroid in the vector space, $\|\cdot\|$ represents a suitable distance measure in the vector space, and γ is a non-negative parameter. In a clustering application, K different candidate clusters $c_k$ are defined and the log-likelihood function Θ of Equation (1) is maximized with respect to the weight parameters $w_k$, $k=1,\ldots,K$. Once the optimal $w_k$, $k=1,\ldots,K$, have been identified, the clusters for which the weight parameters $w_k$ are strictly positive numbers are well identified clusters, whereas if $w_k=0$, the kth cluster is discarded from the set of candidate clusters. Each object indexed by i, $i=1,\ldots,N$, is assigned (in a probabilistic sense) to one or more of the clusters $k=1,\ldots,K$ using the formula
$$a_{k,i} = \frac{w_k \, p_{k,i}}{\sum_{k'=1}^{K} w_{k'} \, p_{k',i}}$$
such that the objects are optimally distributed amongst the clusters.
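The affinity and assignment computations described above may be sketched as follows. This is an illustrative sketch only: it assumes a squared Euclidean distance as the "suitable distance measure", and the function names `affinities` and `soft_assignments`, the toy coordinates, and the weight values are hypothetical.

```python
import math

def affinities(objects, centroids, gamma):
    """Unnormalized p[k][n] = exp(-gamma * ||o_n - c_k||^2).

    Assumes squared Euclidean distance; other distance measures
    could be substituted.
    """
    return [[math.exp(-gamma * sum((o - c) ** 2 for o, c in zip(obj, cen)))
             for obj in objects]
            for cen in centroids]

def soft_assignments(weights, p):
    """a[k][i] = w_k * p[k][i] / sum over k' of w_k' * p[k'][i]."""
    num_objects = len(p[0])
    a = [[0.0] * num_objects for _ in weights]
    for i in range(num_objects):
        denom = sum(w * p_k[i] for w, p_k in zip(weights, p))
        for k, w in enumerate(weights):
            a[k][i] = w * p[k][i] / denom
    return a

# Hypothetical toy data: N = 3 objects, K = 3 candidate centroids.
objects = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
# Hypothetical optimized weights; the third candidate has w_k = 0
# and is therefore discarded.
weights = [0.6, 0.4, 0.0]
p = affinities(objects, centroids, gamma=1.0)
a = soft_assignments(weights, p)
```

For each object the assignments sum to one over the K candidates, and a candidate whose weight is zero receives no share of any object.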
Density estimation is an application closely related to clustering. In density estimation, it is desired to estimate a Probability Density Function (PDF) that is representative of the distribution of a group of objects or data points. In some density estimation approaches, the PDF is represented as a linear combination of K constituent functions. In these approaches, a log-likelihood function of the form given in Equation (1) is again used, but here with the interpretation that the parameters $p_{k,n}$ represent the degree to which the nth object or data point lies within the kth PDF component, and the weight parameters $w_k$, $k=1,\ldots,K$, are the relative weights of the K constituent PDF components in the linear combination. By maximizing the log-likelihood function Θ of Equation (1) with respect to the weight parameters $w_k$, $k=1,\ldots,K$, the PDF defined by the linear combination is optimized to best represent the distribution of the N objects or data points.
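As an illustrative example of a PDF formed as a linear combination of K constituent functions, the following sketch assumes one-dimensional Gaussian constituent functions sharing a common width. The choice of Gaussian components, the names `gaussian_pdf` and `mixture_pdf`, and the numerical values are assumptions made for illustration and are not prescribed by the foregoing.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """One-dimensional Gaussian density (assumed constituent function)."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sigma):
    """Linear combination of the K constituent functions, weighted by w_k."""
    return sum(w * gaussian_pdf(x, m, sigma) for w, m in zip(weights, means))

# Hypothetical weights and component locations (K = 2).
weights = [0.7, 0.3]
means = [0.0, 4.0]
sigma = 1.0

# Under this assumption, p[k][n] of Equation (1) would be the kth
# constituent function evaluated at the nth data point.
data = [-0.2, 0.3, 3.9]
p = [[gaussian_pdf(x, m, sigma) for x in data] for m in means]

# Riemann-sum check over [-10, 14] that the combination behaves as a density.
step = 0.01
area = sum(mixture_pdf(-10.0 + i * step, weights, means, sigma) * step
           for i in range(2401))
```

Because the weights are non-negative and sum to one, and each constituent function is itself a density, the linear combination integrates to approximately one over the sampled range.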
While clustering and density estimation are two useful applications of the log-likelihood function Θ of Equation (1), numerous other applications exist. For example, log-likelihood functions find application in information entropy-related problems, maximum likelihood problems, and so forth.
Accordingly, there is substantial technological value in developing computationally efficient methods for maximizing log-likelihood functions. A commonplace approach for maximizing a log-likelihood function is the iterative expectation-maximization (EM) algorithm. However, the speed of convergence of EM for log-likelihood maximization is relatively slow. Convergence speed can be enhanced by setting to zero any $w_k$ falling below a selected threshold (such as below $10^{-3}/N$). See, e.g., Lashkari et al., “Convex Clustering with Exemplar-Based Models”, NIPS (2007) (available at http://people.csail.mit.edu/polina/papers/LashkariGolland_NIPS07.pdf, last accessed Aug. 14, 2008), which is incorporated herein by reference in its entirety. However, the EM convergence is still relatively slow even with this enhancement. Other approaches for log-likelihood function maximization include various least-squares optimization techniques such as gradient-based approaches. However, these techniques typically also suffer from various deficiencies such as slow convergence, computational complexity, and so forth when applied to log-likelihood maximization.
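For reference, the EM iteration for the weights of Equation (1) with fixed $p_{k,n}$ may be sketched as follows, using the standard update in which each weight is replaced by the average over the objects of its soft assignment. The sketch is illustrative only: the function names, the toy one-dimensional data, the fixed iteration count, and the renormalization applied after thresholding are assumptions of this example, not details taken from the cited reference.

```python
import math

def log_likelihood(w, p):
    """Equation (1) for weights w and a K-by-N nested list p."""
    return sum(math.log(sum(w[k] * p[k][n] for k in range(len(w))))
               for n in range(len(p[0])))

def em_weights(p, num_iters=200, threshold=None):
    """EM updates for the weights only; the p[k][n] values stay fixed."""
    num_candidates, num_objects = len(p), len(p[0])
    w = [1.0 / num_candidates] * num_candidates  # uniform start
    for _ in range(num_iters):
        new_w = [0.0] * num_candidates
        for n in range(num_objects):
            denom = sum(w[k] * p[k][n] for k in range(num_candidates))
            for k in range(num_candidates):
                # Accumulate the soft assignment of object n to candidate k.
                new_w[k] += w[k] * p[k][n] / denom
        w = [v / num_objects for v in new_w]
        if threshold is not None:
            # Zero out small weights, then renormalize (assumption of this
            # sketch) so that Equation (3) continues to hold.
            w = [0.0 if v < threshold else v for v in w]
            total = sum(w)
            w = [v / total for v in w]
    return w

# Hypothetical toy data: every object is also a candidate centroid (K = N).
points = [0.0, 0.2, 0.4, 5.0, 5.2]
gamma = 1.0
p = [[math.exp(-gamma * (x - c) ** 2) for x in points] for c in points]

w_plain = em_weights(p)
w_sparse = em_weights(p, threshold=1e-3 / len(points))
```

Each plain EM update keeps the weights non-negative and summing to one and does not decrease Θ, so the constraints of Equations (2) and (3) need no separate enforcement; the number of iterations required in practice is the drawback noted above.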