Clustering is a fundamental unsupervised learning methodology for data mining and machine learning. Additionally, clustering can be used in supervised learning for building non-parametric models (Beecks et al., 2011). The K-means algorithm (MacQueen, 1967) is one of the most widely used clustering algorithms because of its simple yet generally accepted objective function for optimization. Though several variants of K-means have evolved recently, a fundamental limitation of K-means and those variants is that they only apply to the Euclidean space.
In many emerging applications, however, the data to be processed are not vectors. In particular, bags of weighted vectors (e.g., bags of words, bags of visual words, sparsely represented histograms), which can be formulated as discrete distributions with finite but arbitrary supports, are often used as object descriptors (Chen and Wang, 2004; Pond et al., 2010). The discrete distribution clustering algorithm, also known as the D2-clustering (Li and Wang, 2008), tackles such type of data by optimizing an objective function defined in the same spirit as that used by K-means. Specifically, it minimizes the sum of squared distances between each object and its nearest centroid. D2-clustering is implemented and applied in a real-time image annotation system named ALIPR (Li and Wang, 2008).
There are several ways to measure the closeness between two discrete distributions, such as Kullback-Leibler divergence (Cover and Thomas, 2012) and L1 distance (Batu et al., 2013). Most metrics simply treat discrete distributions as regular histograms (after quantization if necessary) and measure their differences only by comparing the aligned bins among histograms. When the supports of discrete distributions are at arbitrary locations and of variable sizes, these metrics cannot effectively measure the distance because the supports of two distributions may not be aligned. In such situations, which are prevalent in applications using the bag-of-word representation, the transportation metric is useful because it measures the similarity by solving an optimal matching problem known as the transportation problem. The problem was initially formulated by French mathematician Monge (1781) and later by Russian mathematician Kantorovich (1942). The metric is known as the Kantorovich-Wasserstein metric. It also has several variants, as well as names, in part due to its wide application in different fields. Readers can refer to the book by Rachev and Ruschendorf (1998) for some of these applications. For example, the metric is better known as the earth mover's distance (EMD) (Rubner et al., 1998) in computer vision and the Mallows distance in statistics literature (Mallows, 1972). To simplify the discussion, hereafter we refer to the Kantorovich-Wasserstein metric as the Mallows distance.
The Mallows distance is robust to quantization noise and efficient with sparse representations. It is proved to be an effective metric for image retrieval and annotation. The D2-clustering algorithm is constructed based upon this distance and has been applied successfully to image annotation. In addition, this distance is also adopted in multiple data mining and machine learning applications such as video classification and retrieval (Xu and Chang, 2008), sequence categorization (Pond et al., 2010), and document retrieval (Wan, 2007), where objects are represented by discrete distributions. As a fundamental machine learning tool to group discrete distributions with Mallows distance, D2-clustering can be highly valuable in these areas.
A major concern in applying D2-clustering is its scalability. Because linear programming is needed in D2-clustering and the number of unknown variables grows with the number of objects in a cluster, the computational cost would normally increase polynomially with the sample size (assuming the number of clusters is roughly fixed). As a result, on a typical 2 GHz single-core CPU, ALIPR takes several minutes to learn each semantic category by performing D2-clustering on 80 images in it, but demands more than a day to complete the modeling of a semantic category containing thousands of training images. There are some alternative ways (Julien and Saitta, 2008; Yavlinsky et al., 2005) to perform the clustering faster, but they generally perform some indirect or sub-optimal clustering and hence lack the optimization properties of the D2-clustering.
The clustering problems researchers tackle today are often of very large scale. For instance, we are seeking to perform learning and retrieval on images from the Web containing millions of pictures annotated by one common word. This leads to demanding clustering tasks even if we perform learning only on a tiny fraction of the Web. Adopting bags of weighted vectors as descriptors, we can encounter the same issue in other applications such as video, biological sequence, and text clustering, where the amount of data can be enormous. In such cases, D2-clustering is computationally impractical.
Given the application potential of D2-clustering and its theoretical optimization advantages, it is of great interest to develop a variant of the algorithm with a substantially lower computational complexity, especially under a parallel computing environment. With the increasing popularity of large-scale cloud computing in real-world information systems, we are well motivated to exploit parallel computing.
K-Means Clustering
K-means is one of the most widely applied clustering algorithms on vectors. Given a set of N vectors {right arrow over (v)}1, {right arrow over (v)}2, . . . , {right arrow over (v)}N, K-means groups them into k clusters with centroids {right arrow over (z)}1, {right arrow over (z)}2, . . . , {right arrow over (z)}k, Each vector {right arrow over (v)}i is assigned with a clustering label C(i)=j, jε{1, 2, . . . , k} if vi belongs to the j-th cluster. The objective of K-means clustering is to minimize the total within-cluster dispersionε=Σj=1kΣi:C(i)=j∥{right arrow over (v)}i−{right arrow over (z)}j∥2.  (1)
Both the labeling C(•) and the centroids {right arrow over (z)}j's are involved in the optimization. Because they interlace with each other in the optimization, they are iteratively updated in two stages. Within each iteration, C(i) is updated by assigning each {right arrow over (v)}i with the label of its closest cluster centroid and then the centroid {right arrow over (z)}j for each cluster is updated by averaging the vectors attached to the cluster. For each of the two stages the updating decreases the objective in Eq. (1) monotonically. The iteration terminates when the objective stabilizes below a certain value, as defined in the stopping criteria.
Discrete Distributions and Mallows Distance
A discrete distribution gives the probability masses of a random variable at values within a finite set of range. It can be formulated by a bag of value-probability tuples at all possible values of the random variable. For instance, if a random variable v takes value at v(1), v(2), . . . , v(t) with probability pv(1)pv(2) . . . pv(t) (Σα=1tpv(α)=1), the discrete distribution is written asv={(v(1),pv(1)), . . . ,(v(t),pv(t))}.
Given another discrete distribution u={(u(1), pu(1)),(u(1),pu(1)}, . . . ,(u(t),pu(t)), the Mallows distance D(v,u) between v and u is solved by the following linear program:
                                          D            ⁡                          (                              u                ,                v                            )                                =                                    (                                                min                                      w                                          α                      ,                      β                                                                      ⁢                                                      ∑                                          α                      =                      1                                        t                                    ⁢                                                                          ⁢                                                            ∑                                              β                        =                        1                                                                    τ                        ′                                                              ⁢                                                                                  ⁢                                                                  ω                                                  α                          ,                          β                                                                    ⁢                                                                        d                          2                                                ⁡                                                  (                                                                                    V                                                              (                                α                                )                                                                                      ,                                                          u                                                              (                                β                                )                                                                                                              )                                                                                                                                )                                      1              2                                      ,                            (        2        )            subject to:Σα=1tωα,β=pu(β),∇β=1, . . . ,t′, Σβ=1t′ωα,β=pv(α),∇α=1, . . . ,t, wα,β≧0,∇α=1, . . . ,t,∇β=1, . . . ,t′. 
In Eq. (2), ωα,β is the matching weights between the α-th element in v and β-th element in u. We denote d(v(α),u(β)) as the distance between v(α) and u(β) in the sample space of the random variable. If the sample space is a Euclidean space, L2 norm is normally adopted as the measure, i.e., d(v(α),u(β))=∥v(α)−u(β)∥. Hereafter we assume that the random variable of a discrete distribution is sampled from an Euclidean space unless otherwise noted. In fact, the metric used by the sample space is not essential for the Mallows distance because what make the computation elaborate are the matching weights solved in the linear programming.
D2-Clustering
D2-clustering is a clustering algorithm for any collection of discrete distributions within the same sample space. Working in the same spirit as K-means, it also aims at minimizing the total within-cluster dispersion. Because the distances between discrete distributions are calculated using Mallows distance, the objective function to be minimized is the total squared Mallows distance between the data and the centroids.
Suppose there are N discrete distributions and k centroids, we denote the discrete distributions as the objective is then formulated asv1={(vi(1),pvi(1)), . . . ,(vi(ti),pvi(ti))}, i=1,2, . . . ,N, and centroids aszj={(zj(1),pzi(1)), . . . ,(zj(sj),pzj(sj))}, j=1, . . . ,k, the objective is then formulated asε=Σj=1kΣi:C(i)=jD2(zj,vi).
Although this optimization criterion is used by K-means, the computation in D2-clustering is much more sophisticated due to the nature of the Mallows distance, which is chosen because of its advantageous properties. The D2-clustering algorithm is analogous to K-means in that it iterates the update of the partition and the update of the centroids. There are two stages within each iteration. First, the cluster label of each sample is assigned to be the label of its closest centroid; and second, centroids are updated by minimizing the within-cluster sum of squared distances. To update the partition in D2-clustering, we need to compute the Mallow's distance between a sample and each centroid. To update the centroids, an embedded iterative procedure is needed which involves a large-scale linear programming problem in every iteration. The number of unknowns in the optimization grows with the number of samples in the cluster.
To update a certain centroid zj for a cluster of discrete distributions v1, v2, . . . , vnj attached to it, the following optimization is pursued.
                                          min                          z              j                                ⁢                                    ∑              i                              n                j                                      ⁢                                                  ⁢                                          D                2                            ⁡                              (                                                      z                    j                                    ,                                      v                    i                                                  )                                                    ⁢                                  ⁢        where        ⁢                                  ⁢                                            z              j                        =                          {                                                (                                                            z                      j                                              (                        1                        )                                                              ,                                          p                                              z                        j                                                                    (                        1                        )                                                                              )                                ,                …                ⁢                                                                  ,                                  (                                                            z                      j                                              (                                                  s                          j                                                )                                                              ,                                          p                                              z                        j                                                                    (                                                  s                          j                                                )                                                                              )                                            }                                ,                                          ⁢                                    v              i                        =                          {                                                (                                                            v                      i                                              (                        1                        )                                                              ,                                          p                                              v                        i                                                                    (                        1                        )                                                                              )                                ,                …                ⁢                                                                  ,                                  (                                      v                    i                                          (                                              t                        i                                            )                                                        )                                ,                                  p                                      v                    i                                                        (                                          t                      i                                        )                                                              }                                ,                                          ⁢                                                    D                2                            ⁡                              (                                                      z                    j                                    ,                                      v                    i                                                  )                                      =                                                            min                                      ω                                          α                      ,                      β                                                              (                      i                      )                                                                      ⁢                                                      ∑                                          α                      =                      1                                                              s                      j                                                        ⁢                                                                          ⁢                                                            ∑                                              β                        =                        1                                                                    t                        i                                                              ⁢                                                                                  ⁢                                                                                                                                                z                            j                                                          (                              α                              )                                                                                -                                                      v                            i                                                          (                              β                              )                                                                                                                                                  2                                                                                  ⁢                                                          ⁢                              subject                ⁢                                                                  ⁢                to                ⁢                                  :                                            ⁢                                                          ⁢                                                                    p                    z                                          (                      α                      )                                                        ≥                  0                                ,                                  α                  =                  1                                ,                …                ⁢                                                                  ,                                  s                  j                                ,                                                                            ∑                                              α                        =                        1                                                                    s                        j                                                              ⁢                                                                                  ⁢                                          p                                              z                        i                                                                    (                        α                        )                                                                              =                  1                                ,                                                                  ⁢                                                                            ∑                                              β                        =                        1                                                                    t                        i                                                              ⁢                                                                                  ⁢                                          ω                                              α                        ,                        β                                                                    (                        i                        )                                                                              =                                      p                                          z                      j                                                              (                      α                      )                                                                      ,                                  i                  =                  1                                ,                …                ⁢                                                                  ,                                  n                  j                                ,                                  α                  =                  1                                ,                …                ⁢                                                                  ,                                  s                  j                                ,                                                                  ⁢                                                                            ∑                                              r                        =                        1                                                                    s                        j                                                              ⁢                                                                                  ⁢                                          ω                                              α                        ,                        β                                                                    (                        i                        )                                                                              =                                      p                                          v                      i                                                              (                      β                      )                                                                      ,                                  i                  =                  1                                ,                …                ⁢                                                                  ,                                  n                  j                                ,                                                                  ⁢                                  β                  =                  1                                ,                …                ⁢                                                                  ,                                  t                  i                                ,                                  α                  =                  1                                ,                …                ⁢                                                                  ,                                                                            s                      j                                        .                                                                                  ⁢                                          ω                                              α                        ,                        β                                                                    (                        i                        )                                                                              >                  0                                ,                                  i                  =                  1                                ,                …                ⁢                                                                  ,                                  n                  j                                ,                                  α                  =                  1                                ,                …                ⁢                                                                  ,                                  s                  j                                ,                                  β                  =                  1                                ,                …                ⁢                                                                  ,                                                      t                    i                                    .                                                                                        (        3        )            
Because zj is not fixed in Eq. (3), we cannot separately optimize the Mallows distances D(zi,vi), i=1, . . . nj, because zj is involved and the variables ωα,β(i) for different i's affect each other through zj. To solve the complex optimization problem, the D2-clustering algorithm adopts an iterative process as shown in Algorithm 1.
ALGORITHM 1: Centroid Update Process of D2-ClusteringInput: A collection of discrete distributions v1, v2, . . . ,vnj that belong tocluster j.Output: The centroid zj of the input discrete distributions in terms ofMallows distance.begin| while objective εj (defined in (3)) does not reach convergence do| | Fix zj(α)'s and update pzj(α)'s and wα,β(i)'s in (3). In this case the | | optimization reduces to a linear programming.| | Fix pzj(α)'s and wα,β(i)'s and update zj(α), α = 1, . . . , sj. In this| | case the solution of zj(α)'s to optimize (3) are simply weighted| | averages:| |  | |  | |  | |  | |  | |  | |             z      j              (        α        )              =                            ∑                      i            =            1                                n            j                          ⁢                              ∑                          β              =              1                                      l              i                                ⁢                                    w                              α                ,                β                                            (                i                )                                      ⁢                          v              i                              (                β                )                                                                          ∑                      i            =            1                                n            j                          ⁢                              ∑                          β              =              1                                      l              i                                ⁢                      w                          α              ,              β                                      (              i              )                                            ,      α    =    1    ,      …    ⁢                  ⁢                  s        j            .      | | Calculate the objective function εj in (3) using the updated zj(α)'s,| | pzj(α)'s and wα,β(i)'s.| endend
Inside the “while” loop of Algorithm 1, the linear programming in line 1 is the primary cost in the centroid update process. It solves pzj(α),ω2,β(i), α=1, . . . ,sj, i=1, . . . , nj, β=1, . . . , ti, a total of sj+sjΣt=1n_jti≅njsj2 parameters when sj is set to be the average of the ti's. The number of constraints is 1+njsj+Σ1njti≅2njsj. Hence, the number of parameters in the linear program grows linearly with the size of the cluster. Because it costs polynomial time to solve the linear programming problem (Spielman and Teng, 2004), D2-clustering has a much higher computational complexity than K-means algorithm.
The analysis of the time complexity of K-means remains an unsolved problem because the number of iterations for convergence is difficult to determine. Theoretically, the worst case time complexity for K-means is Ω(2wn)(Arthur and Vassilvitskii, 2006). Arthur et al. (2011) show recently that K-means has polynomial smoothed complexity, which reflects the fact that K-means converges in a relatively short time in real cases (although no tight upper or lower bound of time complexity has been proved). In practice, K-means usually converges fast, and seldom costs an exponential number of iterations. In many cases K-means converges in linear or even sub-linear time (Duda et al., 2012). Kumar et al. (2010) present a random sampling and estimation approach to guarantee K-means to compete in linear time. Although there is still a gap between the theoretical explanation and practice, we can consider K-means an algorithm with at most polynomial time.
Although the D2-clustering algorithm interlaces the update of clustering labels and centroids in the same manner as K-means, the update of centroids is much more complicated. As described above, to update each centroid involves an iterative procedure where a large-scale linear programming problem detailed in (3) has to be solved in every round. This makes D2-clustering polynomially more complex than K-means in general.