Clustering is a fundamental unsupervised learning methodology for data mining and machine learning. Additionally, clustering can be used in supervised learning for building non-parametric models (Beecks et al., 2011). The K-means algorithm (MacQueen, 1967) is one of the most widely used clustering algorithms because of its simple yet generally accepted objective function for optimization. Though several variants of K-means have evolved recently, a fundamental limitation of K-means and those variants is that they only apply to the Euclidean space.
In many emerging applications, however, the data to be processed are not vectors. In particular, bags of weighted vectors (e.g., bags of words, bags of visual words, sparsely represented histograms), which can be formulated as discrete distributions with finite but arbitrary supports, are often used as object descriptors (Chen and Wang, 2004; Pond et al., 2010). The discrete distribution clustering algorithm, also known as the D2-clustering (Li and Wang, 2008), tackles such type of data by optimizing an objective function defined in the same spirit as that used by K-means. Specifically, it minimizes the sum of squared distances between each object and its nearest centroid. D2-clustering is implemented and applied in a real-time image annotation system named ALIPR (Li and Wang, 2008).
There are several ways to measure the closeness between two discrete distributions, such as Kullback-Leibler divergence (Cover and Thomas, 2012) and L1 distance (Batu et al., 2013). Most metrics simply treat discrete distributions as regular histograms (after quantization if necessary) and measure their differences only by comparing the aligned bins among histograms. When the supports of discrete distributions are at arbitrary locations and of variable sizes, these metrics cannot effectively measure the distance because the supports of two distributions may not be aligned. In such situations, which are prevalent in applications using the bag-of-word representation, the transportation metric is useful because it measures the similarity by solving an optimal matching problem known as the transportation problem. The problem was initially formulated by French mathematician Monge (1781) and later by Russian mathematician Kantorovich (1942). The metric is known as the Kantorovich-Wasserstein metric. It also has several variants, as well as names, in part due to its wide application in different fields. Readers can refer to the book by Rachev and Ruschendorf (1998) for some of these applications. For example, the metric is better known as the earth mover's distance (EMD) (Rubner et al., 1998) in computer vision and the Mallows distance in statistics literature (Mallows, 1972). To simplify the discussion, hereafter we refer to the Kantorovich-Wasserstein metric as the Wasserstein distance.
The Wasserstein distance is robust to quantization noise and efficient with sparse representations. It is proved to be an effective metric for image retrieval and annotation. The D2-clustering algorithm is constructed based upon this distance and has been applied successfully to image annotation. In addition, this distance is also adopted in multiple data mining and machine learning applications such as video classification and retrieval (Xu and Chang, 2008), sequence categorization (Pond et al., 2010), and document retrieval (Wan, 2007), where objects are represented by discrete distributions. As a fundamental machine learning tool to group discrete distributions with Wasserstein distance, D2-clustering can be highly valuable in these areas.
A major concern in applying D2-clustering is its scalability. Because linear programming is needed in D2-clustering and the number of unknown variables grows with the number of objects in a cluster, the computational cost would normally increase polynomially with the sample size (assuming the number of clusters is roughly fixed). As a result, on a typical 2 GHz single-core CPU, ALIPR takes several minutes to learn each semantic category by performing D2-clustering on 80 images in it, but demands more than a day to complete the modeling of a semantic category containing thousands of training images. There are some alternative ways (Julien and Saitta, 2008; Yavlinsky et al., 2005) to perform the clustering faster, but they generally perform some indirect or sub-optimal clustering and hence lack the optimization properties of the D2-clustering.
The clustering problems researchers tackle today are often of very large scale. For instance, we are seeking to perform learning and retrieval on images from the Web containing millions of pictures annotated by one common word. This leads to demanding clustering tasks even if we perform learning only on a tiny fraction of the Web. Adopting bags of weighted vectors as descriptors, we can encounter the same issue in other applications such as video, biological sequence, and text clustering, where the amount of data can be enormous. In such cases, D2-clustering is computationally impractical.
Given the application potential of D2-clustering and its theoretical optimization advantages, it is of great interest to develop a variant of the algorithm with a substantially lower computational complexity, especially under a parallel computing environment. With the increasing popularity of large-scale cloud computing in real-world information systems, we are well motivated to exploit parallel computing.
K-Means Clustering
K-means is one of the most widely applied clustering algorithms on vectors. Given a set of N vectors {right arrow over (v)}1, {right arrow over (v)}2, . . . , {right arrow over (v)}N, K-means groups them into k clusters with centroids {right arrow over (Z)}1, {right arrow over (Z)}2, . . . , {right arrow over (Z)}k. Each vector v is assigned with a clustering label C(i)=j, j∈{1, 2, . . . , k} if vi belongs to the j-th cluster. The objective of K-means clustering is to minimize the total within-cluster dispersionε=Σj=1kΣi:C(i)=j∥{right arrow over (v)}i−{right arrow over (z)}j∥2.  (1)
Both the labeling C(⋅) and the centroids {right arrow over (z)}j's are involved in the optimization. Because they interlace with each other in the optimization, they are iteratively updated in two stages. Within each iteration, C(i) is updated by assigning each {right arrow over (v)}i with the label of its closest cluster centroid and then the centroid {right arrow over (z)}j for each cluster is updated by averaging the vectors attached to the cluster. For each of the two stages the updating decreases the objective in Eq. (1) monotonically. The iteration terminates when the objective stabilizes below a certain value, as defined in the stopping criteria.
Discrete Distributions and Wasserstein Distance
A discrete distribution gives the probability masses of a random variable at values within a finite set of range. It can be formulated by a bag of value-probability tuples at all possible values of the random variable. For instance, if a random variable v takes value at v(1), v(2), . . . , v(t) with probability pv(1), pv(2) . . . pv(t) (Σα=1tpv(α)=1), the discrete distribution is written asv={(v(1),pv(1)), . . . ,(v(t),pv(t))}.
Given another discrete distribution u={(u(1),pu(1)), (u(1),pu(1)), . . . , (u(t),pu(t))}, the Wasserstein distance D(v,u) between v and u is solved by the following linear program:
                                          D            ⁡                          (                              u                ,                v                            )                                =                                    (                                                min                                      w                                          α                      ,                      β                                                                      ⁢                                                      Σ                                          α                      =                      1                                        t                                    ⁢                                      Σ                                          β                      =                      1                                                              t                      ′                                                        ⁢                                      ω                                          α                      ,                      β                                                        ⁢                                                                          ⁢                                                            d                      2                                        ⁡                                          (                                                                        v                                                      (                            α                            )                                                                          ,                                                  u                                                      (                            β                            )                                                                                              )                                                                                  )                                      1              2                                      ,                            (        2        )            subject to:Σα=1tωα,β=pu(β),∀β=1, . . . ,t′, Σβ=1t′ωα,β=pv(α),∀α=1, . . . ,t, ωα,β≥0,∀α=1, . . . ,t,∀β=1, . . . ,t′. 
In Eq. (2), ωα,β is the matching weights between the α-th element in v and β-th element in u. We denote d(v(α),u(β)) as the distance between v(α) and u(β) in the sample space of the random variable. If the sample space is a Euclidean space, L2 norm is normally adopted as the measure, i.e., d(v(α), u(β))=∥v(α)−u(β)∥. Hereafter we assume that the random variable of a discrete distribution is sampled from an Euclidean space unless otherwise noted. In fact, the metric used by the sample space is not essential for the Wasserstein distance because what make the computation elaborate are the matching weights solved in the linear programming.
D2-Clustering
D2-clustering is a clustering algorithm for any collection of discrete distributions within the same sample space. Working in the same spirit as K-means, it also aims at minimizing the total within-cluster dispersion. Because the distances between discrete distributions are calculated using Wasserstein distance, the objective function to be minimized is the total squared Wasserstein distance between the data and the centroids.
Suppose there are N discrete distributions and k centroids, we denote the discrete distributions as the objective is then formulated asvi={(vi(1),pvi(1)), . . . ,(vi(1),pvi(ti))},i=1,2, . . . ,N, and centroids aszj={(zj(1),pzj(1)), . . . ,(zj(sj),pzj(sj))},j=1, . . . ,k, the objective is then formulated asε=Σj=1kΣi:C(i)=jD2(zj,vi)
Although this optimization criterion is used by K-means, the computation in D2-clustering is much more sophisticated due to the nature of the Wasserstein distance, which is chosen because of its advantageous properties. The D2-clustering algorithm is analogous to K-means in that it iterates the update of the partition and the update of the centroids. There are two stages within each iteration. First, the cluster label of each sample is assigned to be the label of its closest centroid; and second, centroids are updated by minimizing the within-cluster sum of squared distances. To update the partition in D2-clustering, we need to compute the Wasserstein distance between a sample and each centroid. To update the centroids, an embedded iterative procedure is needed which involves a large-scale linear programming problem in every iteration. The number of unknowns in the optimization grows with the number of samples in the cluster.
To update a certain centroid zj for a cluster of discrete distributions v1, v2 . . . , vnj attached to it, the following optimization is pursued.minzjΣinjD2(zj,vi)  (3)wherezj={(zj(1),pzj(1)), . . . ,(zj(sj),pzj(sj))},vi={(vi(1),pvi(1)), . . . ,(vi(ti),pvi(ti))},D2(zj,vi)=min ωα,β(i)Σα=1sjΣβ=1ti∥zj(α)−vi(β)∥2 subject to:pz(α)≥0,α=1, . . . ,sj,Σα=1sjpzj(α)=1,Σβ=1tiωα,β(i)=pzj(α),i=1, . . . ,nj,α=1, . . . ,sj,Σr=1sjωα,β(i)=pvi(β),i=1, . . . ,nj,β=1, . . . ,ti,α=1, . . . ,sj,ωα,β(i)>0,i=1, . . . ,nj,α=1, . . . ,sj,β=1, . . . ,ti.
Because zj is not fixed in Eq. (3), we cannot separately optimize the Wasserstein distances (zj,vi), i=1, . . . , nj, because zj is involved and the variables ωα,β(i) for different i's affect each other through zj. To solve the complex optimization problem, the D2-clustering algorithm adopts an iterative process as shown in Algorithm 1.
ALGORITRM 1: Centroid Update Process of D2-ClusteringInput: A collection of discrete distributions v1, v2, . . . ,vnj thatbelong to cluster j.Output: The centroid zj of the input discrete distributions in terms ofMallows distance.begin|while objective εj (defined in (3)) does not reach convergence do||Fix zj(α)'s update pzj(α)'s and wα,β(i)'s in (3). In this case the ||optimization reduces to a linear programming.||Fix pzj(α)'s and update zj(α), α = 1, . . . , sj. In this case the ||solution of zj(α)'s to optimize (3) are simply weighted averages:| | | | | | || | | | | | |            z      j              (        α        )              =                            ∑                      i            =            1                                n            j                          ⁢                                  ⁢                              ∑                          β              =              1                                      t              i                                ⁢                                          ⁢                                    w                              α                ,                β                                            (                i                )                                      ⁢                          v              i                              (                β                )                                                                          ∑                      i            =            1                                n            j                          ⁢                                  ⁢                              ∑                          β              =              1                                      t              i                                ⁢                      w                          α              ,              β                                      (              i              )                                            ,      α    =    1    ,  …  ⁢          ,            s      j        .  ||Calculate the objective function εj in (3) using the updated||zj(α)'s, pzj(α)'s, and wα,β(i)'s.|endend
Inside the “while” loop of Algorithm 1, the linear programming in line 1 is the primary cost in the centroid update process. It solves pzj(α), ωα,β(i), α=1, . . . , sj, i=1, . . . , nj, β=1, . . . , ti, a total of sj+sjΣi=1n_jti≃njsj2 parameters when s; is set to be the average of the ti's. The number of constraints is 1+njsj+Σ1njti≃2njsj. Hence, the number of parameters in the linear program grows linearly with the size of the cluster. Because it costs polynomial time to solve the linear programming problem (Spielman and Teng, 2004), D2-clustering has a much higher computational complexity than K-means algorithm.
The analysis of the time complexity of K-means remains an unsolved problem because the number of iterations for convergence is difficult to determine. Theoretically, the worst case time complexity for K-means is Ω(2√{square root over (n)}) (Arthur and Vassilvitskii, 2006). Arthur et al. (2011) show recently that K-means has polynomial smoothed complexity, which reflects the fact that K-means converges in a relatively short time in real cases (although no tight upper or lower bound of time complexity has been proved). In practice, K-means usually converges fast, and seldom costs an exponential number of iterations. In many cases K-means converges in linear or even sub-linear time (Duda et al., 2012). Kumar et al. (2010) present a random sampling and estimation approach to guarantee K-means to compete in linear time. Although there is still a gap between the theoretical explanation and practice, we can consider K-means an algorithm with at most polynomial time.
Although the D2-clustering algorithm interlaces the update of clustering labels and centroids in the same manner as K-means, the update of centroids is much more complicated. As described above, to update each centroid involves an iterative procedure where a large-scale linear programming problem detailed in (3) has to be solved in every round. This makes D2-clustering polynomially more complex than K-means in general.
The discrete distribution, or discrete probability measure, is a well-adopted way to summarize a mass of data. It often serves as a descriptor for complex instances encountered in machine learning, e.g., images, sequences, and documents, where each instance itself is converted to a data collection instead of a vector. In a variety of research areas including multimedia retrieval and document analysis, the celebrated bag of “words” data model is intrinsically a discrete distribution. The widely used normalized histogram is a special case of discrete distributions with a fixed set of support points across the instances. In many applications involving moderate or high dimensions, the number of bins in the histograms is enormous because of the diversity across instances, while the histogram for any individual instance is highly sparse. This is frequently observed for collections of images and text documents.
The discrete distribution can function as a sparse representation for the histogram, where support points and their probabilities are specified in pairs, eliminating the restriction of a fixed quantization codebook across instances.
The Wasserstein distance is a true metric for measures and can be traced back to the mass transport problem proposed by Monge in 1780s and the relaxed formulation by Kantorovich in the 1940s. Mallows used this distance to prove some theoretical results in statistics, and thus the name Mallows distance has been used by statisticians. In computer science, it is better known as the Earth Mover's Distance. In signal processing, it is closely related to the Optimal Sub-Pattern Assignment (OSPA) distance used recently for multi-target tracking. The Wasserstein distance is computationally costly because the calculation of the distance has no closed form solution and its worst time complexity is at least O(n3 log n) subject to the number of support points in the distribution.
Clustering Distributions:
Distribution clustering can be done subject to different affinity definitions. For example, Bregman clustering pursues the minimal distortion between the cluster prototype—called the Bregman representative—and cluster members according to a certain Bregman divergence (Banerjee et al. 2005). In comparison, D2-clustering is an extension of K-means to discrete distributions under the Wasserstein distance (Li and Wang 2008), and the cluster prototype is an approximate Wasserstein barycenter with sparse support. In the D2-clustering framework, solving the cluster prototype or the centroid for discrete distributions under the Wasserstein distance is computationally challenging (Cuturi and Doucet 2014; Ye and Li 2014; Zhang, Wang, and Li 2015). In order to scale up the computation of D2-clustering, a divide-and-conquer approach has been proposed (Zhang, Wang, and Li 2015), but the method is ad-hoc from optimization perspective. A standard ADMM approach has also been explored (Ye and Li 2014), but its efficiency is still inadequate for large datasets. Although fast computation of Wasserstein distances has been much explored (Pele and Werman 2009; Cuturi 2013; H. Wang and Banerjee 2014), how to perform top-down clustering efficiently based on the distance has not. We present below the related work and discuss their relationships with our current work.
The True Discrete Wasserstein Barycenter and Two Approximation Strategies:
The centroid of a collection of distributions minimizing the average pth-order power of the Lp Wasserstein distance is called Wasserstein barycenter (Agueh and Carlier 2011). In the D2-clustering algorithm (Li and Wang 2008), the 2nd order Wasserstein barycenter is simply referred to as a prototype or centroid; and is solved for the case of an unknown support with a pre-given cardinality. The existence, uniqueness, regularity and other properties of the 2nd order Wasserstein barycenter have been established mathematically for continuous measures in the Euclidean space (Agueh and Carlier 2011). But the situation is more intricate for discrete distributions, as will be explained later.
Given N arbitrary discrete distributions each with m support points, their true Wasserstein barycenter in theory can be solved via linear programming (Agueh and Carlier 2011; Anderes, Borgwardt, and Miller 2015). This is because the support points of the Wasserstein barycenter can only locate at a finite (yet huge) number of possible positions. However, solving the true discrete barycenter quickly becomes intractable even for a rather small number of distributions containing only 10 support points each. An important theoretical progress has been made by Anderes et al. (Anderes, Borgwardt, and Miller 2015) who proved that the actual support of a true barycenter of N such distributions is extremely sparse, with cardinality m no greater than mN. However, the complexity of the problem is not reduced practically because so far there is no theoretically ensured way to sift out the optimal sparse locations. On the other hand, the new theory seems to backup the practice of assuming a pre-selected number of support points in a barycenter as an approximation to the true solution.
To achieve good approximation, there are two computational strategies one can adopt in an optimization framework:                1. Carefully select beforehand a large and representative set of support points as an approximation to the support of the true barycenter (e.g. by K-means).        2. Allow the support points in a barycenter to adjust positions at every τ iterations.        
The first strategy of fixing the support of a barycenter can yield adequate approximation quality in low dimensions (e.g. 1D/2D histogram data) (Cuturi and Doucet 2014; Benamou et al. 2015), but can face the challenge of exponentially growing support size when the dimension increases. The second strategy allows one to use a possibly much smaller number of support points in a barycenter to achieve the same level of accuracy (Li and Wang 2008; Zhang, Wang, and Li 2015; Ye and Li 2014; Cuturi and Doucet 2014). Because the time complexity per iteration of existing iterative methods is O(mmN), a smaller m can also save substantial computational time so as to trade in with the extra amount of time O(mmdN/τ) to recalculate the distance matrices. In the extreme case when the barycenter support size is set to one (m=1), D2-clustering reduces to K-means on the distribution means, which is a meaningful way of data reduction in its own right. Our experiments indicate that in practice a large m in D2-clustering is usually unnecessary.
In applications on high-dimensional data, it is desirable to optimize the support points rather than fix them from the beginning. This however leads to a non-convex optimization problem. Our work aims at developing practical numerical methods. In particular, the method optimizes jointly the locations and weights of the support points in a single loop without resorting to a bi-level optimization reformulation, as was done in earlier work (Li and Wang 2008; Cuturi and Doucet 2014).
Solving Discrete Wasserstein Barycenter in Different Data Settings:
Recently, a series of works have been devoted to solving the Wasserstein barycenter given a set of distributions (e.g. (Carlier, Obennan, and Oudet 2014; Cuturi and Doucet 2014; Benamou et al. 2015; Ye and Li 2014; Cuturi, Peyre, and Rolet 2015)). How our method compares with the existing ones depends strongly on the specific data setting. We discuss the comparisons in details below and motivate the use of our new method in D2-clustering.
In (Benamou et al. 2015; Cuturi and Doucet 2014; Cuturi 2013), novel algorithms have been developed for solving the Wasserstein barycenter by adding an entropy regularization term on the optimal transport matching weights. The regularization is imposed on the transport weights, but not the barycenter distribution. In particular, iterative Bregman projection (IBP) (Benamou et al. 2015) can solve an approximation to the Wasserstein barycenter. IBP is highly memory efficient for distributions with a shared support set (e.g. histogram data), with a memory complexity O((m+m)N). In comparison, our modified B-ADMM approach is of the same time complexity, but requires 0 (mmN) memory. If N is large, memory constraints can limit our approach to problems with relatively small m or m. Consequently, the second approximation strategy is crucial for reaching high accuracy by our approach, while the first strategy may not meet the memory constraint. In the conventional OT literature, the emphasis is on computing the Wasserstein barycenter for a small number of instances with dense representations (e.g. Solomon et al. 2015; Rabin et al. 2011); and IBP is more suitable. But in many machine learning and signal processing applications, each instance is represented by a discrete distribution with a sparse support set (aka, m is small). The memory limitation of B-ADMM can be avoided via parallelization until the time budget is spent. Our focus is thus to achieve scalability in N. B-ADMM has important advantages over IBP that motivate the usage of the former in practice.
First of all, it is interesting to note that if the distributions do not share the support set, IBP (Benamou et al. 2015) has the same memory complexity O(mmN) (for caching the distance matrix per instance) as our approach. In addition, B-ADMM (H. Wang and Banerjee 2014), based on which our approach is developed, has the following advantages: (1) It yields the exact OT and distance in the limit of iterations. Note that the ADMM parameter does not trade off the convergence rate. (2) It exhibits different convergence behaviors, accommodating warm starts and early stops (to be illustrated later), valuable traits for the task of D2-clustering. (3) It works well with single-precision floats, thus not pestered by the machine precision constraint. In contrast, this issue is serious for IBP, preventing it from robustly outputting adequately accurate discrete Wasserstein barycenters with sparse support (See Benamou et al. 2015) and our experiments). Here the “adequately accurate” means close to a local minimizer of sum of the (squared) Wasserstein distances.
Optimization Methods Revisited:
Our main algorithm disclosed herein is inspired by the B-ADMM algorithm of Wang and Banerjee (H. Wang and Banerjee 2014) for solving OT fast. They developed the two-block version of ADMM (Boyd et al. 2011) along with Bregman divergence to solve the OT problem when the number of support points is extremely large. Its algorithmic relation to IBP (Benamou et al. 2015) is discussed herein. The OT problem at a moderate scale can in fact be efficiently handled by state-of-the-art LP solvers (Tang et al. 2013). As demonstrated by the line of work on solving the barycenter, optimizing the Wasserstein barycenter is rather different from computing the distance. Naively adapting the B-ADMM to Wasserstein barycenter does not result in a proper algorithm.