As commonly understood, “data” refers to a collection of organized information, the result of experience, observation or experiment, to other information within a computer system, to a set of premises that may consist of numbers, characters or images, or to measurements of observations. Its use and properties are described in detail in U.S. patent application Ser. No. 12/388,371, “Classification and Recognition via Diffusion and Anomaly Processing” by Amir Averbuch, Ronald R. Coifman and Gil David, which is incorporated herein by reference in its entirety. Also described and defined therein are terms used in this invention such as “diffusion maps”, “affinity matrix” and “distance metric”.
In many cases, the data is high-dimensional (also called multi-dimensional), with a data dimension N>3. Multi-dimensional data is a collection of data points. A “data point” (also referred to herein as “sample”, “sampled data”, “point”, “vector of observations” and “vector of measurements”) is one unit of data of the original (“source” or “raw”) multi-dimensional data. A data point may be expressed by Boolean, numeric values and characters, or combinations thereof. If source data is described for example by 25 measured parameters (also referred to as “features”) which are sampled (recorded, measured) in a predetermined time interval (e.g. every minute), then the data is of dimension N=25. In this case, each data point is a vector of dimension 25.
In this description, the term “feature” refers to an individual measurable property of phenomena being observed. A feature is usually numeric, but may also be structural, for example a string. “Feature” is also normally used to denote a piece of information which is relevant for solving the computational task related to a certain application. “Feature” may also refer to a specific structure, ranging from a simple structure to a more complex structure such as an object. The “feature” concept is very general and the choice of features in a particular application may be highly dependent on the specific problem at hand. In the example above in which the data is of dimension N=25, each component in the vector of dimension 25 is a feature.
“Clustering”, as applied to data comprised of data points, refers to the process of finding in the data similar areas which identify common (similar) trends. These areas are called clusters. “Clustering” is also defined as the assignment of a set of observations into subsets (the clusters), such that observations in the same cluster are similar in some sense. Data clustering algorithms can be hierarchical. Hierarchical algorithms find successive clusters using previously established clusters. “Successive” refers to an operation which advances in time. Hierarchical algorithms can be either agglomerative (“bottom-up”) or divisive (“top-down”). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Hierarchical clustering methods are described for example in S. C. Johnson, “Hierarchical Clustering Schemes”, Psychometrika, vol. 32, pages 241-254, 1967 and in U.S. Pat. Nos. 7,590,291 and 7,627,542, all of which are incorporated herein by reference in their entirety.
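The agglomerative (“bottom-up”) procedure described above can be illustrated by a minimal sketch. It assumes Euclidean distance and single linkage (the distance between two clusters is the smallest distance between their members); neither choice is prescribed by this description, and the function name and toy data are hypothetical.

```python
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    """Naive bottom-up clustering: start with singletons, merge the closest pair."""
    # Each element begins as its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(X))]
    # Pairwise Euclidean distances between all points (an assumed metric).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between members of a and b.
                dist = D[np.ix_(clusters[a], clusters[b])].min()
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, producing a successively larger cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy one-dimensional data with two obvious groups (illustrative values only).
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
result = agglomerative_single_linkage(X, n_clusters=2)
```

A divisive algorithm would run in the opposite direction, starting from the single cluster containing all five points and splitting it.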
The choice of a distance measure is an important step in any clustering. The distance measure defines how the similarity of two data points is calculated. This influences the shape of the clusters, as some data points may be close to one another according to one distance measure and far from one another according to another distance measure.
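As a small illustration of this dependence, the following sketch compares two common distance measures, Euclidean distance and cosine distance, on three toy points. The points and the two measures are chosen here for illustration only and are not taken from this description.

```python
import numpy as np

# Three toy data points (hypothetical values chosen for illustration).
a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])  # same direction as a, but far away
c = np.array([1.0, 1.0])   # near a, but pointing in a different direction

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))

def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

# Which of b and c is the nearest neighbor of a under each measure?
euclid_nearest = "b" if euclidean(a, b) < euclidean(a, c) else "c"
cosine_nearest = "b" if cosine_distance(a, b) < cosine_distance(a, c) else "c"
```

Under the Euclidean measure the nearest neighbor of `a` is `c`, while under the cosine measure it is `b`, so a clustering built on one measure can group the points differently from a clustering built on the other.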
Diffusion maps were introduced in R. R. Coifman and S. Lafon, “Diffusion maps”, Applied and Computational Harmonic Analysis, vol. 21(1), pages 5-30, 2006 (referred to hereinafter as “DM”) and in US patent application 20060004753A1, both incorporated herein by reference in their entirety. A diffusion map constructs coordinates that parameterize the dataset, while a diffusion distance provides a locality-preserving metric for this data. Let $\Gamma=\{x_1,\ldots,x_m\}$ be a set of points in $\mathbb{R}^n$. We construct the graph $G(V,E)$, $|V|=m$, $|E|\ll m^2$, on $\Gamma$ in order to find the intrinsic geometry of this set. A weight function $W_\varepsilon = w_\varepsilon(x_i,x_j)$, which measures the pair-wise similarity between the points in a dataset, is introduced. For all $x_i,x_j\in\Gamma$, this weight function is symmetric, non-negative and positive semi-definite. A common choice for $W_\varepsilon$ is
$$w_\varepsilon(x_i,x_j)=e^{-\frac{\|x_i-x_j\|^2}{\varepsilon}},$$
where $\varepsilon$ is a parameter determined as explained below. The non-negativity property of $W_\varepsilon$ allows the weights to be normalized into a Markov transition matrix $P=\{p(x_i,x_j)\}$, $i,j=1,\ldots,m$, in which the states of the corresponding Markov process are the data points. This enables $\Gamma$ to be analyzed as a random walk.
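The Gaussian weight function can be sketched as follows. The function name `gaussian_kernel`, the toy points and the value of the scale parameter are assumptions made for illustration; this description does not fix a value for the scale parameter at this point.

```python
import numpy as np

def gaussian_kernel(X, eps):
    """Pairwise weights w(x_i, x_j) = exp(-||x_i - x_j||^2 / eps)."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Guard against tiny negative values caused by floating-point rounding.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / eps)

# Three toy points: two close together and one far away (hypothetical data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
W = gaussian_kernel(X, eps=1.0)  # eps=1.0 is an arbitrary illustrative scale
```

The resulting matrix is symmetric and non-negative, each point has weight 1 with itself, and nearby points receive a much larger weight than distant ones.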
The construction of P is known as the normalized graph Laplacian, described in Fan R. K. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, No. 92, 1997. Formally, $P=\{p(x_i,x_j)\}_{i,j=1}^{m}$ is constructed as
$$p(x_i,x_j)=\frac{w_\varepsilon(x_i,x_j)}{d(x_i)},\quad\text{where}\quad d(x_i)=\int_\Gamma w_\varepsilon(x_i,x_j)\,d\mu(x_j)$$
is the degree of $x_i$ and $\mu$ is the distribution of the points on $\Gamma$. P is a Markov matrix, since the sum of each row in P is 1 and $p(x_i,x_j)\ge 0$. Thus, $p(x_i,x_j)$ can be viewed as the probability to move from one point $x_i$ to another point $x_j$ in one time-step. By raising P to a power t (advance in time), this influence is propagated to nodes in the neighborhood of $x_i$ and $x_j$ and the result is the probability for this move in t time-steps. We denote this probability by $p_t(x_i,x_j)$. The probabilities between all the data points (for t=1) or set of folders (for t>1) measure the connectivity among the data points within the graph $G(V,E)$, $|V|=m$, $|E|\ll m^2$. The parameter t controls the scale of the neighborhood in addition to the scale control provided by $\varepsilon$.
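The row normalization and the advance in time can be sketched as follows. The sketch assumes a finite sample, so that the integral defining the degree becomes a plain sum over the data points; the toy data, the scale parameter and the number of time-steps are hypothetical.

```python
import numpy as np

# Two well-separated pairs of points on a line (hypothetical data).
X = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0], [3.5, 0.0]])
eps = 1.0  # illustrative scale parameter
t = 8      # illustrative number of time-steps

# Gaussian weights w(x_i, x_j) = exp(-||x_i - x_j||^2 / eps).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / eps)

# Degree d(x_i): for a finite sample the integral is taken as a sum over j.
d = W.sum(axis=1)

# Markov matrix: p(x_i, x_j) = w(x_i, x_j) / d(x_i); every row sums to 1.
P = W / d[:, None]

# Transition probabilities after t time-steps: the t-th matrix power of P.
P_t = np.linalg.matrix_power(P, t)

# Probability of moving from point 0 to the far pair, in 1 step and in t steps.
far_one_step = P[0, 2:].sum()
far_t_steps = P_t[0, 2:].sum()
```

In this example the probability of reaching the far pair of points is larger after t steps than after one step, which is the sense in which t widens the scale of the neighborhood.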
The kernel
$$\tilde{p}(x_i,x_j)=\frac{\sqrt{d(x_i)}}{\sqrt{d(x_j)}}\,p(x_i,x_j),$$
which is symmetric and positive definite, leads to the following eigen-decomposition: $\tilde{p}(x_i,x_j)=\sum_{k=0}^{m-1}\lambda_k v_k(x_i)v_k(x_j)$. A similar eigen-decomposition, $\tilde{p}_t(x_i,x_j)=\sum_{k=0}^{m-1}\lambda_k^t v_k(x_i)v_k(x_j)$, is obtained after advancing t times on the graph. Here $\tilde{p}_t$ is the symmetric counterpart of $p_t(x_i,x_j)$, the probability of transition from $x_i$ to $x_j$ in t time-steps.
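The symmetric kernel and its eigen-decomposition can be checked numerically with a short sketch. It assumes that the symmetric kernel is the Markov matrix scaled by the square root of the degree of the source point divided by the square root of the degree of the target point, which equals the weight divided by the square root of the product of the two degrees. The random toy data and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # hypothetical toy data
eps, t = 1.0, 3              # illustrative parameter values

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / eps)
d = W.sum(axis=1)
P = W / d[:, None]

# Symmetric conjugate: sqrt(d_i) / sqrt(d_j) * p_ij = w_ij / sqrt(d_i * d_j).
P_sym = np.sqrt(d)[:, None] * P / np.sqrt(d)[None, :]

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder to get lambda_0 >= lambda_1 >= ...
lam, V = np.linalg.eigh(P_sym)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Spectral form after t steps: sum_k lambda_k^t v_k(x_i) v_k(x_j).
P_sym_t = (V * lam ** t) @ V.T
```

The spectral form agrees with the t-th matrix power of the symmetric kernel, and the largest eigenvalue equals 1 because the symmetric kernel shares its eigenvalues with the Markov matrix.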
A fast decay of $\{\lambda_k\}$ is achieved by an appropriate choice of $\varepsilon$. Thus, only a few terms are required in the sum above to achieve a given relative cover $\delta>0$. Let $\eta(\delta)$ be the number of retained terms. The diffusion maps introduced in DM include a family $\{\Phi_t(x)\}_{t\in\mathbb{N}}$ given by $\Phi_t(x)=(\lambda_0^t v_0(x),\lambda_1^t v_1(x),\ldots)^T$. The map $\Phi_t:\Gamma\to\ell^N$ embeds the dataset into a Euclidean space $\mathbb{R}^N$. The diffusion distance is defined as $D_t^2(x_i,x_j)=\sum_{x_k\in\Gamma}\left(\tilde{p}_t(x_i,x_k)-\tilde{p}_t(x_k,x_j)\right)^2$. The diffusion distance can be expressed in terms of the eigenvectors $v_k$: $D_t^2(x_i,x_j)=\sum_{k\ge 0}\lambda_k^{2t}\left(v_k(x_i)-v_k(x_j)\right)^2$. It follows that in order to compute the diffusion distance, we can use the eigenvectors of $\tilde{P}$. Moreover, this facilitates the embedding of the original points in a Euclidean space $\mathbb{R}^{\eta(\delta)-1}$ by $\Xi_t:x_i\to\left(\lambda_0^t v_0(x_i),\lambda_1^t v_1(x_i),\lambda_2^t v_2(x_i),\ldots,\lambda_{\eta(\delta)}^t v_{\eta(\delta)}(x_i)\right)$. This also provides coordinates on the set $\Gamma$. Essentially, $\eta(\delta)\ll m$, due to the fast decay of the spectrum of P.
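The relation between the embedding and the diffusion distance can be illustrated with a short sketch. It assumes that the diffusion distance is computed from the symmetric kernel after t steps and that the eigenvectors are orthonormal, as returned by `numpy.linalg.eigh`. The function name `diffusion_map`, the parameter `n_terms` (standing in for the number of retained terms) and the toy data are hypothetical.

```python
import numpy as np

def diffusion_map(X, eps, t, n_terms):
    """Return the first n_terms diffusion coordinates and the symmetric kernel."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / eps)
    d = W.sum(axis=1)
    # Symmetric kernel: w_ij / sqrt(d_i * d_j).
    P_sym = W / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(P_sym)
    order = np.argsort(lam)[::-1]  # largest eigenvalue first
    lam, V = lam[order], V[:, order]
    # Row i holds (lambda_0^t v_0(x_i), lambda_1^t v_1(x_i), ...).
    coords = V[:, :n_terms] * lam[:n_terms] ** t
    return coords, P_sym

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # hypothetical toy data
t = 2

# Keep all terms: squared Euclidean distance in the embedding equals D_t^2.
full, P_sym = diffusion_map(X, eps=2.0, t=t, n_terms=len(X))
P_sym_t = np.linalg.matrix_power(P_sym, t)
d2_kernel = np.sum((P_sym_t[0] - P_sym_t[1]) ** 2)
d2_embed = np.sum((full[0] - full[1]) ** 2)

# Keep only a few leading terms: an approximation from below.
trunc, _ = diffusion_map(X, eps=2.0, t=t, n_terms=3)
d2_trunc = np.sum((trunc[0] - trunc[1]) ** 2)
```

With all terms retained, the two ways of computing the squared diffusion distance agree. With only the leading terms, the embedded distance is a lower bound, and it is a close one when the eigenvalues decay quickly.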
P is the affinity matrix of the dataset and it is used to find the diffusion distances between data points. This distance metric can be used to cluster the data points according to the propagation of the diffusion distances, which is controlled by t. In addition, it can be used to construct a bottom-up hierarchical clustering of the data. For t=1, the affinity matrix reflects local and direct connections between adjacent data points. The resulting clusters preserve the local neighborhood of each point. When t is raised, the affinity matrix changes accordingly and reflects indirect connections between data points in the graph. The diffusion distance between data points in the graph then accounts for all possible paths between these points at the given time step. The more we advance in time, the more indirect and global connections are included. Therefore, by raising t we can construct the top levels of the clustering hierarchy. With each advance in time, it is possible to merge more and more bottom-level clusters, since there are more and more new paths between them. The resulting clusters reflect the global neighborhood of each point, which is highly affected by the parameter t.
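The merging of clusters with the advance in time can be illustrated with a simplified sketch. The grouping rule used here, connected components of the graph that links two points whenever their diffusion distance is below a fixed threshold, is a stand-in chosen for brevity and is not the clustering procedure of any cited method. The data, the threshold and the list of time-steps are hypothetical.

```python
import numpy as np

def diffusion_distances(X, eps, t):
    """Pairwise diffusion distances at time t from the symmetric kernel."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / eps)
    d = W.sum(axis=1)
    lam, V = np.linalg.eigh(W / np.sqrt(np.outer(d, d)))
    coords = V * lam ** t  # all diffusion coordinates at time t
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def count_clusters(D, threshold):
    """Number of connected components when points closer than threshold are linked."""
    labels = -np.ones(len(D), dtype=int)
    n_found = 0
    for start in range(len(D)):
        if labels[start] >= 0:
            continue
        labels[start] = n_found
        stack = [start]
        while stack:  # depth-first search over linked points
            i = stack.pop()
            for j in np.flatnonzero((D[i] <= threshold) & (labels < 0)):
                labels[j] = n_found
                stack.append(int(j))
        n_found += 1
    return n_found

# Three loose groups of points on a line (hypothetical data).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [6.0, 0.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(5, 2)) for c in centers])

steps = [1, 2, 4, 8, 16, 64]
counts = [count_clusters(diffusion_distances(X, eps=2.0, t=t), threshold=0.05)
          for t in steps]
```

Because the eigenvalues lie between 0 and 1, every diffusion distance can only shrink as t grows, so under this rule the number of clusters in `counts` never increases from one time-step to the next.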
In known hierarchical clustering methods, the affinity matrix and the diffusion distances are global. With the advance in time, more global “rare connections” (sparse, loose data points) become part of the generated clusters. This translates into increased noise in the affinity matrix. The resulting clusters become sensitive to the parameters t and ε and to the geometry of the dataset. In other words, by increasing t in this global approach, the clustering noise in the affinity matrix is increased. This causes convergence of data points to only a few clusters which may be “wrong” clusters, leading to a decrease in the clustering accuracy.
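One facet of this convergence can be seen in a small numerical sketch: for a connected graph, the rows of the global Markov matrix raised to a large power t all approach the same stationary distribution, so the data points become indistinguishable from one another. The sketch does not model the noise effect described above, and the grid data and parameter values are hypothetical.

```python
import numpy as np

# Six points on a 2-by-3 grid (hypothetical data).
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [2, 1]], dtype=float)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 2.0)  # illustrative scale parameter of 2.0
d = W.sum(axis=1)
P = W / d[:, None]

# Stationary distribution of the random walk: degrees normalized to sum to 1.
pi = d / d.sum()

def row_spread(M):
    """Largest deviation of any row of M from the average row."""
    return float(np.max(np.abs(M - M.mean(axis=0))))

P_large = np.linalg.matrix_power(P, 100)
spread_t1 = row_spread(P)
spread_large = row_spread(P_large)
```

At t=1 the rows of P differ from point to point, while at t=100 every row is numerically equal to `pi`, which is the fully global limit in which all points fall into a single cluster.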
Accordingly, there is a need for and it would be advantageous to have a hierarchical clustering method that uses a local instead of a global approach in order to increase the accuracy of the clustering.