1. Technical Field
The present invention relates to a data processing method and to a computer-readable medium encoded with a computer program for executing the method. More particularly, the present invention relates to a method for clustering data and to a computer-readable medium encoded with a computer program for executing that method.
2. Description of Related Art
In statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: (1) Agglomerative. This is a “bottom up” approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. (2) Divisive. This is a “top down” approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
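By way of illustration only, the “bottom up” strategy may be sketched as follows. The sketch assumes one-dimensional observations and merges, at each step, the two clusters separated by the smallest gap between any of their members; the function names `gap` and `agglomerate` are hypothetical and do not come from any particular library.

```python
from itertools import combinations

def gap(cluster_a, cluster_b):
    """Smallest absolute difference between any member of a and any member of b."""
    return min(abs(a - b) for a in cluster_a for b in cluster_b)

def agglomerate(values):
    """Bottom-up clustering of 1-D values; returns the merges in the order performed."""
    clusters = [[v] for v in values]  # each observation starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest gap (an assumed merge rule).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = sorted(clusters[i] + clusters[j])
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

if __name__ == "__main__":
    for left, right in agglomerate([0, 3, 10, 11, 30]):
        print(left, "+", right)
```

Reading the returned merges from first to last traces the hierarchy from the individual observations up to the single all-inclusive cluster; a divisive strategy would traverse the same kind of hierarchy in the opposite direction.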
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pair-wise distances between observations in the sets. Some commonly used metrics for hierarchical clustering are: Euclidean distance, squared Euclidean distance, Manhattan distance, maximum distance, Mahalanobis distance and cosine similarity. Some commonly used linkage criteria between two sets of observations are maximum (complete-linkage) clustering, minimum (single-linkage) clustering, and mean (average-linkage) clustering.
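The relationship between a metric and a linkage criterion may be illustrated by the following minimal sketch. It assumes observations are given as equal-length tuples of numbers; the function names and the `how` parameter are hypothetical, chosen for illustration only.

```python
import math
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def maximum(p, q):
    """Maximum distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))

def linkage(set_a, set_b, metric, how):
    """Dissimilarity of two sets as a function of their pair-wise distances."""
    distances = [metric(p, q) for p, q in product(set_a, set_b)]
    if how == "complete":   # maximum pair-wise distance
        return max(distances)
    if how == "single":     # minimum pair-wise distance
        return min(distances)
    if how == "average":    # mean pair-wise distance
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage criterion: {how}")

if __name__ == "__main__":
    a = [(0, 0), (0, 1)]
    b = [(3, 4), (3, 0)]
    for how in ("single", "complete", "average"):
        print(how, linkage(a, b, euclidean, how))
```

The metric and the linkage criterion are passed independently, which reflects the fact that any of the listed metrics can be paired with any of the listed linkage criteria.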
Although metrics and linkage criteria abound, none is designed for tracking error in the process of clustering, an exacting task essential for improving accuracy and for quantitatively assessing data-originated uncertainty.