Exemplary embodiments of the present invention relate to data classification, and more particularly, to shape interpolation of clustered data.
Data mining involves sorting through large amounts of data and extracting relevant predictive information. Traditionally used by business intelligence organizations and financial analysts, data mining is increasingly being used in the sciences to extract information from the enormous datasets that are generated by modern experimental and observational methods. Data mining can be used to identify trends within data that go beyond simple analysis through the use of sophisticated algorithms.
Many data mining applications depend on partitioning data elements into related subsets. Therefore, classification and clustering are important tasks in data mining. Clustering is the unsupervised categorization of objects into different groups, or more precisely, the organizing of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. A cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. The goal of clustering is to determine an intrinsic grouping, or structure, in a set of unlabeled data. Clustering can be used to perform statistical data analysis in many fields, including machine learning, data mining, document retrieval, pattern recognition, medical imaging and other image analysis, and bioinformatics.
Classification is a statistical procedure in which individual items are placed into groups based on quantitative information on one or more traits inherent in the items and based on a training set of previously labeled (or pre-classified) patterns. As with clustering, a dataset is divided into groups based upon proximity such that the members of each group are as “close” as possible to one another, and different groups are as “far” as possible from one another, where distance is measured with respect to specific trait(s) that are being analyzed.
An important difference should be noted when comparing clustering and classification. In classification, a collection of labeled patterns is provided, and the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given training patterns are used to learn the descriptions of classes, which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, clusters can be seen as labeled patterns that are obtained solely from the data. Therefore, classification often follows clustering, although classification may also be performed without explicit clustering (for example, Support Vector Machine classification, described below). In situations in which classification is performed once the clusters have been identified, new data is typically classified by projecting the data into the multidimensional space of clusters and classifying the new data point based on proximity, that is, distance, to the nearest cluster centroid. The centroid of a cluster having a finite set of points can be computed as the arithmetic mean of each coordinate of the points.
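The centroid computation and nearest-centroid classification described above can be sketched as follows. This is a minimal illustration only: the function names, the cluster labels, and the sample coordinates are hypothetical, and Euclidean distance is assumed as the proximity measure.

```python
from math import dist

def centroid(points):
    """Arithmetic mean of each coordinate over a finite set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def classify_by_centroid(point, clusters):
    """Return the label of the cluster whose centroid is nearest to `point`."""
    centroids = {label: centroid(pts) for label, pts in clusters.items()}
    return min(centroids, key=lambda label: dist(point, centroids[label]))

# Hypothetical clusters that have already been identified.
clusters = {
    "A": [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    "B": [(10.0, 10.0), (12.0, 10.0), (11.0, 13.0)],
}

# A new, unlabeled data point is assigned to the nearest centroid.
label = classify_by_centroid((2.0, 2.0), clusters)
```

Here the centroid of cluster "A" is (1.0, 1.0), so the new point (2.0, 2.0) is labeled "A".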
The variety of techniques for representing data, measuring proximity between data elements, and grouping data elements has produced a rich assortment of classification and clustering methods.
In Support Vector Machine (SVM) classification, when a new data point is classified based on proximity, the distance is taken to the nearest data points coming from the clusters, called support vectors (even though there is no explicit representation of the clusters). Each data point is represented by a p-dimensional input vector (a list of p numbers) and belongs to only one of two classes. The input vectors are mapped to a higher dimensional space in which a maximal separating hyperplane is constructed. SVM aims to separate the classes with a “p minus 1”-dimensional hyperplane. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. To achieve maximum separation between the two classes, the separating hyperplane is selected that maximizes the distance between the two parallel hyperplanes. That is, the smallest distance between a point on one parallel hyperplane and a point on the other parallel hyperplane is maximized.
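The margin-maximization idea can be illustrated with a small sketch. It does not train an SVM; it assumes a linear hyperplane written in the common form w·x + b = 0, with the two parallel hyperplanes at w·x + b = +1 and w·x + b = -1, so that the distance between them is 2/||w||. The training samples and the two candidate hyperplanes are hypothetical values chosen for illustration.

```python
from math import sqrt

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the parallel hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))

def respects_margin(w, b, samples):
    """True if every labeled sample lies on or outside its class's parallel hyperplane."""
    return all(label * decision_value(w, b, x) >= 1.0 for x, label in samples)

# Hypothetical two-class training data with labels +1 and -1.
samples = [((2.0, 1.0), -1), ((1.0, 4.0), -1), ((4.0, 5.0), 1), ((6.0, 2.0), 1)]

# Two hypothetical candidate hyperplanes (w, b); both separate the data.
candidates = [((1.0, 0.0), -3.0), ((2.0, 0.0), -6.0)]

# Among the candidates that respect the margin, keep the widest one.
feasible = [c for c in candidates if respects_margin(c[0], c[1], samples)]
best_w, best_b = max(feasible, key=lambda c: margin_width(c[0]))

def classify(x):
    """Label a new point by the side of the selected hyperplane it falls on."""
    return 1 if decision_value(best_w, best_b, x) >= 0 else -1
```

Both candidates separate the samples, but the first has a margin width of 2.0 against 1.0 for the second, so it is the one selected. The samples (2.0, 1.0) and (4.0, 5.0) lie exactly on the two parallel hyperplanes and play the role of support vectors.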
In fuzzy clustering, data elements can belong to more than one cluster, and cluster membership is based on a proximity test to each cluster. Associated with each element is a set of membership levels that indicate the strength of the association between that data element and the particular clusters of which it is a member. The process of fuzzy clustering involves assigning these membership levels and then using them to assign data elements to one or more clusters. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster.
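One way to compute such membership levels is the membership formula used in fuzzy c-means, sketched below. The text above does not name a specific fuzzy clustering algorithm, so the choice of fuzzy c-means, the fuzzifier value m = 2, and the sample centroids are assumptions made for illustration.

```python
from math import dist

def memberships(point, centroids, m=2.0):
    """Fuzzy c-means style membership of `point` in each cluster.

    The returned levels sum to 1; a smaller distance to a centroid
    gives a larger membership in that cluster. `m` (> 1) is the fuzzifier.
    """
    distances = [dist(point, c) for c in centroids]
    if any(d == 0.0 for d in distances):
        # The point coincides with one or more centroids: full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

# Hypothetical centroids of two clusters.
centroids = [(0.0, 0.0), (4.0, 0.0)]

near_center = memberships((0.0, 0.0), centroids)  # at the first centroid
near_edge = memberships((1.0, 0.0), centroids)    # between the two clusters
```

A point at the first centroid belongs fully to the first cluster, while the point at (1.0, 0.0) receives memberships of approximately 0.9 and 0.1, reflecting its weaker association with the first cluster.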
In categorical classification methods based on decision tree variants, the classification is based on the likelihood of the data point coming from any of the clusters based on the sharing of attribute values. Using a decision tree model, observations about an item are mapped to conclusions about its target cluster. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications.
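The tree structure described above can be sketched with nested dictionaries, where inner nodes test one attribute, branches correspond to attribute values, and leaves hold the target cluster. The attributes, values, and cluster names below are hypothetical, and the tree is written by hand rather than learned from data.

```python
# Inner nodes are dicts that test one attribute; leaves are cluster labels.
tree = {
    "attribute": "color",
    "branches": {
        "red": {
            "attribute": "shape",
            "branches": {"round": "cluster_1", "square": "cluster_2"},
        },
        "blue": "cluster_3",
    },
}

def classify(node, item):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][item[node["attribute"]]]
    return node

# Observations about an item are mapped to a conclusion about its cluster.
result = classify(tree, {"color": "red", "shape": "square"})
```

The path taken for this item is the conjunction "color is red and shape is square", which leads to the leaf "cluster_2".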
Classification using proximity to either centroids of clusters or support vectors is generally inadequate to properly classify data points. To provide for more accurate classification, the shape of the cluster should be taken into account. FIG. 1, illustrating an exemplary clustering of a dataset, demonstrates this problem. The points along the direction of the cluster indicated by W should be more likely to be classified as belonging to this cluster than the set of points indicated by X that are the same distance from the centroid as the points indicated by W. Points lateral to the cluster should be less likely to belong to the cluster than the points at the top edge, even when they have the same proximity to the centroid or support vectors of this cluster.
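The problem can be made concrete with a small numerical sketch. It compares plain Euclidean distance to the centroid with the Mahalanobis distance, a well-known measure that scales distance by the spread of the cluster in each direction. The Mahalanobis distance is used here only to illustrate why shape matters; it is not taken from the text above. The elongated cluster and the coordinates standing in for the points W and X are hypothetical.

```python
from math import dist, sqrt

def mean_and_covariance(points):
    """Centroid and population covariance terms (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, mean, cov):
    """Distance from `point` to `mean`, scaled by the cluster's covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # must be non-zero (cluster not degenerate)
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form with the inverse of the 2x2 matrix [[sxx, sxy], [sxy, syy]].
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(q)

# Hypothetical cluster elongated along the vertical axis.
cluster = [(0.0, -4.0), (0.0, -2.0), (0.0, 0.0), (0.0, 2.0), (0.0, 4.0),
           (-1.0, 0.0), (1.0, 0.0)]
mean, cov = mean_and_covariance(cluster)

W = (0.0, 3.0)  # along the direction of the cluster
X = (3.0, 0.0)  # lateral to the cluster

same_euclidean = dist(W, mean) == dist(X, mean)
w_score = mahalanobis(W, mean, cov)
x_score = mahalanobis(X, mean, cov)
```

Both points are the same Euclidean distance from the centroid, so a centroid-based classifier cannot tell them apart, yet the shape-aware score for W is much smaller than the score for X.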