The present invention relates to a method and apparatus for clustering multidimensional data and, more particularly, to a method and apparatus for clustering multidimensional data incorporating quantum mechanical techniques.
There is a growing emphasis on exploratory analysis of large datasets to discover useful patterns. Organizations are investing heavily in “data warehousing” to collect data in a form suitable for extensive analysis, and there has been extensive research on clustering.
Informatics is the study and application of computer and statistical techniques for the management of information.
Bioinformatics includes the development of methods to search biological databases quickly and efficiently, to analyze the information, and to predict structures which appear to exist within the data. Increasingly, molecular biology is shifting from the laboratory bench to the computer desktop. Advanced quantitative analyses and computational algorithms are needed to explore the relationships between data entries, and thereby to recognize and classify the database fully or partially.
Numerous databases in general, and biological databases in particular, include large sequences of data which need to be recognized, classified, and/or grouped into families. In the past, such information could only assist human experts, who would thoroughly research the output of database searching programs and create a grouping according to families. This approach is time-consuming, labor-intensive and not very reproducible. Nevertheless, because the diversity within families varies and families are not always exactly defined, the task of automated data grouping is not at all trivial.
Given a very large set of multi-dimensional data points, the data space is usually not uniformly occupied by the data points. Instead, some regions in the space are sparse while others are crowded. A clustering method identifies the sparse and the crowded regions, and discovers the overall distribution patterns of the dataset. Therefore, by using clustering methods, a better understanding can be obtained of the distribution patterns of the dataset and of the relationship patterns among data attributes, which improves data organization and retrieval. It is also possible to visualize the derived clusters much more efficiently and effectively than the original dataset. Indeed, when the dataset is very large and its dimensionality is higher than two, visualizing the whole dataset in full dimensions is almost impossible.
Numerical taxonomy relates to classification methods using numerical characteristics of individuals and populations. Over the years, numerical taxonomy methods have been developed using abstract objects which are not tied to any particular context, but rather can be applied to various data types. Known prior art clustering methods, which divide the data according to the natural classes present in it, have been used in a large variety of scientific disciplines and engineering applications. These include pattern recognition, learning theory, astrophysics, medical image and data processing, image compression, satellite data analysis, automatic target recognition, speech and text recognition, classification of diseases in medicine, grouping of chemical compounds such as nucleic acids and proteins, classification of statistical findings for social studies, and other types of data analysis.
Many clustering methods are known in the art. The methods are based on a variety of mathematical and/or physical principles. In graph theory methods, each data entry in the database is represented as a vertex on a graph, and similarity measures between data entries are represented as weighted edges between vertices. Clusters are formed by constructing a minimal spanning tree of the graph and by iteratively deleting edges.
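The graph theory approach described above can be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions, not the procedure of any particular prior art reference: the function name `mst_clusters` is hypothetical, Euclidean distance is assumed as the edge weight (a dissimilarity), Kruskal's algorithm is assumed for building the minimal spanning tree, and the deletion rule is assumed to be removal of the heaviest tree edges until a requested number of clusters remains.

```python
import math
from itertools import combinations


def mst_clusters(points, n_clusters):
    """Cluster points by deleting the heaviest edges of a minimal spanning tree.

    Assumptions for illustration: Euclidean distance is the edge weight and
    the number of clusters is supplied by the caller.
    """
    n = len(points)
    # Every pair of vertices is joined by a weighted edge.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal's algorithm: accept edges in order of increasing weight
    # whenever they join two different components.
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # Delete the (n_clusters - 1) heaviest tree edges; mst is sorted ascending.
    kept = mst[: n - n_clusters]

    # The connected components of the remaining forest are the clusters.
    parent[:] = range(n)
    for _, i, j in kept:
        parent[find(i)] = find(j)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_clusters(points, 2))
```

Note that this sketch still requires the number of clusters to be chosen in advance.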
In density estimation methods, the entire database is represented as points in a space which is defined by the characteristics of the data entries. If the data is not completely random, different regions in the data space have different densities of points. Clusters of data are viewed as high-density regions separated by low-density regions. An example of a density estimation method is the so-called scale-space clustering disclosed in an article authored by Roberts S. J., entitled “Parametric and non-parametric unsupervised cluster analysis”, and published in Pattern Recognition, 30(2):261-272 (1997). In this method, the probability density function is estimated using a set of Gaussian kernels, one sited at each data point. The clusters are located near maxima of the density function or near zero-crossings of its spatial derivative.
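The idea of estimating a density with Gaussian kernels and locating clusters near its maxima can be sketched in one dimension as follows. This is a simplified illustration and not the scale-space method of the cited article: the function names, the kernel width `sigma`, the evaluation grid and its `step`, and the rule of assigning each point to the nearest maximum are all assumptions made here for brevity.

```python
import math


def gaussian_kde(data, sigma):
    """Return a density estimate built from one Gaussian kernel per data point."""
    norm = len(data) * sigma * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-((x - d) ** 2) / (2 * sigma ** 2)) for d in data) / norm

    return density


def density_clusters(data, sigma, step=0.05):
    """Locate maxima of the estimated density on a grid and label the points.

    Assumptions for illustration: one-dimensional data, a fixed kernel width,
    and assignment of each point to the nearest density maximum.
    """
    density = gaussian_kde(data, sigma)
    lo, hi = min(data) - 3 * sigma, max(data) + 3 * sigma
    n = int((hi - lo) / step) + 1
    grid = [lo + i * step for i in range(n)]
    values = [density(x) for x in grid]
    # A grid point is a maximum if the density rises into it and does not
    # rise after it (equivalently, the discrete derivative changes sign).
    peaks = [
        grid[i]
        for i in range(1, n - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    ]
    labels = [min(range(len(peaks)), key=lambda k: abs(x - peaks[k])) for x in data]
    return peaks, labels


peaks, labels = density_clusters([0.8, 1.0, 1.2, 4.9, 5.0, 5.1], sigma=0.5)
print(peaks, labels)
```

The number of maxima found depends on the chosen kernel width: a wider kernel smooths neighboring maxima together, while a narrower one may place a maximum at nearly every data point.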
Another clustering method employs the laws of physics in order to identify clusters in a database. An example is disclosed by Blat et al. in U.S. Pat. No. 6,021,383. According to Blat et al., data points are associated with physical quantities called Potts-spins. Ferromagnetic interactions are introduced between each pair of neighboring spins and the strength of these interactions decreases with increasing distance or dissimilarity between points.
The two main clustering approaches are called hierarchical and partitional. In hierarchical methods, the data are organized in a “nested” sequence of groups. Hierarchical clustering is a procedure which iteratively adjusts the number of clusters by either merging small clusters or splitting large clusters of data points. Different hierarchical methods employ different decision rules for merging or splitting clusters. The end result of a hierarchical method is a tree of clusters called a dendrogram, which shows the relation between the final clusters. Before completing the analysis, a decision has to be made about an optimal position to cut the dendrogram in order to retrieve the number of clusters existing in the data.
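The merging variant of hierarchical clustering, and the subsequent cutting of the dendrogram, can be sketched as follows. This is a deliberately naive illustration: the function names are hypothetical, the decision rule is assumed to be single linkage (the two clusters whose closest members are nearest are merged), and the dendrogram is represented simply as the ordered list of merges.

```python
import math


def agglomerate(points):
    """Merge clusters pairwise until one remains; return the merge history.

    Each entry is (cluster_a, cluster_b, distance). Original points carry the
    ids 0..n-1 and each merged cluster receives the next unused id.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


def cut(n_points, merges, n_clusters):
    """Cut the dendrogram by replaying merges until n_clusters remain."""
    clusters = {i: [i] for i in range(n_points)}
    next_id = n_points
    for a, b, _ in merges[: n_points - n_clusters]:
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = agglomerate(points)
print(merges)
print(cut(len(points), merges, 3))
```

In this sketch the position of the cut is passed in as `n_clusters`, which reflects the decision that has to be made before the analysis is complete. Once two points have been merged, replaying the history can never separate them again.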
Hierarchical methods have been successfully applied to many biological problems, e.g., for producing taxonomies of animals and plants. However, hierarchical methods have a rather high computational complexity, which grows as the cube of the total number of objects being clustered. Moreover, hierarchical methods are not suitable for all kinds of databases, as the basic feature of any hierarchical method is to impose a hierarchy on the data, while such a property of the data may not exist at all. An additional drawback of hierarchical methods is that once two objects are merged, these objects always belong to one cluster, and once two objects are separated, these objects are never re-grouped into the same cluster. Thus, in hierarchical methods motion within the data space is limited. Still another drawback of hierarchical methods is a tendency to cluster together individuals linked by a series of intermediates. This property, generally known as chaining, often gives poor results in cases where noisy data points are present.
Unlike hierarchical methods, partitional clustering methods attempt to directly decompose the data set into a set of disjoint clusters. These methods minimize some local or global criterion function that may emphasize the structure of the data. Very often, the members of a cluster found by a partitional method are more similar to one another than the members of a cluster found by a hierarchical method, hence partitional clustering provides higher-quality results. Most of the partitional methods rely, implicitly or explicitly, upon some assumptions. However, as in hierarchical methods, the data may not conform to these assumptions and an incorrect structure of the data may be obtained. Another difficulty, also encountered in hierarchical methods, is the necessity to estimate an optimal number of clusters before completing the analysis.
An example of a partitional method is the so-called K-means algorithm. By a successive sequence of iterations, the K-means algorithm aims to minimize some criterion, which is typically the sum of the squares of the distances from all the data points to their nearest cluster centers. The main advantage of the K-means algorithm is the low complexity which is achieved once the number of clusters is determined. However, when clustering data using the K-means algorithm, the number of clusters must be determined a priori, and this choice sometimes affects the quality of the results. The K-means algorithm intrinsically assumes a spherical shape for all the clusters, which of course may not be correct. Like many other iterative procedures, not necessarily related to clustering methods, the K-means algorithm may become trapped in a local minimum and may not converge to the desired global minimum. Although several procedures have been employed to try to overcome the local minima problem, so far none guarantees finding the global minimum.
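The iteration described above can be sketched as follows, using the common alternating scheme of assigning each point to its nearest center and then moving each center to the mean of its assigned points. The function name `kmeans`, the `max_iter` limit, and the requirement that the caller supply the initial centers are assumptions made for this illustration.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps from the given initial centers.

    Returns (labels, centers, sse), where sse is the criterion: the sum of
    squared distances from each point to its nearest cluster center.
    """
    centers = [tuple(c) for c in centers]

    def assign(current):
        # Assignment step: index of the nearest center for every point.
        return [
            min(range(len(current)), key=lambda k: math.dist(p, current[k]))
            for p in points
        ]

    for _ in range(max_iter):
        labels = assign(centers)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break  # no center moved, so the iteration has converged
        centers = new_centers

    labels = assign(centers)
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, sse = kmeans(points, centers=[(0, 0), (10, 10)])
print(labels, centers, sse)
```

Both the number of centers and their starting positions are inputs to this sketch. A different set of starting positions may converge to a different partition with a larger value of the criterion, which is the local minimum behavior noted above.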
Hence, all the known clustering methods detailed above suffer from one or more limitations, which may commonly be attributed to assumptions and decisions made in advance, such as a predetermined structure of the data, even though it may be erroneous, and a predetermined number of clusters, which may affect the quality of the results.
The present invention provides solutions to the problems associated with prior art clustering techniques.