The volume and variety of data produced by biological science and theoretical chemistry are vast. Fields such as protein conformation; chemical and protein structure and activity; genomic sequences, gene expression, and phenotype; and population and disease incidence and prevalence yield large amounts of interrelated data that must be organized and interpreted to be useful.
A variety of methods have been designed to “cluster”, or organize, large amounts of technical data, including that relating to three-dimensional molecular forms.
Algorithms are used to perform the enormous number of decision-making steps required to organize shape data. Examples include: “DBSCAN”, proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96): 226-231, 1996; “OPTICS”, proposed by M. Ankerst et al. in Proc. ACM SIGMOD'99 Int. Conf. on Management of Data, Philadelphia, Pa., 1999; “K-means”, proposed by J. A. Hartigan et al. in “A K-Means Clustering Algorithm”, Applied Statistics 28 (1): 100-108, 1979; “K-medoid”, referenced at http://en.wikipedia.org/wiki/K-medoids; “FLAME”, proposed by L. Fu et al. in BMC Bioinformatics 8:3, 2007; “G_cluster/gromos”, proposed by Daura et al. in Angew. Chem. Int. Ed. 38, pp. 236-240, 1999; “DCBOR”, proposed by A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey, and M. A. Ramadan in Proceedings of World Academy of Science, Engineering and Technology, Vol. 35, November 2008; “DENCLUE”, proposed by Alexander Hinneburg and Daniel A. Keim in “An Efficient Approach to Clustering in Large Multimedia Databases with Noise”, Institute of Computer Science, University of Halle, Germany, 1998; “SUBCLU”, proposed by Karin Kailing, Hans-Peter Kriegel, and Peer Kröger in Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004; and “CLIQUE”, proposed by Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan of IBM Almaden Research Center, 1998.
These methods have various limitations, such as being unable to cluster data with varying densities across a volume, yielding inconsistent results depending on the input order, producing clusters based on shape rather than density, or being confined to smaller datasets. Of these, FLAME and DENCLUE share the advantages of requiring only one parameter and of defining arbitrarily shaped clusters.
The FLAME algorithm starts by building a neighborhood graph that connects each data point to its K nearest neighbors. It then estimates a density for each data point based on its proximity to those neighbors. Any data point with a density higher than that of all its neighbors becomes the center of its own cluster and is assigned full membership in that cluster. The remaining data points are initially assigned equal membership weights in all of the clusters so defined, and the membership weights of each point are then updated as a linear combination of the membership weights of its neighbors. This process is iterated to convergence, whereupon each data point is assigned to the cluster in which it has the highest membership. FLAME needs many iterations and is therefore inefficient and time-consuming.
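The FLAME steps described above can be illustrated with a minimal sketch in pure Python. It is not the published implementation: the function name `flame_sketch`, the density estimate (the inverse of the mean distance to the K nearest neighbors), the equal weighting of neighbors in the update, the choice to keep cluster centers fixed during iteration, and the fixed iteration count in place of a convergence test are all assumptions made for illustration. Outlier handling is omitted.

```python
import math

def flame_sketch(points, k=3, iterations=100):
    """Illustrative FLAME-style clustering; returns (labels, seed_indices)."""
    n = len(points)
    neighbors, density = [], []
    for i in range(n):
        # K nearest neighbors of point i by Euclidean distance.
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        neighbors.append([j for _, j in nearest])
        total = sum(d for d, _ in nearest)
        # Assumed density estimate: inverse of the mean neighbor distance.
        density.append(len(nearest) / total if total > 0 else float("inf"))

    # A point denser than all of its neighbors seeds its own cluster.
    seeds = [i for i in range(n)
             if all(density[i] > density[j] for j in neighbors[i])]
    if not seeds:
        raise ValueError("no density peak found; try a different k")
    c = len(seeds)
    seed_set = set(seeds)

    # Seeds get full membership in their own cluster; all others start equal.
    member = [[1.0 / c] * c for _ in range(n)]
    for col, i in enumerate(seeds):
        member[i] = [1.0 if m == col else 0.0 for m in range(c)]

    for _ in range(iterations):
        updated = []
        for i in range(n):
            if i in seed_set:
                updated.append(member[i])  # assumption: seeds stay fixed
            else:
                # Linear combination (here a plain average) of the neighbors.
                updated.append([
                    sum(member[j][m] for j in neighbors[i]) / len(neighbors[i])
                    for m in range(c)
                ])
        member = updated

    # Each point joins the cluster in which its membership is highest.
    labels = [max(range(c), key=lambda m: member[i][m]) for i in range(n)]
    return labels, seeds

# Two well-separated groups, each a unit square with a denser center point.
group_a = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
group_b = [(x + 10, y + 10) for x, y in group_a]
labels, seeds = flame_sketch(group_a + group_b, k=3)
```

The repeated averaging in the inner loop is the part that makes the method slow on large datasets, since every pass touches every point and every neighbor.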
In DENCLUE, each data point is assigned an influence function that describes the effect of that point on its surroundings. A typical influence function is a Gaussian distribution centered on the point. The algorithm sums the influence functions of all points and then finds the local maxima of the resulting hypersurface. Cluster centers are located at these maxima. The cluster to which each point belongs is found via a steepest ascent procedure on this hypersurface. An efficient implementation of this algorithm is very complex.
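The DENCLUE procedure described above can be illustrated with a deliberately naive sketch in pure Python. The names `gaussian_density`, `climb`, and `denclue_sketch`, the parameters `sigma` and `merge_tol`, and the fixed number of ascent steps are assumptions made for illustration. The ascent step follows the gradient direction but scales the step by the local density, which makes each step a weighted mean of the data points; this is a simplification rather than the fixed-step procedure of the original paper. The sketch evaluates every point at every step and uses none of the grid-based acceleration that an efficient implementation would need.

```python
import math

def gaussian_density(x, points, sigma):
    """Sum of Gaussian influence functions of all points, evaluated at x."""
    return sum(
        math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in points
    )

def climb(x, points, sigma, steps=200):
    """Follow the density gradient uphill from x toward a local maximum."""
    x = list(x)
    for _ in range(steps):
        weights = [
            math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in points
        ]
        total = sum(weights)
        # The gradient is sum(w_i * (p_i - x)) / sigma**2. Stepping by
        # sigma**2 / total along it moves x to the weighted mean of the points.
        x = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        ]
    return x

def denclue_sketch(points, sigma=1.0, merge_tol=0.1):
    """Label each point by the density maximum that its ascent reaches."""
    attractors, labels = [], []
    for p in points:
        a = climb(p, points, sigma)
        for idx, known in enumerate(attractors):
            if math.dist(a, known) < merge_tol:
                labels.append(idx)  # same maximum, same cluster
                break
        else:
            attractors.append(a)  # a maximum not seen before
            labels.append(len(attractors) - 1)
    return labels, attractors

# Two well-separated groups, each a unit square with a center point.
group_a = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
group_b = [(x + 10, y + 10) for x, y in group_a]
data = group_a + group_b
labels, attractors = denclue_sketch(data, sigma=1.0)
```

In this sketch `sigma` plays the role of the single required parameter: a larger value smooths the density surface and merges nearby maxima, while a smaller value splits the data into more clusters.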
U.S. Pat. No. 6,226,408 by Sirosh discloses an unsupervised learning routine for use in analyzing credit card transactions for fraudulent activity. A number of data types are converted to numerical values and grouped accordingly.