1. Field of the Invention
The present invention relates generally to the field of data processing and, more particularly, to a method and system for construction of similarity matrices for data sets in high-dimensional space of attributes with the purpose of data clustering on a dimensionless basis.
2. Description of the Related Art
Almost any computer application involves, to some extent, a procedure for establishing similarity-dissimilarity relationships. Such procedures are especially important in clustering, whose purpose is the grouping of data in accordance with proximities between data points. A large and constantly growing variety of academic disciplines and practical applications in which clustering methods play a definitive role call for increasing attention to the key component of such methods: adequate accuracy in establishing similarities and dissimilarities. In particular, an important yet not fully solved problem arises when analyzed objects are characterized in a high-dimensional data space but certain variables are not highly correlated. The search for a compact index representation of multi-dimensional raw data is the subject of numerous patents and other publications (cf., for example: Aggarwal, et al. U.S. Pat. No. 6,505,207; Aggarwal, et al. U.S. Pat. No. 6,307,965 B1; Ravi, K. V., et al. Dimensionality reduction for similarity searching in dynamic databases. Proceedings of the ACM SIGMOD Conference. 1998; Ostrovsky E. U.S. Pat. No. 5,970,421; Martin, et al. U.S. Pat. No. 6,260,038 B1; Boyack, et al. U.S. Pat. No. 6,389,418; Castelli, et al. U.S. Pat. No. 6,122,628). Nevertheless, none of the existing approaches can be regarded as generally recognized and universal. The methods proposed heretofore, as a rule, lead to oversimplification and approximation, involve a multitude of stages and are therefore computationally expensive, and cannot be used in a fully automated unsupervised mode, which is an extremely important requirement for most modern computer applications.
Lately, it has also become clear that mathematical statistics methods, with all of their capabilities and versatility, can no longer be considered as the basis for development of routine technology for establishing similarities and dissimilarities in unsupervised mode. This is especially true for those cases when there is no a priori knowledge about the data structure.
Some of the approaches applied in many widely used applications for establishing similarity-dissimilarity of objects described in a high-dimensional space of attributes clearly represent a forced solution adopted for lack of proper techniques and are simply nonsensical. For instance, there is the widely known notion of the “curse of dimensionality”, which refers to a dramatic dependency of parameterization of distances between attributes on their dimensionality (Bellman, R. 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press. Cf. also: Clarkson, K. An algorithm for approximate closest-point queries. In: Proceedings of the Tenth Annual ACM Symposium on Computational Geometry. 1994, pp. 160–164). Understandably, this dependency increases catastrophically in a super-space, so that the most that can be done about similarities-dissimilarities is to standardize the conditions for comparison of similarities on the presumption that “objects in a set have otherwise equal status”, which by definition cannot be considered an acceptable methodological platform.
Attributes that are used for description of sets of objects and constitute an important part of a database are usually classified as categorical, binary, or real numeric (continuous) data. This kind of typification may be rather important from the viewpoint of conventional data clustering techniques, as some of them perform well only when all of the data points of a data set contain the same type of attributes. Meanwhile, in practice, far more important distinctions between the ways attributes should be evaluated often escape attention. For instance, in some systems, proportional changes in values of variables do not change the shapes of sets of points, or change them in such a way that the core system code remains unchanged, as, for example, in descriptions of a human shape in various positions. In such cases, metrics such as the Euclidean distance or the city block metric can be readily applied. However, comparison of incomes based on distances is never unambiguous. For example, the difference between annual incomes of $15,000 and $35,000 is the same as that between $55,000 and $75,000, and from the viewpoint of, for instance, financing institutions, evaluation of a borrower's financial status (i.e. his financial “shape”) based on his annual income is quite logical and acceptable. However, from the point of view of human logic and reality or, more precisely, from the viewpoint of financial survival power, the annual earning amounts of $15,000, $35,000, and $55,000–75,000 clearly represent three substantially different levels, which is so obvious that it does not require further explanation. In other words, the distance between the income of $15,000 and that of $35,000 from the viewpoint of “power” is by far greater than the difference between $55,000 and $75,000.
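The income example can be checked with a few lines of arithmetic. The sketch below compares the two pairs of incomes by absolute difference and, as an assumed stand-in for the “power” viewpoint (the text does not define a formula for it), by the ratio of the larger income to the smaller one. The function names are illustrative only.

```python
def difference(a, b):
    """Distance-style comparison: absolute difference in dollars."""
    return abs(a - b)


def ratio(a, b):
    """Assumed 'power'-style comparison: larger value divided by the smaller."""
    return max(a, b) / min(a, b)


# The two income pairs discussed in the text.
pairs = [(15_000, 35_000), (55_000, 75_000)]
for low, high in pairs:
    print(low, high, difference(low, high), round(ratio(low, high), 2))
```

Both pairs differ by $20,000, whereas the ratios are about 2.33 and 1.36, so a distance-based comparison treats as identical two situations that a ratio-based comparison separates.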
The use of distances may be illogical, for example, in establishing the dissimilarities between concentrations of fatty acids in bacterial membranes (Andreev et al. In: The Staphylococci. J. Jeljaszewicz (Ed.) Gustav Fischer Verlag, New York, pp. 151–155, 1985), or between fractions of population pyramids, which represent slowly changing pseudo-equilibrium systems even though they may often have a distinct, country-specific shape. There are plenty of other examples to the same effect. For instance, in climate studies, while comparison of mean temperatures based on distances obviously makes sense, data on relative humidity compared with the use of Euclidean distances simply do not correlate with the known concepts of the physics of the atmosphere. Put simply, such things as shape and power cannot be evaluated based on the same criteria.
The traditional approach to establishing similarities of objects in a high-dimensional space of attributes that are heterogeneous in their physical and physico-chemical nature has its roots in the so-called numerical taxonomy dating back to the 1960s. At that time, instrumental methods of physico-chemical analysis, such as chromatography and optical and mass spectroscopy, were being actively developed, and great progress was made in biochemical analysis and enzymatic testing of biological objects, particularly microorganisms. Researchers in systematics and identification of microorganisms suddenly found themselves having access to huge databases, and that circumstance was the major factor that boosted the transformation of methods of numeric analysis into a widely recognized methodology, thus “promoting” biology from the level of a descriptive science, or almost one of the humanities, to a level close to that of the exact sciences. As a thorough study of the proper use of each individual feature among the multitudes of features that became available for taxonomic research demanded serious effort and was time-consuming, it is quite understandable, though unfortunate, that the principle of numerical taxonomy, which allowed for fast and easy computerized utilization of a huge stream of information, prevailed.
The concept of numerical taxonomy is simple and is based on establishing characters that are common to both objects under comparison, on the one hand, and characters that are inherent in only one of the objects in a pair under comparison, on the other (cf. e.g. Sneath, P. H. A., Sokal, R. R. 1962. Numerical Taxonomy. Nature 193: 853–860; Sneath, P. H. A., Sokal, R. R. 1973. Numerical Taxonomy. W. H. Freeman and Company. San Francisco, Calif.). The above idea, regardless of the type of mathematical formalization applied, is all that is put into the foundation of establishing similarity criteria in numerical taxonomy. Not only is the said approach naive, but an even more naive idea is used to support it: using hundreds and thousands of attributes in the hope that by the end of the day something may emerge that should correlate with the “natural interrelationship” between organisms under study. This a priori unscientific idea of quantitative comparison between qualitatively different attributes has been accepted and approved by the research practice in numerical taxonomy.
Coming back to the clustering of objects in a high-dimensional space: the concept of applying Euclidean distances to an n-dimensional space has been a result of the influence of the above discussed principles of “taxonomy in bulk”, being, in fact, a victory over common sense, as it is simply unscientific to try to determine proximities between objects' features when they are expressed in different units of measurement. As an example, we will mention one of the commercially available computer programs for clustering, Clustan (Clustan Limited, UK), a guidebook to which contains an example of clustering of 25 species of mammals based on contents of ash, lactose, fat, protein, and water in the milk of the respective mammals (Wishart, D. 1999. ClustanGraphics Primer: A Guide to Cluster Analysis. Edinburgh, Scotland). Even if the use of Euclidean distances between ash and water in the milk of mammals may accidentally (or not accidentally) result in a clustering wherein objects are attributed in accordance with the currently accepted scientific views, common sense cannot settle for such an approach, and the aforementioned example cannot be used for verification of the validity of a given clustering technique.
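The objection to mixing units can be illustrated with a minimal sketch. The three objects and their values below are hypothetical and are not taken from the Clustan guidebook; the helper name `nearest` is likewise an assumption. The sketch only shows that the Euclidean nearest neighbor of an object may change when one attribute is re-expressed in a different unit.

```python
import math


def nearest(target, others):
    """Return the name of the point in `others` closest to `target` (Euclidean)."""
    return min(others, key=lambda name: math.dist(target, others[name]))


# Hypothetical objects described by (height, weight in kg); illustrative values only.
a_m = (1.60, 60.0)                                  # height in metres
others_m = {"B": (1.90, 61.0), "C": (1.61, 64.0)}

# The same objects with height converted from metres to centimetres.
a_cm = (a_m[0] * 100, a_m[1])
others_cm = {name: (h * 100, w) for name, (h, w) in others_m.items()}

nearest_m = nearest(a_m, others_m)      # weight dominates the distance
nearest_cm = nearest(a_cm, others_cm)   # height dominates the distance
print(nearest_m, nearest_cm)
```

With heights in metres the nearest neighbor of the first object is B; with heights in centimetres it is C, although nothing about the objects themselves has changed.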
The purpose of the present invention is the development of a universal system for computation of similarity-dissimilarity to provide for efficient clustering of objects described in a high-dimensional data space. The proposed method works most effectively in cooperation with the invention specified in patent application Ser. No. 09/655,519 by Leonid Andreev, now pending. The above-referenced method for evolutionary transformation of similarity matrices (ETSM) has a number of features that make it related to the so-called “neural network” technology; however, ETSM has an important advantage over other clustering methods: by employing standard operations, it provides data systems with an opportunity for “self-expression” or self-evolution in accordance with their original complexity, resulting in hierarchical clustering wherein the number of clusters and the interrelations between them are not determined by an operator or program developer but are revealed by the data system itself. That allows for highly accurate and independent verification of the techniques used for preparing the data for clustering and gives feedback on whether or not a given technique is appropriate. Therefore, in the context of the present invention, we will discuss the basics of the ETSM method.
The method for evolutionary transformation of similarity matrices consists in processing each cell of a similarity matrix in one and the same fashion, so that the similarity coefficient between each pair of objects in a data set is replaced by a ratio of the similarity coefficients between each object in the pair and the rest of the objects. The algorithm for such transformation is applied repeatedly to a similarity matrix until each of the similarities between objects within each of the clusters reaches 1 (or 100%) and no longer changes. In the end, the process of successive transformations results in convergent evolution of a similarity matrix. First, the least different objects are grouped into sub-clusters; then, major sub-clusters are merged as necessary; and, finally, all objects appear to be distributed among the two main sub-clusters, which automatically ends the process. Similarities between objects within each of the main sub-clusters equal 1 (or 100%), and similarities between objects of different sub-clusters equal a constant value which is less than 1 (or less than 100%). The entire process of transformation may occur in such a way that while similarities within one sub-cluster reach the value of 1 (or 100%) and stop transforming, another sub-cluster still continues undergoing the convergent changes and takes a considerable number of further transformations (in which the objects of the first sub-cluster are no longer involved). Only after the convergent transformation of the second sub-cluster is complete, i.e. when similarities between its objects reach 1 (or 100%) and similarities between objects of the two sub-clusters are less than 1 (or 100%), is the entire process of evolutionary transformation of a similarity matrix over. In the described process, there is no alternative to the sub-division of all objects of a data set into two distinctive sub-clusters.
Any object that may represent a “noise point” for any of the major groups of objects in a data set of any degree of dimensionality gets allocated to one of the sub-clusters.
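The convergent behavior described above can be sketched in code under a stated assumption. The text does not give the transformation formula, so the sketch below substitutes one plausible reading: for each pair (i, j), the new similarity is the average, over all objects k, of the ratio min/max of the similarities of i and j to k. This formula, the function names `transform` and `evolve`, the tolerance, and the 4-object matrix are all assumptions for illustration and should not be read as the method of the referenced application. All similarities must be positive.

```python
def transform(sim):
    """One assumed step: for every pair (i, j), average over all objects k
    the ratio min/max of the similarities of i and j to k."""
    n = len(sim)
    return [
        [
            sum(min(sim[i][k], sim[j][k]) / max(sim[i][k], sim[j][k])
                for k in range(n)) / n
            for j in range(n)
        ]
        for i in range(n)
    ]


def evolve(sim, tol=1e-12, max_steps=1000):
    """Repeat the step until no cell changes by more than `tol`."""
    n = len(sim)
    for _ in range(max_steps):
        new = transform(sim)
        change = max(abs(new[i][j] - sim[i][j])
                     for i in range(n) for j in range(n))
        sim = new
        if change < tol:
            break
    return sim


# Hypothetical similarity matrix: positive entries, 1 on the diagonal.
start = [
    [1.00, 0.90, 0.20, 0.30],
    [0.90, 1.00, 0.25, 0.20],
    [0.20, 0.25, 1.00, 0.80],
    [0.30, 0.20, 0.80, 1.00],
]
final = evolve(start)
for row in final:
    print([round(x, 3) for x in row])
```

For this input the similarities within the pairs (0, 1) and (2, 3) approach 1, while the similarities between the two pairs settle at a common value below 1, which matches the two-sub-cluster end state described in the text. Other inputs are not guaranteed to behave the same way under this assumed formula.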
Conversely, the above described convergent evolution may also be represented as divergent evolution and reflected in the form of a hierarchical tree. However, the mechanism of the algorithm for evolutionary transformation involves the most organic combination of the convergent and divergent evolution (or deduction and induction based on input information about objects under analysis). For that purpose, each of the sub-clusters formed upon completion of the first cycle of transformation is individually subjected to transformation, which results in its division into two further sub-clusters, as described above; then, each of the newly formed four sub-clusters undergoes a new transformation, and so on. This process, referred to as ‘transformation-division-transformation’ (or TDT), provides for the most rational combination of the convergent (transformation) and divergent (division) forms of the evolution process, as a result of which an entire database undergoes multiple processing through a number of processes going in opposite directions. The said combination of processes is not regulated and is fully automated and unsupervised; it depends on and is determined only by the properties of the target similarity matrix under analysis, i.e. by the input data and the applied technique of computation of similarity-dissimilarity matrices.
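The transformation-division-transformation cycle can be sketched as a recursion. As before, the transformation step is an assumed min/max-ratio average, not a formula given in the text. Two further assumptions are made: each sub-cluster is re-processed starting from its original (untransformed) similarities, and sub-clusters of two or fewer objects are treated as leaves. The names `evolve` and `tdt`, the threshold, and the example matrix are hypothetical.

```python
def evolve(sim, steps=200):
    """Apply the assumed min/max-ratio transformation a fixed number of times.
    All similarities must be positive."""
    n = len(sim)
    for _ in range(steps):
        sim = [[sum(min(sim[i][k], sim[j][k]) / max(sim[i][k], sim[j][k])
                    for k in range(n)) / n
                for j in range(n)] for i in range(n)]
    return sim


def tdt(sim, labels, threshold=0.999):
    """Transformation-division-transformation; returns a nested list (tree)."""
    if len(labels) <= 2:           # too small to divide further in this sketch
        return list(labels)
    final = evolve(sim)            # transformation (convergent phase)
    # Division: objects whose evolved similarity to the first object is about 1
    # form one sub-cluster; all remaining objects form the other.
    left = [i for i in range(len(labels)) if final[0][i] >= threshold]
    right = [i for i in range(len(labels)) if final[0][i] < threshold]
    if not right:                  # no division emerged; treat as a leaf
        return list(labels)

    def sub(idx):
        # Assumption: restart from the ORIGINAL similarities of the members.
        block = [[sim[i][j] for j in idx] for i in idx]
        return tdt(block, [labels[i] for i in idx], threshold)

    return [sub(left), sub(right)]


# Hypothetical 4-object similarity matrix.
start = [
    [1.00, 0.90, 0.20, 0.30],
    [0.90, 1.00, 0.25, 0.20],
    [0.20, 0.25, 1.00, 0.80],
    [0.30, 0.20, 0.80, 1.00],
]
tree = tdt(start, ["a", "b", "c", "d"])
print(tree)
```

For this small input the recursion produces one division, into the sub-clusters a, b and c, d. No operator-supplied number of clusters is involved; the depth of the tree depends only on the input matrix and the assumed transformation.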
In other words, the ETSM algorithm is based on uncompromising logic that cannot be manipulated by arbitrarily introduced commands. As a result, the efficiency of the ETSM method greatly depends on how adequate and scientifically well-grounded the techniques used in presentation of input data (i.e. computation of similarity matrices) are. At the same time, such sensitivity of the ETSM method to the quality of input data makes it especially reliable as a criterion of suitability of methods for determination of similarities/dissimilarities.