This invention relates generally to multivariate statistical analysis and, more specifically, to clustering techniques used to analyze statistical data. Clustering is also known by the terms unsupervised learning and categorization. Prior art clustering techniques include the spanning tree method and the expectation-maximization method.
Clustering is based on the reasonable assumption that things having similar attributes also have similar measured characteristics. For example, biologists may categorize biological specimens based on their measured characteristics, which may be plotted in an n-dimensional grid. Specimens with similar measured characteristics falling into a statistical "cluster" of data points on the grid may be defined as belonging to a specific category of biological entity.
A similar analysis may be used to categorize stocks traded in a stock market. The measured characteristics may include share price, price-to-earnings ratio, price volatility, and so forth. When various stocks are sampled and their characteristics are plotted on an n-dimensional grid, categories of stocks emerge from the resulting clustering of patterns of data points. Such categories may be used to identify candidate stocks for purchase or sale.
Clustering techniques can be used in a variety of other fields, including signal analysis and identification, pattern recognition, geological resource exploration, marketing research, and identification of persons by analysis of fingerprints, voice patterns, retinal patterns, or some other form of biometric analysis. Clustering techniques encounter two key problems that are common to all of these applications: cluster proximity and cluster count. In particular, it is sometimes difficult to separate and distinguish clusters that are close together and may appear to overlap. Moreover, some measured data points may not fall clearly within cluster regions that have already been identified, raising the question of whether to ignore such points, to assign them to a selected existing cluster (thereby, perhaps, extending the boundaries of that cluster), or to define a new cluster. Some clustering techniques require advance knowledge of the number of clusters, which is often not known and is difficult to determine. Other clustering problems include inhomogeneity of cluster density or size, and clusters of unusual shapes, such as crescents or rings. Another practical difficulty inherent to available clustering techniques is that the processing time needed is proportional to the number of data points squared, cubed, or raised to some other power. For example, the spanning tree algorithm used for clustering has a processing time proportional to N^3, where N is the number of data points being analyzed. The spanning tree approach has the additional drawback that it cannot easily distinguish between categories that are too close together.
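The spanning-tree approach referred to above can be sketched as follows: build a minimum spanning tree over the pairwise distances between data points, then cut the longest edges so that the remaining connected components form the clusters. This is a minimal illustrative sketch only, not the method of any cited prior art reference; the sample points and the cluster count k are assumptions introduced here for demonstration, and this O(N^2) Prim's-algorithm variant is one of several possible realizations of the higher-power running times noted above.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete graph of pairwise distances.

    Returns the N-1 edges of a minimum spanning tree as
    (distance, endpoint_a, endpoint_b) tuples.
    """
    n = len(points)
    in_tree = [False] * n
    in_tree[0] = True
    # best[i] = (distance to nearest tree vertex, that vertex's index)
    best = [(math.dist(points[0], p), 0) for p in points]
    edges = []
    for _ in range(n - 1):
        # Pull the non-tree point closest to the tree.
        j = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best[i][0])
        edges.append((best[j][0], best[j][1], j))
        in_tree[j] = True
        # Relax distances through the newly added vertex.
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[j], points[i])
                if d < best[i][0]:
                    best[i] = (d, j)
    return edges

def spanning_tree_cluster(points, k):
    """Cut the k-1 longest MST edges; connected components become clusters.

    Returns a cluster label for each point (labels are representative
    point indices, not consecutive integers).
    """
    # Keep only the N-k shortest edges, discarding the k-1 longest.
    kept = sorted(mst_edges(points))[: len(points) - k]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for _, a, b in kept:
        parent[find(a)] = find(b)
    return [find(i) for i in range(len(points))]

# Two well-separated groups of sample points (illustrative data).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = spanning_tree_cluster(pts, 2)
```

Note that when the two groups are moved close together, the longest MST edge no longer necessarily lies between them, which is precisely the proximity weakness described above.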
Because clustering has such a diverse range of applications, there is clearly a need for a new approach to clustering that can handle larger numbers of data points and categories, including categories that may otherwise be statistically indistinguishable. The present invention is directed to this end.