1. Technical Field
This invention relates to systems and methods for clustering data and, more particularly, to a system for clustering data which adapts to particular characteristics in the data to be clustered.
2. Discussion
One of the most fundamental tasks of sensory processing is to cluster data into manageable and meaningful groupings. It is clear that humans perceive groupings, or "gestalts", in sensory input with a facility that far exceeds the capabilities of today's pattern recognition systems. For this reason, the ability to emulate the human perception of "gestalts" automatically would be highly desirable. For example, in the field of image processing, where fields of picture elements (pixels) are automatically scanned and enhanced, it is desirable to collect groups of pixels into individual objects in order to recognize certain features or textures within the field. Another application for clustering is in object tracking from sonar or radar range and Doppler data. In this setting, multiple range/Doppler cells are obtained for each object, and it is desirable to collect these multiple returns into groups corresponding to individual objects, rather than establishing a track for each cell. An additional application of clustering is taxonomy in biology, where measurements on a set of organisms are grouped in a way that reflects similarity based on the measurements.
In general, clustering can be applied to any collection of points of one or more dimensions where it is desired to break up the data field into meaningful and manageable segments. The clustered data may be useful by itself, or it may be used to simplify further signal processing and decision making. A common problem with previous approaches to clustering is that they are highly parametric; that is, certain key parameters must be provided in advance. These parameters may include, for example, the number of objects, the sizes of the objects, and the separation between the objects. This is a problem because parametric clusterers require more information about the data field than is usually available in practice. Because these parameters are data dependent, they are almost never known in advance.
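The dependence on a pre-supplied parameter can be illustrated with a minimal sketch. The routine below is not taken from any of the prior approaches discussed here; it is a hypothetical one-dimensional clusterer, written in Python, in which the function name `split_by_gap`, the parameter `min_separation`, and the sample `readings` are all assumptions made for illustration. It shows only that the same data yield different groupings depending on the separation value assumed in advance.

```python
def split_by_gap(values, min_separation):
    """Group 1-D values; start a new cluster wherever the gap between
    neighbouring sorted values exceeds min_separation (an assumed,
    user-supplied parameter)."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > min_separation:
            clusters.append([cur])      # gap too large: open a new cluster
        else:
            clusters[-1].append(cur)    # close enough: extend current cluster
    return clusters


# Hypothetical measurements; the "right" separation is not known beforehand.
readings = [1.0, 1.2, 1.4, 3.0, 3.1, 8.0]
tight = split_by_gap(readings, min_separation=1.0)
loose = split_by_gap(readings, min_separation=2.0)
print(len(tight), len(loose))
```

With these assumed values, the smaller separation yields three groups and the larger one yields two, even though the data are unchanged.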
In previous clustering systems, these clustering parameters are usually obtained by "training" the algorithms on data sets similar to the expected situation. This, in general, produces very sensitive algorithms that work well on the laboratory-generated training data, but perform poorly in environments other than the exact ones for which they were designed. There are several existing clustering algorithms, all of which require substantial prior information about the groupings. Some of these clustering approaches are described in the following references:
1) Duda, R., and Hart, P., Pattern Classification and Scene Analysis, John Wiley & Sons, N.Y., 1973.
2) Hartigan, J., Clustering Algorithms, John Wiley & Sons, N.Y., 1975.
3) Koontz, W., Narendra, P., and Fukunaga, K., "A Graph Theoretic Approach to Nonparametric Cluster Analysis", IEEE Transactions on Computers, Vol. C-25, No. 9, pp. 936-944, September 1976.
4) Gitman, I., and Levine, M., "An Algorithm for Detecting Unimodal Fuzzy Sets and Its Application as a Clustering Technique", IEEE Transactions on Computers, Vol. C-19, No. 7, pp. 583-593, July 1970.
5) Zahn, C., "Graph-Theoretic Methods for Detecting and Describing Gestalt Clusters", IEEE Transactions on Computers, Vol. C-20, No. 1, pp. 68-86, January 1971.
6) Friedman, H., and Rubin, J., "On Some Invariant Criteria for Grouping Data", American Statistical Association Journal, pp. 1159-1178, December 1967.
The algorithms treated in these references generally require that the number of clusters and the cluster distances be specified in advance. In most situations, however, this information is not known. The real-world problem of a mismatch between the assumed number of clusters and the actual number is not resolved by prior approaches. For example, many clustering techniques deal essentially with how to partition N objects into M groups. One of the basic approaches in the texts is to form a least squares fit of the data points to the pre-specified number of groupings (see reference Nos. 1 and 2, above). Other approaches include mode-seeking, valley-seeking, and unimodal set algorithms. The graph theoretic approach is treated in reference No. 3. It connects the data points in a tree structure based on certain parameters that must be specified in advance.
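The least squares approach mentioned above can be sketched as follows. This is a minimal illustration, assuming the familiar iterative scheme of alternating nearest-center assignment with center re-estimation (commonly called k-means); it is not reproduced from the cited texts. The function names, the seeding of the random initialization, and the sample points are assumptions. The point of the sketch is that the number of groups `k` must be passed in, and the routine returns exactly that many groups whether or not the data support it.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(group)
    return tuple(sum(coords) / n for coords in zip(*group))


def least_squares_partition(points, k, iters=50, seed=0):
    """Partition points into a pre-specified number k of groups by
    alternating nearest-center assignment and center re-estimation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # assumed initialization
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        new_centers = [mean(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:           # assignments have stabilized
            break
        centers = new_centers
    return centers, groups


# Hypothetical data containing two well-separated objects.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
_, two_groups = least_squares_partition(points, k=2)
_, three_groups = least_squares_partition(points, k=3)
print([len(g) for g in two_groups], [len(g) for g in three_groups])
```

When `k` matches the data, the two objects are recovered. When `k=3` is assumed instead, the routine still reports three groups, which is the mismatch problem described above.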
To avoid specifying parameters in advance, some clustering approaches have been developed with the ability to adapt to a "training" set of data. However, real-world data will never have exactly the same statistical distribution of points as the training set. Consequently, even approaches that are tuned to a training set fail in real-world situations.
Thus, it would be desirable to provide a clusterer that does not rely on predefined parameters. In particular, it would be desirable to provide a clusterer in which no parameters are specified in advance, but are instead extracted from the actual observed data. Further, it would be desirable to provide a clusterer that can adapt to the real-world data field, rather than to an artificial training set.