1. Field of Invention
The present invention relates generally to data processing systems, and more particularly to data processing systems for data mining, pattern recognition, and data classification using unsupervised identification of complex data patterns.
2. Background of the Invention
Statistical classification (or "learning") methods attempt to segregate bodies of data into classes or categories based on objective parameters present in the data itself. Generally, classification methods may be either supervised or unsupervised. In supervised classification, training data containing examples of known categories are presented to a learning mechanism, which learns one more sets of relationships that define each of the known classes. New data may then be applied to the learning mechanism, which then classifies the new data using the learned relationships. Examples of supervised classification systems include linear regression, and certain types of artificial neural networks, such as backpropagation and learning vector quantization networks.
Because supervised learning relies on the identification of known classes in the training data, supervised learning is not useful in exploratory data analysis, such as database mining, where the classes of the data are to be discovered. Unsupervised learning is used in these instances to discover these classes and the parameters or relationships that characterize them.
Unsupervised classification attempts to learn the classification based on similarities between the data items themselves, and without external specification or reinforcement of the classes. Unsupervised learning includes methods such as cluster analysis or vector quantization. Cluster analysis attempts to divide the data into "clusters" or groups that ideally should have members that are very similar to each other, and very dissimilar to members of other clusters. Similarity is then measured using some distance metric which measures the distance between data items, and clusters together data items that are closer to each other. Well-known clustering techniques include MacQueen's K-means algorithm, and Kohonen's Self-Organizing Map algorithm.
One of the as of yet unattained goals of unsupervised learning has been a general learning method that could be applied in multiple stages to discover increasingly complex structures in the input. A specific case of this problem is that of identifying, or separating, highly nonlinear subspaces of data, perhaps surrounded by noise.
The unsupervised identification of nonlinear subspaces of data is an important problem in pattern recognition and data mining. Many data analysis problems reduce to this identification problem; examples include feature extraction, hierarchical cluster analysis and transformation-invariant identification of patterns. Although work in this field dates back to the 1930s, good solutions exist only for cases where the data occur in approximately linear subspaces, or in cases where the data points group into well-separated, disjoint classes. Instead, if the data points lie along very nonlinear subspaces and is clouded by noise, conventional unsupervised methods fail. In particular, such nonlinear subspaces are more likely in multidimensional real-world data.
Accordingly, it is desirable to provide a system and method of unsupervised learning that is particularly suited to identifying non-linear clusters of data in highly dimensional data, and thus suited for data exploration, such as database mining, or feature extraction, and similar applications.