The present invention relates generally to pattern recognition and, more particularly, to a method for classification with neural networks, or their simulations, based on back propagation, but in a new mode where no external supervised training or teaching is employed.
The field of neural networks is directed to developing intelligent machines which use computational methods that are based on mechanisms which are presumed to be related to brain function. Driving this development is a class of problems which are intractable or, so far, not well suited to solution by conventional serially programmed digital computing technology, but which are often easily solved by humans or animals. Broadly, these problems relate to the recognition of patterns such as recognition of different sounds and of various kinds of images (including alphanumeric characters).
Neural network architectures have a massively parallel interconnection of simple identical computational elements which are referred to as neurons. Each neuron may modify the relationship between its inputs and outputs by some operation. The characteristics and processing power of a given neural network are dependent on several factors, including: the connection geometry, the operations used for the interaction between neurons, the learning rules used to modify the connection strengths, and the learning method itself. For pattern recognition, a neural network is taught, or spontaneously learns, to classify input patterns as one of a plurality of classes that the neural network has learned, and then is used to classify additional inputs, each as one of a plurality of classes.
The classical as well as the neural-network process for classifying a population of samples with l properties each, into 2 sub-classes, is often conceptually reformulated so that it is cast in terms of the separation of the samples, each represented by a point in an l-dimensional space, using an (l-1)-dimensional hypersurface, such that all members of one sub-class fall on one side of the hypersurface, and all members of the other sub-class fall on the other side. The line in the l-dimensional space from the origin of the coordinate system to the sample point, specified by the l sample coordinates, is referred to as the l-dimensional sample-vector, input-vector or feature-vector. When l=2 and the space is Euclidean, the population is said to be linearly separable if a straight line (a hyperplane, with (l-1)=1 dimension) can be drawn so that all points of one sub-class fall on one side of the line, and all of the others fall on the other side.
This definition of linear separability is generalized to spaces of any dimensionality; the l-dimensional population is said to be linearly separable if an (l-1)-dimensional hyperplane can partition the samples into the 2 sub-classes. If it takes a hypersurface other than a hyperplane (e.g., a hypersphere) to provide separation, then the population is not linearly separable. For instance, if l=2, the hypersphere might be a circle with its center in the middle of a cluster of sample points, and its perimeter surrounding the cluster, so that the cluster sub-class is separated from the residual sub-class (i.e., samples distributed throughout the rest of the space) by the "hyperspherical surface."
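The distinction between separation by a hyperplane and separation by a hypersphere can be illustrated for l=2 with a minimal sketch. The function names, the sample coordinates, and the grid of candidate lines are illustrative assumptions and do not come from the specification. The second data set places a cluster at the origin inside four residual samples whose average is the origin, so no straight line can separate it, while a circle does.

```python
import math


def on_positive_side(x, w, b):
    """True if sample x lies strictly on the positive side of the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


def inside_hypersphere(x, center, radius):
    """True if sample x lies strictly inside the hypersphere."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, center)) < radius ** 2


def separates(flags, labels):
    """True if the boolean flags split the samples exactly along the two labels."""
    matches = [flag == (label == 1) for flag, label in zip(flags, labels)]
    return all(matches) or not any(matches)


def hyperplane_separates(samples, labels, w, b):
    return separates([on_positive_side(x, w, b) for x in samples], labels)


def hypersphere_separates(samples, labels, center, radius):
    return separates([inside_hypersphere(x, center, radius) for x in samples], labels)


def any_candidate_line_separates(samples, labels):
    """Scan a coarse grid of 2-D lines (an illustrative search, not a proof)."""
    for deg in range(0, 360, 5):
        w = (math.cos(math.radians(deg)), math.sin(math.radians(deg)))
        for step in range(-20, 21):
            if hyperplane_separates(samples, labels, w, step / 4):
                return True
    return False


# Example A (hypothetical data): two groups on opposite sides of the line x1 + x2 = 0.
linear_samples = [(1, 1), (2, 1.5), (1.5, 2.5), (-1, -1), (-2, -0.5), (-1.5, -2)]
linear_labels = [1, 1, 1, 0, 0, 0]

# Example B (hypothetical data): a cluster at the origin surrounded by residual samples.
ring_samples = [(0, 0), (0.5, 0), (0, -0.5), (-0.5, 0.5), (3, 0), (-3, 0), (0, 3), (0, -3)]
ring_labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(hyperplane_separates(linear_samples, linear_labels, w=(1, 1), b=0))
print(any_candidate_line_separates(ring_samples, ring_labels))
print(hypersphere_separates(ring_samples, ring_labels, center=(0, 0), radius=1.5))
```

The grid scan only samples candidate lines; the impossibility for Example B follows from the cluster point at the origin lying inside the convex hull of the residual samples.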
Populations that may be separated by combinations of hyperplanes, however, are said to be piece-wise linearly separable. For example, for l=2, a circle which encompasses a population can be approximated by a number of line segments. Piece-wise linear separation covers a very large category of classification problems. Nevertheless, it can also be easily shown that there can exist "intertwined" connected classes, in l-dimensional space, which cannot be separated by any number of (l-1)-dimensional hyperplanes, and such classes not only are not linearly separable, but further are not piece-wise linearly separable.
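The approximation of a circle by line segments can be sketched as an intersection of half-planes, one per side of a regular polygon drawn around the circle. This is a minimal illustration for l=2; the function names, the unit radius, and the probe point are assumptions chosen for the example.

```python
import math


def inside_polygon(x, radius, n_sides):
    """True if the 2-D point x passes every half-plane test of a regular polygon
    whose sides are tangent to a circle of the given radius centred at the origin."""
    for k in range(n_sides):
        theta = 2 * math.pi * k / n_sides
        # Each side contributes one linear (hyperplane) test.
        if math.cos(theta) * x[0] + math.sin(theta) * x[1] > radius:
            return False
    return True


def inside_circle(x, radius):
    """True if the 2-D point x lies within the circle centred at the origin."""
    return math.hypot(x[0], x[1]) <= radius


# A probe point slightly outside the unit circle, halfway between two octagon sides.
probe = (1.05 * math.cos(math.pi / 8), 1.05 * math.sin(math.pi / 8))

print(inside_circle(probe, 1.0))       # the circle excludes the probe
print(inside_polygon(probe, 1.0, 8))   # a coarse octagon still includes it
print(inside_polygon(probe, 1.0, 64))  # a finer polygon excludes it
```

Increasing the number of sides tightens the approximation, at the cost of one additional hyperplane test per side.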
Two extremes of distributions of samples in an l-dimensional (but not necessarily Euclidean) space are described by:
(1) Samples that are distributed with uniform density throughout the space. In this case, the space can be divided with hypersurfaces into groups of contiguous sample "sub-classes" in an infinite number of ways, all equally "unnatural". Nevertheless, any one of such classifications, though unnatural, may be more useful or more economical for labelling or coding of newly observed samples than using the raw l coordinates of each sample point; and
(2) Samples that are distributed non-uniformly in the space, (i.e., clustered), with large empty gaps between the clusters. Such distributions are amenable to unique partitioning with many kinds of hypersurfaces that pass only through regions free of samples. Members of such kinds of sub-classes can be identified reproducibly with confidence and "meaning". Such sub-classes are "natural sub-classes", and are typical of most named classes generally used in thought and speech (e.g., note the large gaps between chairs, stars and even dogs and cats).
During the last ten years, significant and exciting technologies have been described and explored for recognizing and extracting patterned information from data sets for their classification, using neural networks. These new methodologies are now usually referred to under the rubrics of Parallel Distributed Processing or Connectionist technologies, and have arisen mainly from studies of, and interests in the functioning of biological nervous systems. They follow from attempts to model such functions with networks of rather simple processors connected in manners that crudely simulate natural neural nets. Instead of being programmed in detail to do their task, neural networks learn from experience.
Two general types of learning methods for neural networks are supervised and unsupervised learning. These general methods can be described as follows:
(1) Supervised learning requires presenting the network with a training set of input samples (each sample is represented by a descriptive data set, e.g., an l-dimensional input vector) and an associated label (each label represents a target for the output). The set of corresponding labels are determined according to prior classification performed separately from the neural network by an expert. Typically, this prior classification involves computationally intensive methods and/or extensive human experience. The network learns by adjusting parameters such that the outputs generated by the network in response to the training set of input vectors are within an acceptable error margin compared to the respective expert-supplied targets for the training set of input samples. The trained network is subsequently used to bypass the expert to automatically recognize and classify additional input samples from unlabeled data sets.
(2) In contrast, unsupervised learning requires the automatic "discovery" of clusters of samples among a training set of unlabelled input sample data sets on the basis of some measure of closeness, and the sample population is thereby partitioned into sub-classes which are then labelled, entirely without expert intervention. Subsequently, the network is used to automatically recognize and label additional data sets of unlabelled samples.
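As a rough illustration of the unsupervised mode, in which clusters are discovered from a measure of closeness alone and new samples are then labelled without an expert, the following sketch groups samples that are linked by small Euclidean distances. It is a generic, non-neural single-linkage procedure chosen for brevity, not the method of the invention; the function names, the `max_gap` threshold, and the data are assumptions.

```python
import math


def cluster_by_gaps(samples, max_gap):
    """Assign the same label to samples connected by steps no longer than max_gap."""
    labels = [None] * len(samples)
    next_label = 0
    for start in range(len(samples)):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            a = stack.pop()
            for b in range(len(samples)):
                if labels[b] is None and math.dist(samples[a], samples[b]) <= max_gap:
                    labels[b] = next_label
                    stack.append(b)
        next_label += 1
    return labels


def label_new_sample(sample, samples, labels):
    """Label an unseen sample with the label of its nearest clustered neighbour."""
    nearest = min(range(len(samples)), key=lambda i: math.dist(sample, samples[i]))
    return labels[nearest]


# Hypothetical unlabelled data: two clusters separated by a large empty gap.
unlabelled = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.2, 4.9), (4.8, 5.1)]
discovered = cluster_by_gaps(unlabelled, max_gap=1.0)
print(discovered)
print(label_new_sample((4.9, 5.0), unlabelled, discovered))
```

No target labels are supplied at any point; the labels are simply integers assigned in order of discovery.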
Work on supervised learning with neural networks goes back to Rosenblatt's Perceptrons and to Widrow's ADALINE [Rosenblatt, F. Principles of Neurodynamics, Spartan (1962) and Widrow, B., and Hoff, M. E., Jr. "Adaptive Switching Circuits" IRE WESCON Convention Record, pt.4, 96-104 (1960)]. Widrow developed a delta rule which could be used for systematically implementing learning in a two-layer network (having an input layer and an output layer) but was not applicable to multi-layer networks. In 1969, Minsky and Papert, [Minsky, M., and Papert, S. Perceptrons, MIT Press (1969)], proved that 2-layer networks, like those studied by Widrow and by Rosenblatt, fail to separate sub-classes which require a hypersurface more complicated than a hyperplane.
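The delta rule and the two-layer limitation can be sketched with a single linear unit trained by the Widrow-Hoff update, in which each weight is moved in proportion to the difference between the target and the unit's output. This is a minimal illustration under assumed settings (targets of +1 and -1, a learning rate of 0.05, 500 passes, and a sign threshold for classification), not a reproduction of the cited works. The AND problem is linearly separable; the XOR problem is not, so no setting of the weights can classify all four of its samples.

```python
def train_delta_rule(samples, targets, rate=0.05, epochs=500):
    """Train one linear unit: w <- w + rate * (target - output) * input."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            output = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = t - output
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias


def classify(x, weights, bias):
    """Threshold the linear output at zero to obtain a class of +1 or -1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


def count_correct(samples, targets, weights, bias):
    return sum(classify(x, weights, bias) == t for x, t in zip(samples, targets))


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [-1, -1, -1, 1]   # linearly separable
xor_targets = [-1, 1, 1, -1]    # requires more than one hyperplane

and_w, and_b = train_delta_rule(inputs, and_targets)
xor_w, xor_b = train_delta_rule(inputs, xor_targets)

print(count_correct(inputs, and_targets, and_w, and_b), "of 4 correct on AND")
print(count_correct(inputs, xor_targets, xor_w, xor_b), "of 4 correct on XOR")
```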
In 1986, Rumelhart, Hinton, and Williams found that learning could be implemented in a multi-layer feedforward neural network [now frequently called a Multi-Layer Perceptron (MLP)], by back propagation of error based on a Generalized Delta Rule, [Rumelhart, D. E., Hinton, G. E., and Williams, R. J. "Learning Representations by Back Propagating Errors", Nature, 323, 533-536 (1986)]. Essentially the same delta rule was independently developed by Werbos [Werbos, P., Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. Thesis, Harvard University, August, 1974], by Parker [Parker, D. B. "Learning-Logic" TR-47 Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science, (1985)], and by le Cun [le Cun, Y. "Learning processes in an asymmetric threshold network", in Disordered Systems and Biological Organization, F. Soule, E. Bienstock and G. Weisbuch, Eds., (Springer-Verlag, Les Houches, France, 1986) pp. 233-340]. Currently, the back propagation method, referred to simply as "backpropagation" or "back propagation," is perhaps the most popular method of supervised learning.
Back propagation automatically acquires internal sets of connection-weights or synaptic states which, when three or more layers are used, permit even the separation of samples that map nonlinearly from layer to layer, that exhibit complex connectivity and that are not even linearly separable. Rumelhart, Hinton, and Williams have called the complex internal states generated by back propagation "internal representations" [Rumelhart, D. E., Hinton, G. E., and Williams, R. J. "Learning Internal Representations by Error Propagation", in D. E. Rumelhart and J. L. McClelland (Eds.) Parallel Distributed Processing MIT Press, 318-362 (1986)].
In a "layered" back propagation network, a number of layers of neurons or nodes are connected in series. The progression is from input layer to output layer. Typically, each node in the input layer is connected to every node in the next layer, and this pattern is repeated through the layers, terminating with a final output layer. All layers between input and output layers are known as "hidden" layers. A signal passing through a connection (often referred to as a synapse), is multiplied by a connection-specific weight. The weighted signals entering a node are summed, a bias is added and then the signal is transformed non-linearly by a threshold [usually a sigmoid function of the form $S_o = (1 + e^{-\Sigma})^{-1}$, where $\Sigma$ is the biased sum of weighted signals entering the node, and $S_o$ is the output signal from that node] and the output signals are fed to the next layer, until the final output layer is reached.
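The signal flow described above (weighted sum, added bias, sigmoid threshold, then on to the next layer) can be sketched as follows. The layer sizes, weight values, and function names are illustrative assumptions; only the sigmoid form is taken from the text.

```python
import math


def sigmoid(total):
    """Sigmoid threshold: output = 1 / (1 + e^(-total))."""
    return 1.0 / (1.0 + math.exp(-total))


def layer_forward(signals, weights, biases):
    """Compute the output signal of every node in one fully connected layer.

    weights[j][i] is the weight on the connection from input i to node j.
    """
    return [
        sigmoid(sum(w * s for w, s in zip(row, signals)) + bias)
        for row, bias in zip(weights, biases)
    ]


def network_forward(signals, layers):
    """Feed the signals through each (weights, biases) layer in series."""
    for weights, biases in layers:
        signals = layer_forward(signals, weights, biases)
    return signals


# Hypothetical network: 2 inputs, one hidden layer of 3 nodes, 1 output node.
hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, -0.2, 0.1])
output = ([[0.7, -0.5, 0.2]], [0.05])

print(network_forward([1.0, 0.0], [hidden, output]))
```

Because every node applies the sigmoid, each output signal lies strictly between 0 and 1.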
Supervised learning in such networks is accomplished by adjusting the individual connection-specific weights until a set of trained weights is capable of transforming each and every member of the input-training-set into an output vector which matches its target label within some prescribed level of precision. The network has then "learned" the classification it was taught, and has acquired the useful ability to thereafter rapidly classify new unlabelled samples drawn from the same general class as the training set, into the appropriate learned sub-classes. The procedure for adjusting weights, which is the key to the power of back propagation, is the Generalized Delta Rule, and this depends intimately upon the labelled targets provided by an expert teacher. Thus, the ability of the network to classify new unlabelled inputs depends completely on the prior classification of a teacher-labelled training set.
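The dependence of the weight adjustment on teacher-supplied targets can be seen in a minimal sketch of error back propagation for a network with one hidden layer. It assumes a squared-error measure, sigmoid nodes, plain full-batch gradient descent without momentum, and arbitrary settings (learning rate 0.1, 200 passes, XOR as the labelled training set); these choices and all names are illustrative and are not prescribed by the text.

```python
import math
import random


def sigmoid(total):
    return 1.0 / (1.0 + math.exp(-total))


def forward(x, W1, b1, W2, b2):
    """Forward pass: return hidden-layer and output-layer signals."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, o


def loss(data, W1, b1, W2, b2):
    """Half the summed squared error between outputs and teacher-supplied targets."""
    total = 0.0
    for x, t in data:
        _, o = forward(x, W1, b1, W2, b2)
        total += 0.5 * sum((ok - tk) ** 2 for ok, tk in zip(o, t))
    return total


def gradients(data, W1, b1, W2, b2):
    """Back-propagate output errors to obtain the loss gradient for every weight."""
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [[0.0] * len(row) for row in W2]
    gb2 = [0.0] * len(b2)
    for x, t in data:
        h, o = forward(x, W1, b1, W2, b2)
        # Output deltas: (output - target) times the sigmoid slope o(1 - o).
        d_out = [(ok - tk) * ok * (1 - ok) for ok, tk in zip(o, t)]
        # Hidden deltas: output deltas propagated backwards through W2.
        d_hid = [
            hj * (1 - hj) * sum(d_out[k] * W2[k][j] for k in range(len(W2)))
            for j, hj in enumerate(h)
        ]
        for k, dk in enumerate(d_out):
            gb2[k] += dk
            for j, hj in enumerate(h):
                gW2[k][j] += dk * hj
        for j, dj in enumerate(d_hid):
            gb1[j] += dj
            for i, xi in enumerate(x):
                gW1[j][i] += dj * xi
    return gW1, gb1, gW2, gb2


def train(data, W1, b1, W2, b2, rate=0.1, epochs=200):
    """Full-batch gradient descent; the weight lists are updated in place."""
    for _ in range(epochs):
        gW1, gb1, gW2, gb2 = gradients(data, W1, b1, W2, b2)
        for j in range(len(W1)):
            b1[j] -= rate * gb1[j]
            for i in range(len(W1[j])):
                W1[j][i] -= rate * gW1[j][i]
        for k in range(len(W2)):
            b2[k] -= rate * gb2[k]
            for j in range(len(W2[k])):
                W2[k][j] -= rate * gW2[k][j]


# Teacher-labelled training set (XOR): each input vector is paired with its target.
xor_data = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [rng.uniform(-1, 1) for _ in range(2)]
W2 = [[rng.uniform(-1, 1) for _ in range(2)]]
b2 = [rng.uniform(-1, 1)]

loss_before = loss(xor_data, W1, b1, W2, b2)
train(xor_data, W1, b1, W2, b2)
loss_after = loss(xor_data, W1, b1, W2, b2)
print(loss_before, loss_after)
```

Every delta in `gradients` begins from the difference between an output and its target, so without labelled targets this procedure has nothing to propagate. The short run shown only reduces the error; reaching a prescribed precision on XOR may take many more passes and depends on the starting weights.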
Unsupervised learning methods for neural networks are associated with the names of Grossberg and Kohonen [e.g., see Carpenter, G., and Grossberg, S. "The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network", Computer [March] 77-87 (1988); and Kohonen, T. Self-Organizing and Associative Memory, Springer Series in Information Science, 8 (1983)]. The ART 1 and ART 2 of Grossberg and the networks of Kohonen lack hidden layers, and therefore, like Rosenblatt's Perceptron, are subject to the same limitations of 2-layer networks noted by Minsky and Papert. Also, unsupervised neural networks generally have been more difficult to implement and have performed more poorly than multi-layer back propagation networks.
Unsupervised learning with non-neural-net approaches has also been discussed for a long time (e.g., Ornstein, L. "Computer Learning and the Scientific Method: A Proposed Solution to the Information Theoretical Problem of Meaning" J. Mt. Sinai Hosp. 32, 437-494 (1965)). Ornstein presented a procedure for generating hierarchical classifications based on informationally-weighted similarity measures. Neural network approaches have also employed hierarchical procedures, such as Ballard's method for minimizing the so-called "scaling problem" (the slower-than-linear decrease in the rate of convergence and therefore, the greater-than-linear increase in the learning-time with increasing numbers of network layers), by stacking separate networks in series (Ballard, D. H. "Modular Learning in Neural Networks," Proceedings of the Sixth National Conference on Artificial Intelligence, 1, 279-284 (1987)), and the work of Sankar and Mammone on Neural Tree Networks (Sankar, A., and Mammone, R. J. "Growing and Pruning Neural Tree Networks," IEEE Transactions on Computers 42, 291-299 (1993)). A review of hierarchical methodologies is provided by Safavian and Landgrebe (Safavian, S. R., and Landgrebe, D. "A Survey of Decision Tree Classifier Methodology", IEEE Transactions on Systems, Man and Cybernetics 21, 660-674 (1991)). Although hierarchical methodologies represent a powerful means for classification, their application to unsupervised neural networks has been limited.
Despite the progress and effort made in the field of neural networks, particularly since the advent of back propagation, further developments and improvements in neural network learning methods, and particularly, advances in unsupervised learning methods, continue to be pursued by researchers in this field.
In particular, it would be advantageous to have a method for implementing learning in a neural network which combines the attributes of unsupervised learning with the convenience and power of back propagation. However, there has been only the slightest suggestion that back propagation could be used for unsupervised learning (see Zipser, David, and Rumelhart, David E., "The Neurobiological Significance of the New Learning Models," in Computational Neuroscience, E. L. Schwartz, Ed., MIT Press, Cambridge, 1990, pp. 192-200). Indeed, M. Caudill [Caudill, M. "Avoiding the Great Back-propagation Trap" AI Expert Special Edition (January), 23-29 (1993)] states: "If you have no information at all to give the network, you are automatically constrained to unsupervised training schemes" and "if you want to use unsupervised training, you have completely eliminated backpropagation networks."
Accordingly, an object of the present invention is to provide a method for unsupervised neural network classification based on back propagation.
A related object of the present invention is to provide an improved learning method for neural networks that does not require prior classification of the training set.
A further object of the present invention is to provide an unsupervised back propagation neural network learning method which includes a method for automatically discovering natural classes.
Yet another related object of the present invention is to provide an unsupervised learning method, including hierarchical classification, for discovering and efficiently classifying natural classes among sets of classes using feedforward multilayered networks.
The foregoing specific objects and advantages of the invention are illustrative of those which can be achieved by the present invention and are not intended to be exhaustive or limiting of the possible advantages which can be realized. Thus, these and other objects and advantages of the invention will be apparent from the description herein or can be learned from practicing the invention, both as embodied herein or as modified in view of any variations which may be apparent to those skilled in the art.