In the terminology of pattern recognition, neural networks and machine learning, a feature vector is a transformation of a measurement vector, whose components are measurements or sensor outputs. This invention is mainly concerned with processing feature vectors and sequences of feature vectors for detecting and recognizing spatial and temporal causes (e.g., objects in images/video, words in speech, and characters in handwriting). This is what pattern recognition, neural networks and machine learning are essentially about. It is also a typical problem in the fields of computer vision, signal processing, system control, telecommunication, and data mining. Example applications that can be formulated as such a problem are handwritten character classification, face recognition, fingerprint identification, DNA sequence identification, speech recognition, machine fault detection, baggage/container examination, video monitoring, text/speech understanding, automatic target recognition, medical diagnosis, prosthesis control, robotic arm control, and vehicle navigation.
A good introduction to the prior art in pattern classification, neural networks and machine learning can be found in Simon Haykin, Neural Networks and Learning Machines, Third Edition, Pearson Education, New Jersey, 2009; Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer Science, New York, 2006; Christopher M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, 1995; B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, New York, 1996; S. Theodoridis and K. Koutroumbas, Pattern Recognition, Second Edition, Academic Press, New York, 2003; Anil K. Jain, Robert P. W. Duin and Jianchang Mao, “Statistical Pattern Recognition: A Review,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000; R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, John Wiley & Sons, New York, 2001; and Bernhard Schölkopf and Alexander J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass., 2002.
Commonly used pattern classifiers include template matching, nearest mean classifiers, subspace methods, 1-nearest neighbor rule, k-nearest neighbor rule, Bayes plug-in, logistic classifiers, Parzen classifiers, Fisher linear discriminants, binary decision trees, multilayer perceptrons, radial basis networks, and support vector machines. Each is suitable for some classification problems. However, in general, they all suffer from one or more of the following shortcomings: difficult training or design, heavy computation or memory requirements, an ad hoc penalty function, or poor generalization performance. For example, the relatively more powerful multilayer perceptrons (MLPs) and support vector machines (SVMs) are difficult to train, especially if the dimensionality of the feature vectors is large. After training, if new training data is to be learned, the trained MLP or SVM is usually discarded and a new one is trained from scratch. Their decision boundaries are determined by exemplary patterns from all classes. Furthermore, if there are a great many classes, or if there are no or not enough exemplary patterns for some “confuser classes,” as in target and face recognition, training an MLP or SVM either is impractical or incurs a high misclassification rate. Camouflaged targets or occluded faces not included in the training data are also known to cause high misclassification rates.
A pattern classification approach that is relatively seldom mentioned in the pattern recognition literature is the correlation matrix memory, or CMM, which has been studied mainly in the neural networks community (T. Kohonen, Self-Organization and Associative Memory, second edition, Springer-Verlag, 1988; R. Hecht-Nielsen, Neurocomputing, Addison-Wesley, 1990; Branko Soucek and The Iris Group, Fuzzy, Holographic, and Parallel Intelligence—The Sixth-Generation Breakthrough, edited, John Wiley and Sons, 1992; James A. Anderson, An Introduction to Neural Networks, The MIT Press, 1995; S. Y. Kung, Digital Neural Networks, Pearson Education POD, 1997; D. J. Willshaw, P. P. Buneman and H. C. Longuet-Higgins, “Non-holographic associative memory,” Nature, 222, pp. 960-962, 1969; D. J. Willshaw and H. C. Longuet-Higgins, “Associative memory models,” Machine Intelligence, vol. 5, edited by B. Meltzer & D. Michie, Edinburgh University Press, 1970; K. Nagano, “Association—a model of associative memory,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-2, pp. 68-70, 1972; G. Palm, “On associative memory,” Biological Cybernetics, vol. 36, pp. 19-31, 1980; E. Gardner, “The space of interactions in neural network models,” Journal of Physics, vol. A21, pp. 257-270, 1988; S. Amari, “Characteristics of sparsely encoded associative memory,” Neural Networks, vol. 2(6), pp. 451-457, 1989; J. Buckingham and D. Willshaw, “On setting unit thresholds in an incompletely connected associative net,” Network, vol. 4, pp. 441-459, 1993; M. Turner and J. Austin, “Matching Performance of Binary Correlation Matrix Memories,” Neural Networks, 1997). CMMs are associative memories, and their training is easy and fast even when the input has a very high dimensionality. If new training data is to be learned, or if the dimensionality of a trained CMM is to be modified, the CMM is not discarded but can be easily updated or expanded.
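The one-shot training and in-place updating of a CMM described above can be illustrated with a minimal sketch. It assumes the textbook linear form, in which the memory is a sum of outer products of responses and keys, and it assumes mutually orthogonal bipolar keys so that recall is exact. The class name, method names and example vectors are hypothetical and are not taken from the cited references.

```python
# Minimal correlation matrix memory (CMM) sketch.
# Assumption: textbook linear form M = sum_k y_k x_k^T with bipolar keys x_k.
# All names and example vectors are illustrative only.

class CorrelationMatrixMemory:
    def __init__(self, n_in, n_out):
        self.n_in = n_in
        self.M = [[0.0] * n_in for _ in range(n_out)]

    def learn(self, x, y):
        # One-shot Hebbian update: add the outer product y x^T to M.
        for i, yi in enumerate(y):
            row = self.M[i]
            for j, xj in enumerate(x):
                row[j] += yi * xj

    def recall(self, x):
        # Retrieve M x / n_in; exact when the stored keys are mutually orthogonal.
        return [sum(m * xj for m, xj in zip(row, x)) / self.n_in for row in self.M]


# Three mutually orthogonal bipolar keys (chosen so that recall is exact).
x1, y1 = [1, 1, 1, 1], [1.0, 0.0]
x2, y2 = [1, -1, 1, -1], [0.0, 1.0]
x3, y3 = [1, 1, -1, -1], [0.5, 0.5]

cmm = CorrelationMatrixMemory(n_in=4, n_out=2)
cmm.learn(x1, y1)
cmm.learn(x2, y2)
before = cmm.recall(x1)

# New training data arrives later: the memory is updated in place, not retrained.
cmm.learn(x3, y3)
after = cmm.recall(x1)
```

With keys that are not orthogonal, recall would also contain crosstalk from the other stored pairs, which is the interference that the encodings discussed in this section try to control.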
Two types of CMM are noteworthy. They are the holographic neural nets (John Sutherland, “Artificial neural device utilizing phase orientation in the complex number domain to encode and decode stimulus response patterns,” U.S. Pat. No. 5,214,745, May 25, 1993; John Sutherland, “Neural networks,” U.S. Pat. No. 5,515,477, May 7, 1996) and the binary CMMs in the aforementioned papers by Willshaw and Longuet-Higgins (1970), Palm (1980), Gardner (1988), S. Amari (1989), M. Turner and J. Austin (1997), and the references therein.
The main idea of holographic neural nets (HNets) is to represent real numbers by phase angle orientations on a complex number plane through the use of a sigmoidal transformation such as a hyperbolic tangent function. After each component of the input stimuli and output responses is converted into a complex number whose phase angle orientation (i.e., argument) represents the component, the correlation matrix is constructed in the standard manner. A holographic neural cell essentially comprises such a correlation matrix. Suppose that the dimensionality of the stimulus is large enough, augmented if necessary, and that the phase angle orientations of the stimuli and responses are more or less statistically independent and uniformly distributed on the unit circles in the complex number plane. Then, during retrieval, the “signal part” in the response to an input stimulus is hoped to be much greater than the “interference part” in the response to the same input stimulus, because the stored stimuli that are out of phase with said input stimulus cancel one another (“self-destruction”), much as the steps of a random walk on the complex number plane do. This idea allows more stimuli to be stored in a complex correlation matrix than do the earlier versions of the correlation matrix.
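The signal and interference parts described above can be observed in a small numerical sketch. This is a simplified illustration, not the method of the cited patents: it skips the sigmoidal transformation and draws the phase angles directly from a uniform distribution, which idealizes the independence and uniformity condition stated above. The function names, the dimensionality of 1024, the count of 8 stored pairs and the random seed are all assumptions chosen for illustration.

```python
import cmath
import math
import random

def encode(phases):
    # Map each phase angle to a unit-magnitude complex number.
    return [cmath.exp(1j * t) for t in phases]

def learn(memory, stimulus, response):
    # Accumulate conj(stimulus_j) * response for every component j.
    for j, s in enumerate(stimulus):
        memory[j] += s.conjugate() * response

def recall(memory, stimulus):
    # Normalized inner product: the stored response (signal part)
    # plus crosstalk from all other stored pairs (interference part).
    return sum(s * m for s, m in zip(stimulus, memory)) / len(stimulus)

rng = random.Random(0)   # fixed seed, assumed only for repeatability
n, n_pairs = 1024, 8     # illustrative sizes

stimuli = [encode([rng.uniform(-math.pi, math.pi) for _ in range(n)])
           for _ in range(n_pairs)]
responses = encode([rng.uniform(-math.pi, math.pi) for _ in range(n_pairs)])

memory = [0j] * n
for s, r in zip(stimuli, responses):
    learn(memory, s, r)

# Phase error in radians between each recalled response and the stored one.
phase_errors = [abs(cmath.phase(recall(memory, s) * r.conjugate()))
                for s, r in zip(stimuli, responses)]
```

Under these assumptions the interference part behaves like a random walk whose magnitude is on the order of the square root of (number of other stored pairs / dimensionality), so the phase errors stay small while few pairs are stored relative to the stimulus dimensionality.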
However, the holographic neural cell approach suffers from the following shortcomings. First, to avoid ambiguity at the point (−1, 0) = −1 + 0i in the complex plane, a neighborhood of (−1, 0) must be excluded from the range of the sigmoidal transformation. This prevents the aforementioned uniform distribution required for good self-destruction of the interference part. Second, it is not clear how to augment the stimuli without introducing much correlation among the stimuli, which again may reduce self-destruction of the interference part. Third, the argument of a complex number on the unit circle ranges from −π to π. To pack more stimuli on it, better self-destruction of the interference part is needed, which in turn requires a higher dimensionality of the stimuli. Such a higher dimensionality means a higher dimensionality of the correlation matrix, requiring more memory space to hold the matrix.
Binary CMMs have feature vectors encoded either into unipolar binary vectors with components equal to 1 or 0 or into bipolar binary vectors with components equal to 1 or −1. Bipolar binary vectors were used in most of the earlier work on binary CMMs. The superiority of sparse unipolar binary encoding (with most of the components of encoded feature vectors being 0 and only a few being 1) to nonsparse unipolar binary encoding and bipolar binary encoding was remarked and proved in the aforementioned papers by Willshaw and Longuet-Higgins (1970), Palm (1980), Gardner (1988), and S. Amari (1989). Sparsely encoded CMMs are easy to implement (J. Austin and J. Kennedy, “A hardware implementation of a binary neural network,” MicroNeuro, IEEE Computer Press, 1994), and have found many applications. Nevertheless, sparsely encoded CMMs have quite a few shortcomings: (a) A large sparse correlation matrix has very low “information density” and takes much memory space. (b) A multistage sparsely encoded CMM is often necessary. (c) There is no systematic way to determine the dimensionality of the sparse unipolar binary vectors to represent the feature vectors. (d) There is no systematic way to determine the number of stages or the number of neurons in each stage in a multistage sparsely encoded CMM. (e) There is no systematic way to determine whether a sparsely encoded CMM has a minimum misclassification probability for the given CMM architecture. (f) The mapping from the feature vectors to their sparse binary vector representations must be stored in some memory space, further reducing the overall memory density of the CMM.
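A sparsely encoded binary CMM of the kind discussed above can be sketched as follows. The sketch assumes the commonly described clipped Hebbian rule (a matrix bit is set wherever an input bit and an output bit are both 1) and a recall threshold equal to the number of active cue bits. The function names, vector sizes and stored patterns are hypothetical and chosen by hand for illustration.

```python
def new_memory(n_in, n_out):
    # Binary correlation matrix, initially all zeros.
    return [[0] * n_in for _ in range(n_out)]

def learn(M, x, y):
    # Clipped Hebbian rule: set a bit wherever input and output are both 1.
    for i, yi in enumerate(y):
        if yi:
            for j, xj in enumerate(x):
                if xj:
                    M[i][j] = 1

def recall(M, cue):
    # An output fires only if it is connected to every active cue bit.
    threshold = sum(cue)
    return [int(sum(m & c for m, c in zip(row, cue)) >= threshold) for row in M]

def sparse(n, active):
    # Unipolar binary vector of length n with 1s at the given indices.
    return [1 if k in active else 0 for k in range(n)]

n_in, n_out = 12, 8
pairs = [
    (sparse(n_in, {0, 1, 2}), sparse(n_out, {0, 1})),
    (sparse(n_in, {3, 4, 5}), sparse(n_out, {2, 3})),
    (sparse(n_in, {2, 6, 7}), sparse(n_out, {4, 5})),  # shares input bit 2 with the first pair
]

M = new_memory(n_in, n_out)
for x, y in pairs:
    learn(M, x, y)

full_cue_result = recall(M, pairs[0][0])
partial_cue_result = recall(M, sparse(n_in, {0, 1}))  # one active bit erased

# Fraction of matrix bits that are set, a rough view of shortcoming (a).
density = sum(map(sum, M)) / (n_in * n_out)
```

In this toy case both the full cue and the cue with one erased bit retrieve the first stored response, while fewer than a fifth of the matrix bits are set.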
Judging from the foregoing shortcomings of the commonly used pattern classifiers, the holographic neural nets, and the sparsely encoded CMMs, there remains a need for alternatives to existing pattern classifiers in the prior art for recognizing patterns.
In this invention disclosure, a cortex-like learning machine, called a probabilistic associative memory (PAM), is disclosed that processes feature vectors or sequences of feature vectors, each feature vector being a ternary feature vector.
A PAM is a network of processing units (PUs). It can be viewed as a new neural network paradigm or a new type of learning machine. Each PU generates a representation of a subjective probability distribution of the label of a feature subvector or a sequence of feature subvectors that are received by the PU. Some PUs convert such representations into ternary vectors, which are included in feature subvectors input to other PUs. Weights in a PU learn an input feature subvector with or without supervision by a Hebb rule of learning. Some advantages of PAMs are the following:
        1. As opposed to most commonly used pattern recognizers, a PAM generalizes not by a single holistic similarity criterion for the entire input exogenous feature vector, which noise, erasure, distortion and occlusion can easily defeat, but by a large number of similarity criteria for feature subvectors input to a large number of PUs in different layers. These criteria contribute individually and collectively to generalization for single and multiple causes. Example 1: smiling, putting on a hat, growing or shaving a beard, or wearing a wig can upset a single similarity criterion used for recognizing a face in a mug-shot photograph. However, a face can be recognized by each of a large number of feature subvectors of the face. If one of them is recognized to belong to a certain face, the face is recognized. Example 2: a typical kitchen contains a refrigerator, a counter top, sinks, faucets, stoves, fruit and vegetables on a table, etc. The kitchen is still a kitchen if a couple of items, say the stoves and the table with fruit and vegetables, are removed.
        2. Masking matrices in a PU eliminate effects of corrupted ternary components of the feature subvector input to the PU, and thereby enable maximal generalization capability of the PU, and in turn that of the PAM.
        3. PAMs are neural networks, but are no longer the black boxes with “fully connected” layers much criticized by opponents of such neural networks as multilayer perceptrons (MLPs) and recurrent MLPs, whose weights are iteratively determined by minimizing an error criterion and have no interpretation in the context of their applications. In a PU of a PAM, weights are correlations between orthogonal expansions of subvectors of the PU's input feature subvectors and the labels of these feature subvectors. Each PU has a receptive region in the exogenous feature vector input to the PAM and classifies any cause within the receptive region. Such interpretations can be used to help select the architecture (i.e., layers, PUs, connections, feedback structures, etc.) of a PAM for the application.
        4. The weights in each PU of a PAM learn by a Hebb rule and thus the PAM has a “photographic memory.” No iterative optimization such as that involved in local-search training methods using backpropagation or backpropagation through time is needed for learning. This allows easy learning of a large number of large exogenous feature vectors in reasonable time as well as easy online adaptive learning.
        5. A PU can learn with or without supervision. This allows a PAM to (1) perform unsupervised deep learning in lower layers and supervised learning in higher layers; (2) perform supervised learning when a label is provided from outside the PAM and unsupervised learning when not; and (3) perform autonomous learning.
        6. A PAM may have some capability of recognizing rotated, translated and scaled patterns. Moreover, easy learning and retrieving by a PAM allow it to learn translated, rotated and scaled versions of an input image with ease.
        7. PUs generate representations of probability distributions of the labels of their input feature subvectors. Such representations of probability distributions of a common label can be combined into a single representation of a probability distribution of the common label.
        8. PAMs with hierarchical and feedback structures can detect and recognize multiple and hierarchical causes in a spatial or temporal exogenous feature vector.
        9. The weight matrices (e.g., expansion correlation matrices) in different PUs can be added to combine the learned knowledge at virtually no additional cost.
        10. The architecture of a PAM can be adjusted without discarding learned knowledge in the PAM. This allows enlargement of the feature subvectors, an increase in the number of layers, and even an increase in feedback connections.
        11. Only a small number of algorithmic steps of parallel computing are needed for retrieval, and these steps are suitable for massive parallelization at the bit level and by VLSI implementation.