Information in input signals or observation data has to be handled in many applications, such as computer vision or the control of an autonomous robot, and several different kinds of methods have been developed for this purpose.
For instance, several such methods have been presented in machine learning and in statistical analysis that can be used to learn a representation of information; in other words, the parameters that define the representation are adapted on the basis of observation data. The purpose of adaptation is often to find useful features in the observations, which can then be used, for instance, for pattern recognition, control or corresponding actions. For example, independent component analysis is used to find descriptive features of input signals from observation data. Common to many of these methods is that an information processing unit forms, on the basis of input vectors, feature vectors that describe properties of the input vectors, and that the meaning of the elements of the feature vectors, in other words the features, changes as a result of adaptation. The state vector of a Kalman filter, for example, can be regarded as a feature vector, since it describes information in the input vectors.
Many of the methods described above can be interpreted in terms of Bayesian statistics, where probabilities describe a degree of belief and Bayes' rule specifies how to integrate information from different sources. For example, when the value of the state vector is calculated in a Kalman filter, the information obtained from the observation signals is combined with that from previous values of the state vector. In terms of Bayesian statistics, a prior distribution, which is calculated from previous state vectors, and a likelihood, which is calculated on the basis of the observation signals, are combined into a posterior distribution of the state vector, which describes the probabilities of different state vector values. In the case of the Kalman filter, the posterior distribution is Gaussian and is represented by a posterior mean and covariance. In some other methods, the posterior uncertainty is not represented at all, but these methods can nevertheless be interpreted as approximations of the Bayesian method.
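The combination of a prior and a likelihood into a Gaussian posterior can be illustrated with a minimal sketch for a one-dimensional state. The sketch assumes a scalar state that follows a random-walk model; the function names, the noise variances and the sample observations are hypothetical and are chosen only for illustration.

```python
def combine_gaussians(prior_mean, prior_var, obs, obs_var):
    """Fuse a Gaussian prior with a Gaussian likelihood (scalar Kalman update)."""
    gain = prior_var / (prior_var + obs_var)            # Kalman gain
    post_mean = prior_mean + gain * (obs - prior_mean)  # posterior mean
    post_var = (1.0 - gain) * prior_var                 # posterior variance
    return post_mean, post_var


def kalman_step(mean, var, obs, process_var, obs_var):
    """One filter step: predict a prior from the previous state, then update."""
    # Assumed random-walk state model: the mean is kept and the uncertainty grows.
    prior_mean, prior_var = mean, var + process_var
    return combine_gaussians(prior_mean, prior_var, obs, obs_var)


# Hypothetical observations of a quantity close to 1.0.
mean, var = 0.0, 1.0
for z in [1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, z, process_var=0.01, obs_var=0.5)
```

With equal prior and observation variances the posterior mean lies halfway between the prior mean and the observation, and the posterior variance is always smaller than the prior variance, which reflects the integration of two information sources.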
It is worth noting that, in these methods, the value of a feature vector describes an observation signal, but information is also integrated from prior information.
Pattern recognition is a part of machine learning that aims to develop models or systems that recognize patterns in information. Pattern recognition is applied, e.g., in information technology and robotics, but also in medical technology and in research on human-computer interaction. A pattern recognition system can be defined as a process with four steps: measurement, preprocessing, feature extraction and classification.
In the first step, the data needed are acquired, mainly by measuring physical variables and converting the analog data obtained into digital form. In the second step, the data are preprocessed, often by using different kinds of digital signal processing methods, such as filtering or principal component analysis. In the third step, the preprocessed measurement data are mapped into the feature space. In this step, the data can be seen as being converted into information. In the fourth step, the samples mapped into the feature space are categorized into two or more categories by using a classifier.
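The four steps described above can be sketched as a chain of functions. This is only a toy illustration: the quantization step, the moving-average filter, the two features and the threshold classifier are all assumptions made for the example and are not taken from any particular system.

```python
def measure(analog_samples, step=0.1):
    """Step 1: digitize analog values by rounding to a fixed quantization step."""
    return [round(x / step) * step for x in analog_samples]


def preprocess(samples, window=3):
    """Step 2: smooth the digitized signal with a simple moving-average filter."""
    return [sum(samples[i:i + window]) / window
            for i in range(len(samples) - window + 1)]


def extract_features(samples):
    """Step 3: map the signal to a 2-D feature vector (mean, peak-to-peak range)."""
    return (sum(samples) / len(samples), max(samples) - min(samples))


def classify(features, range_threshold=0.5):
    """Step 4: assign one of two categories with a threshold classifier."""
    _, spread = features
    return "varying" if spread > range_threshold else "steady"


def recognize(analog_samples):
    """Run measurement, preprocessing, feature extraction and classification."""
    return classify(extract_features(preprocess(measure(analog_samples))))


label_flat = recognize([1.0, 1.02, 0.98, 1.01, 1.0])
label_wave = recognize([0.0, 2.0, 0.0, 2.0, 0.0, 2.0])
```

In a real system each step would be considerably more elaborate, but the division of labour is the same: only the last function makes a categorical decision, and it sees only the feature vector, not the raw measurements.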
In demanding pattern recognition tasks, such as handwritten character recognition, the processing of feature vectors is often organized hierarchically in such a way that several processing units exist on each level and each unit receives feature vectors as input from a subset of the units on lower levels of the hierarchy. In such systems, processing units can receive features as context from units on higher levels and possibly also from units on the same level. An advantage of using a hierarchy is that the number of connections remains low, while the system is still able to integrate information from a large number of input signals. A classic example of such a hierarchical model is the Neocognitron (Fukushima, 1980).
In many applications, one faces a situation in which the observation data contain more statistical structure than the processing unit used is able to represent. In such a case, it is necessary to select the information that is useful for the task. It is often necessary to adapt the model in such a way that the features it represents describe information useful for the task.
In some cases, supervised learning, which is a machine learning method, is suitable for a controlled selection of useful features. Supervised learning corresponds to regression in statistics. The goal of regression is to find a mapping from an input vector to another vector, called a target vector. If the mapping from input vectors to target vectors is done in several steps, the intermediate results can be interpreted as feature vectors, especially if the structure of the model is chosen in such a way that the dimension of the intermediate result is smaller than that of the input and target vectors (Hecht-Nielsen, 1995). In the simplest case, two sequential linear mappings can be used, the result being essentially the same as in canonical correlation analysis. The input vector can be interpreted as a primary input and the target vector as a context, because in the supervised learning methods described above the feature vector depends directly on the input vector, but during training the representation is modified in such a way that the features contain as much information about the target vector as possible.
A problem with these methods is that a huge amount of observation data is needed for the supervised learning of a multi-step mapping, which makes learning in many everyday problems impractically slow, or impossible when adequate observation data are lacking. Typically, one tries to solve the problem by using some unsupervised learning method as preprocessing, for example the independent component analysis mentioned above. The problem with such a solution is that unsupervised learning only aims to describe the observation data and does not try to select information useful for the task. One solution that has been presented is, e.g., to find time-invariant features. However, this very often does not limit the set of potentially useful features to a sufficient extent and, on the other hand, useful information that does not happen to be time-invariant may be lost.
Basically, the problem is that methods based on Bayesian statistical inference (such as the hierarchical temporal memory described in the publications WO2006063291 and US2007005531) are able to represent the degree of belief and the phenomena underlying the observations, but the selection of useful information is by nature a decision-theoretical problem. In addition to the degree of belief, it is important to consider utility, as it is called in decision theory. The probability of a feature answers the question of whether the feature is present in the observations, but the utility that is obtained if the feature is present also needs to be considered.
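The difference between ranking features by probability alone and ranking them by probability together with utility can be shown with a small sketch. It assumes, as a simplification, that the expected utility of representing a feature is the product of its probability and its utility; the feature names and the numbers are hypothetical.

```python
def select_features(candidates, budget):
    """Keep the `budget` features with the highest probability * utility."""
    ranked = sorted(candidates,
                    key=lambda f: f["probability"] * f["utility"],
                    reverse=True)
    return [f["name"] for f in ranked[:budget]]


# Hypothetical candidate features for a grasping task.
candidates = [
    {"name": "background texture", "probability": 0.95, "utility": 0.1},
    {"name": "target object edge", "probability": 0.60, "utility": 1.0},
    {"name": "obstacle",           "probability": 0.30, "utility": 0.9},
]

selected = select_features(candidates, budget=2)
```

In this example the most probable feature, the background texture, is not selected, because its low utility gives it the smallest expected utility of the three candidates.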
Methods that solely use statistical inference are especially unsuitable for situations in which useful information has to be selected dynamically. In many everyday problems, for example in the control of robot-hand grasping, it is important to be able to select the information relevant to the current task, and the control problem is significantly easier if only useful information is represented (for instance the position of the hand, the place and shape of the target object and possible obstacles between the hand and the object) instead of all the information given by the sensors (cameras, microphones and so on).
The cortex of human beings and other mammals can solve the types of problems described above. The cortex is able to integrate enormous amounts of information given by the senses, to select the parts essential to the task (selective attention, dynamical selection) and to modify its representations in a direction useful to different kinds of tasks (learning, modification of parameters). The development of many methods resembling those described in this publication has therefore been inspired by the function of the cortex of human beings and other mammals.
Selective attention has been shown to emerge from systems that consist of several information processing units, if every unit selects the information it represents on the basis of the information that other units transmit to it as context (for example Usher and Niebur, 1996). Selective attention emerges if the units aim to select, from their own primary input, the information that fits the context. Because it is mostly useful to alternate the target of attention, a mechanism is also needed that prevents attention from focusing on the same thing for too long. Many methods for this purpose exist (e.g. Lankheet, 2006).
It is known that selective attention in humans plays a central role in learning associations between features (Kruschke, 2001) and also in supervising the learning of representations. In previous models, however, this has either not been implemented at all or the targets of attention have been determined in some way in advance. Walther and Koch (2006) used the latter method. Their model is unable to adapt to new types of objects; instead, attention and learning only work for objects that fulfil criteria for so-called proto-objects that are set in advance.
The aim of so-called principal component analysis (PCA) is to find those components by means of which the central features of multidimensional data can be represented without losing important information. Principal component analysis orders the components of the input data according to their eigenvalues.
After principal component analysis, it is necessary to select which components can be discarded as less important, because the method does not automatically discard any components; it only arranges the components found in order of magnitude. It is one of the most central methods in pattern recognition and signal processing.
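The two stages, ordering the components by eigenvalue and then separately deciding which of them to keep, can be sketched for two-dimensional data, where the eigenvalues of the covariance matrix have a closed form. The function names and the 95 % explained-variance threshold are assumptions made for this illustration; the selection rule in particular is only one common choice.

```python
import math


def pca_2d(points):
    """Return [(eigenvalue, unit eigenvector), ...] of the sample covariance,
    ordered from the largest eigenvalue to the smallest."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the symmetric covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    half_trace = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    eigenvalues = [half_trace + radius, half_trace - radius]

    if abs(b) > 1e-12:
        vectors = [(b, lam - a) for lam in eigenvalues]
    else:
        # Diagonal covariance: the eigenvectors are the coordinate axes.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]

    components = []
    for lam, (vx, vy) in zip(eigenvalues, vectors):
        norm = math.hypot(vx, vy)
        components.append((lam, (vx / norm, vy / norm)))
    return components


def keep_components(components, min_explained=0.95):
    """Selection step that PCA itself does not perform: keep the leading
    components until they explain at least `min_explained` of the variance."""
    total = sum(lam for lam, _ in components)
    if total == 0.0:
        return list(components)
    kept, running = [], 0.0
    for lam, vec in components:
        kept.append((lam, vec))
        running += lam
        if running / total >= min_explained:
            break
    return kept


# Hypothetical data lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
components = pca_2d(points)
kept = keep_components(components)
```

For these points all of the variance lies along one direction, so a single component is kept. Note that the threshold is applied to variance only; nothing in this procedure considers whether the discarded components would have been useful for a given task.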
In the method presented by Deco and Rolls in the article “A neurodynamical cortical model of visual attention and invariant object recognition”, Vision Research 44 (2004) 621-642, dynamic selection takes place on the basis of context. Pre-activations are first calculated for the features solely on the basis of the “primary input”. After that, the pre-activations of those features that are predicted from the context are strengthened. Finally, the strongest activations are selected. Thus, the context affects which features are selected for representation. However, the parameters were adapted by a separate method, presented by Wallis, Rolls and Földiák (G. Wallis, E. T. Rolls and P. Földiák, “Learning invariant responses to the natural transformations of objects”. International Joint Conference on Neural Networks, 2:1087-1090, 1993), in which no real context is used. Thus, in the article of Deco and Rolls, the parameters were first adapted by one method and the parameters obtained were then used in the dynamical selection of feature information based on predictions made from the context.
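The three stages of context-based selection described above (pre-activation from the primary input, strengthening by context, selection of the strongest activations) can be sketched as follows. This is not the neurodynamical model of Deco and Rolls; it is a simplified illustration in which the linear pre-activation, the multiplicative strengthening, the `gain` parameter and the winner-take-most rule are all assumptions.

```python
def select_with_context(primary_input, weights, context_prediction, gain=0.5, k=1):
    """Return feature activations after context-biased selection.

    weights[i] is the (fixed, already adapted) weight vector of feature i and
    context_prediction[i] in [0, 1] says how strongly the context predicts it.
    """
    # 1. Pre-activations are computed from the primary input only.
    pre = [sum(w * x for w, x in zip(row, primary_input)) for row in weights]
    # 2. Pre-activations of features predicted from the context are strengthened.
    biased = [a * (1.0 + gain * c) for a, c in zip(pre, context_prediction)]
    # 3. Only the k strongest activations are kept; the rest are set to zero.
    winners = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]
    return [biased[i] if i in winners else 0.0 for i in range(len(biased))]


# Hypothetical weights for three features over a two-dimensional input.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
primary_input = [1.0, 0.9]

without_context = select_with_context(primary_input, weights, [0.0, 0.0, 0.0])
with_context = select_with_context(primary_input, weights, [0.0, 1.0, 0.0])
```

Without context the first feature wins on the strength of the primary input alone, whereas a context that predicts the second feature shifts the selection to it. The weights are not changed by the selection, which corresponds to the separation between parameter adaptation and dynamical selection in this kind of approach.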
Canonical correlation analysis takes the context into consideration in the adaptation of the parameters, but information is not selected dynamically.
The following publications are presented as prior art in addition to those mentioned above.
K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”. Biological Cybernetics, 36:193-202, 1980.
R. Hecht-Nielsen, “Replicator neural networks for universal optimal source coding”. Science, 269:1860-1863, 1995.
J. K. Kruschke, “Toward a unified model of attention in associative learning”. Journal of Mathematical Psychology, 45:812-863, 2001.
M. J. M. Lankheet, “Unraveling adaptation and mutual inhibition in perceptual rivalry”. Journal of Vision, 6:304-310, 2006.
M. Usher and E. Niebur, “Modeling the temporal dynamics of IT neurons in visual search: A mechanism for top-down selective attention”. Journal of Cognitive Neuroscience, 8:311-327, 1996.
G. Wallis, E. T. Rolls and P. Földiák, “Learning invariant responses to the natural transformations of objects”. International Joint Conference on Neural Networks, 2:1087-1090, 1993. Wallis, Rolls and Földiák adapt parameters, but they use the temporal invariance of the produced features as the criterion for the adaptation, so context is not used at all.
D. Walther and C. Koch, “Modeling attention to salient proto-objects”. Neural Networks, 19:1395-1407, 2006.
Because the learning of the features and their dynamical selection are separate processes, information obtained by selective attention cannot be utilized in the learning.
The object of the method of this application is to combine adaptation of parameters that define the features with dynamical selection of feature information in such a way that the method both learns and selects features useful for the task.