The present invention relates generally to automated recognition systems for classifying and identifying patterns as objects within a library of objects, and more specifically to a recognition system including feed forward, feed back multiple neural networks with context driven recognition.
As is generally known, preprocessing forms an integral part of many existing artificial recognition systems. In such systems, preprocessing converts or transforms an input signal into a more suitable form for further processing. Some common steps performed by existing recognition systems during preprocessing of an input signal include normalization, noise reduction (filtering), and feature extraction. There are various reasons for using preprocessing, including the fact that input signals often contain noise, and that preprocessing can sometimes effectively eliminate irrelevant information. Furthermore, performing data reduction during preprocessing may result in at least two advantages for neural network based recognition systems: a) dimensionality of the input vectors is reduced, affording better generalization properties, and b) training is generally much simpler and faster when smaller data sets are used. However, obtaining these advantages by preprocessing may in some cases introduce problems, since preprocessing may sometimes result in the loss of important information from the input signal.
For example, a standard procedure for noise removal in two-dimensional images is generally known as “smoothing” of the input signal. One of the simplest ways to smooth an image includes convolving it with a mask of a fixed size and setting the value of every pixel within the mask to the average value of the pixels within the mask. The smoothing process can be controlled by a set of parameters, one of which is the size of the mask. The size of the mask may vary from one pixel (when there is no smoothing) to the size of the whole image. Varying the value of this mask size parameter from low to high results in changing the resolution of the image from sharp to very blurry. It is, however, very difficult to correctly determine the value of the mask size parameter prior to recognition. This is because image recognition may be impaired if the resulting image resolution is too poor. For example, in the case of handwriting recognition, omission of a noise-like structure, such as a dot above “i” or “j”, due to poor image resolution, could impair recognition.
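The effect of the mask size parameter can be illustrated with a minimal sketch of an averaging (mean) filter. This is an illustration only, not part of the disclosed system: the function name `smooth`, the odd mask size, and the clipping of the mask at the image border are assumptions, and the sketch uses the common form in which each output pixel is the average of the mask centered on it.

```python
def smooth(image, mask_size):
    """Mean filter: each output pixel is the average of the pixels under a
    mask_size x mask_size mask centered on it.

    Assumptions (not from the source text): mask_size is odd, and the mask
    is clipped where it extends past the image border.
    """
    rows, cols = len(image), len(image[0])
    half = mask_size // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


# A single bright pixel on a dark background, like the dot above an "i".
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0

no_smoothing = smooth(image, 1)  # a one-pixel mask leaves the image unchanged
blurred = smooth(image, 3)       # a 3x3 mask spreads the dot over 9 pixels
```

With a 3x3 mask the dot's peak value drops to one ninth of its original value, which shows how a small but meaningful structure can fade as the mask grows.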
One reason why some of the problems encountered at the preprocessing level are not possible to resolve during preprocessing is that some information which is necessary for choosing optimal preprocessing parameters is not yet available. Accordingly, it would be desirable to have a recognition system that provides interaction between higher (cognitive) level processing and the preprocessor in order to change the parameters controlling the preprocessor according to dynamically determined, higher level expectations.
U.S. Pat. No. 4,760,604 of Cooper et al. discloses a recognition system that may use multiple preprocessors. In that system, each of a number of adaptive modules can facilitate a different preprocessing scheme, but the preprocessing is done “globally” on the whole input pattern. It would be desirable, however, to use feedback information for a) locating a section or portion of the input pattern that needs additional preprocessing, and b) changing the preprocessing of only such a section or portion. This would be especially useful during recognition of complex objects where a specific type of “re-preprocessing” appropriate for one section could improve recognition of that particular section, but may have an adverse effect on recognition of the rest of the object if applied globally. Moreover, in order to selectively apply different preprocessing techniques to different regions of an object, a recognition system should be able to appropriately segment an object into parts. It would be desirable, therefore, to have a system which provides an appropriate, segmented representation of an object for the purpose of applying different preprocessing techniques at different regions of the object.
An effective recognition system should further be capable of recognizing a pattern regardless of its position and/or size within the overall context in which it is received. For example, the position of an address written on an envelope or a package can vary significantly, and the sizes of individual letters and numbers within the address are also not fixed. In many existing recognition systems, this problem is addressed at the preprocessing stage. In such existing systems, prior to a recognition stage, the received image is “cleaned”, in that the text is located, and surrounded by a rectangle that is then re-scaled or normalized. However, this approach suffers from significant drawbacks since at the level of preprocessing it is not yet known which sections of the input signal represent text or speech, and which sections represent background. Therefore, the output of such existing systems often consists of numerous false segmentations. It would be desirable to have a recognition system that does not rely on pre-segmentation of the input signal prior to recognition.
In addition, it would be desirable to provide a translationally invariant representation of an input object, within a system which also provides scale invariant recognition. The translationally invariant representation would permit an object to be described with the same values regardless of its position or location within the input signal. Such scale invariant recognition would allow recognition of an object that may appear in different scales or sizes, while the system need only be trained on one scale.
Another challenging problem in the field of pattern recognition is the identification and classification of objects that are only partially present within the input signal, or that are partially occluded for some reason. It would further be desirable to provide a recognition system that allows for the recognition of incomplete or partially occluded patterns as parts of a recognizable object.
One of the problems in the field of sequence analysis, as occurs in speech or cursive writing recognition systems, is referred to as the segmentation/binding dilemma. This problem stems from the fact that in order to unambiguously segment an input word into letters, the input word must be known, but in order to recognize the word, its constituent letters must be known. An existing approach to solving this problem (e.g. in cursive recognition applications), is to first make all possible segmentations, and then to choose an “optimal” one, with respect to the set of potentially recognizable objects, according to some predetermined criterion or cost function. Since this method is often computationally intensive, it is desirable to use an algorithm that can efficiently search the space of potentially recognizable objects. To this end, researchers have often used some variation of conventional dynamic programming optimization techniques. However, in some cases, dynamic programming techniques do not decrease the computational complexity of selecting an optimal segmentation. Accordingly, it would be desirable to have a method for recognizing an object which permits features of the object to be individually recognized and associated with the object, based on context information of some kind. Such a method should advantageously lend itself to a high degree of parallelism in its implementation, thus resulting in fast recognition results obtained in relatively few cycles. Such a method could, in some cases, advantageously provide an alternative to the dynamic programming based post-processing employed in existing systems to find an optimal segmentation of an input pattern.
Another problem related to sequence analysis in recognition systems is selecting a convenient representation of an input pattern that captures the sequential nature of the input signal. One technique used in existing recognition systems is based on what are generally referred to as “Hidden Markov Model” (HMM) algorithms. Although HMM algorithms have many useful properties, they only provide a global characterization of the input pattern. For example, in a handwriting recognition system, an HMM algorithm might provide only a global characterization of the input pattern equal to the probability that the complete pattern represents a certain dictionary word. However, such global characterizations are sometimes limited, and it would be desirable, therefore, to provide and employ descriptions of the input pattern other than a global characterization of the pattern during the recognition process. Moreover, systems based on HMM algorithms are generally not easily extensible to the analysis of two-dimensional signals. It would therefore be desirable to have a recognition system that, in contrast, can be easily extended to analysis of two-dimensional signals, as in image recognition such as face recognition and/or vehicle identification applications.
In summary, and in view of the various deficiencies and shortcomings of existing systems, it would be desirable to have a recognition system which provides a representation of an object in terms of its constituent parts, provides a translationally invariant representation of the object, and which provides scale invariant recognition. The system should further provide effective recognition of some patterns that are partially present in the input signal, or that are partially occluded, and also provide a suitable representation for sequences within the input signal, such that sequential ordering is preserved. Additionally, the system should provide a procedure/algorithm, based on dynamically determined, context based expectations, for identifying individual features/parts of an object to be recognized. The system should be computationally efficient, and capable of implementation in a highly parallelized embodiment, while providing an information processing system that utilizes the interaction between a higher (cognitive) processing level and various lower level modules. Furthermore, it would be desirable to provide a mechanism for improving the preprocessing of individual sections of an input pattern, either by applying one or more preprocessors selected from a set of several preprocessors, or by changing the parameters within a single preprocessor.
In accordance with the present invention, a recognition system is disclosed which may be applied to recognition of a variety of input signals, including handwriting, speech, and/or visual data. While the disclosed system is generally suitable for the recognition of one-dimensional sequences, such as speech or cursive writing, it is not restricted to one-dimensional problems and is also applicable to two dimensional image analyses, including face recognition and/or vehicle identification systems.
With reference to the disclosed system, the first stage of the recognition process is referred to as preprocessing. The preprocessing stage performs appropriate preprocessing functions, for example normalizing and filtering an input signal, S=(s1, s2, . . . , sM), and transforms the input signal into a feature vector, F=(f1, f2, . . . , fR). The feature vector is then presented to a number of detection units. The detection units detect parts of a number of potentially recognizable objects within the input signal. Recognizable objects may be, for example, words, faces, or vehicles, individual ones of which may be identified by their respective features. For example, in a handwriting recognition embodiment of the disclosed system, each detected part of a recognizable word may correspond to a “letter”, and each detection unit may be referred to as a “letter detector”, operating on a section of the feature vector corresponding to a letter from a predetermined alphabet. Detection units are positioned over the feature vector, F=(f1, f2, . . . , fR), so as to completely cover it, such that each feature in the feature vector is processed by at least one detection unit. Adjacent detection units may, in some circumstances, have overlapping receptive fields. In an illustrative handwriting recognition embodiment of the disclosed system, in which the number of letters in the alphabet is K, there are K detection units, each for detecting a different letter, which receive their input from the same section of the feature vector. As a result, each section of the feature vector can have K different interpretations depending on the activation levels of the detection units positioned over it. A structure reflecting the complete set of letter detector outputs across the complete feature vector is referred to as the detection matrix.
Each element of the detection matrix reflects an activation level of a detection unit for one letter in the alphabet, at a particular location within the feature vector. An activation level for a particular letter is a value representing the probability that the letter has been detected. Accordingly, each detection matrix element contains a value indicating the probability that a particular letter from the alphabet has been detected at a particular location within the feature vector. The process of transforming the feature vector into the detection matrix by the detection units is referred to as segmentation, and accordingly the set of detection units forms at least a part of what is referred to as the segmentation network.
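The construction of a detection matrix can be sketched as follows. This is a toy illustration under stated assumptions, not the disclosed implementation: the detection units of the disclosed system are neural networks, whereas the hypothetical `make_detector` below stands in for them with a simple template comparison, and the names `detection_matrix`, `width` and `stride` are likewise assumed for illustration.

```python
def make_detector(template):
    """Toy stand-in for a trained letter detector (assumption: the real
    detection units are neural networks, not template matchers)."""
    def detect(section):
        diff = sum(abs(a - b) for a, b in zip(section, template)) / len(template)
        return max(0.0, 1.0 - diff)  # activation in [0, 1]
    return detect


def detection_matrix(features, detectors, width, stride):
    """Slide K detectors over the feature vector.

    Returns (starts, matrix), where starts lists the window start positions
    and matrix[letter][i] is the activation of that letter's detector over
    the window starting at starts[i].
    """
    last = max(0, len(features) - width)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)  # ensure the feature vector is completely covered
    matrix = {
        letter: [detect(features[s:s + width]) for s in starts]
        for letter, detect in detectors.items()
    }
    return starts, matrix


# Two-letter alphabet (K = 2), each letter described by two features.
detectors = {"a": make_detector([1.0, 0.0]), "b": make_detector([0.0, 1.0])}
features = [1.0, 0.0, 0.0, 1.0]
starts, matrix = detection_matrix(features, detectors, width=2, stride=2)
```

Choosing a stride smaller than the width gives adjacent windows overlapping receptive fields, as the text permits.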
Following segmentation, the next stage of the recognition process is referred to as postprocessing. Postprocessing includes selecting a set of letters from the detection matrix, such that, according to some predetermined criteria or cost function, the selected set of letters represents a word from a predetermined word set, sometimes referred to as a “dictionary”. This selection of letters representing a dictionary word is also referred to as “binding” the letters to the word. The remainder of the components in the disclosed system, including what are referred to as simple units, complex units, word units, the decision module and the selective attention module are referred to as the binding network, and are employed as part of the postprocessing stage of the recognition process.
The relative position of the various units in the binding network, with respect to the position of the detection units of the segmentation network, is adjustable. This relative positioning may be adjusted in various ways, including the following two approaches: either the binding network is kept fixed and the position of the detection units in the segmentation network is changed by the selective attention module, or the position of the detection units in the segmentation network is fixed and the position of the binding network is changed. In the illustrative embodiments disclosed herein the segmentation network is fixed and the position of the binding network is changed; however, the invention is not limited to such an implementation, and may alternatively be embodied with the binding network in a fixed position and providing adjustment of the relative position of the segmentation network.
The binding network of the disclosed system selects a subset of the detection matrix elements corresponding to letters of a recognizable word in the dictionary. The number of detection matrix elements selected by the binding network is advantageously equal to the number of letters in the recognizable dictionary word, not to the number of columns of the detection matrix, in contrast to some existing systems, such as those employing HMM algorithms. For example, in the case where an input pattern represents the word “cat”, many elements of the detection matrix may have high activation values, such as the elements representing the letters “c, a, t”, and elements representing visually similar letters such as “o, l, e, n, u, i.” In this example, the goal of the binding network is to select only three elements from the detection matrix, namely those representing the letters “c”, “a” and “t”, and to discard or suppress all other elements from the detection matrix.
In the beginning of the disclosed binding procedure, an element from the detection matrix, for example one with the highest activation value, is selected as the “central letter”. The location of the selected central letter within the detection matrix determines a view of the input pattern to be employed by the recognition system. One or more words from the dictionary are then associated with the input pattern. For example, if the letter “c” is selected as the central letter, the words “act”, “ice” and “account” might be associated with the input pattern since they all contain the letter “c”. The dictionary word that is most strongly associated with the pattern, for example the word “act”, reflects high level, contextual expectations regarding the structure of the input pattern in the sense that the context of the word “act” requires that the letter “a” should be found in a certain region to the left of the central letter “c”, and the letter “t” should be found in a certain region to the right of the letter “c”. Accordingly, instead of trying to find all possible sequences from the detection matrix that represent recognizable objects, the recognition system now advantageously employs an active process of looking for certain letters at certain expected locations. The binding network proposes a “tentative” segmentation of the input pattern, meaning that it selects some elements from the detection matrix that satisfy high level expectations (in terms of their locations) and which are detected with high confidence. We call the letters forming the tentative segmentation the “selected letters”. The next step is to verify if the tentative segmentation is correct.
For example, if the input pattern represents the word “act”, and if the selected central letter is the letter “c”, then the tentative segmentation might be “a-c-t”, and in this particular case it would be the correct segmentation.
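The selection of a central letter and the proposal of a tentative segmentation can be sketched as below. This is a simplified illustration under stated assumptions, not the disclosed binding network: the detection matrix is modeled as a dictionary of per-column activations, expected regions are modeled with hypothetical `spacing` and `tolerance` parameters (in columns), the strength of association is taken to be the mean activation of the selected letters, and all activation values are invented for the example.

```python
def select_central(matrix):
    """Pick the (letter, column) element with the highest activation."""
    return max(
        ((letter, col) for letter, row in matrix.items() for col in range(len(row))),
        key=lambda lc: matrix[lc[0]][lc[1]],
    )


def tentative_segmentation(matrix, word, central, spacing=1, tolerance=0):
    """Look for each letter of `word` in its expected region around the
    central letter. Returns (score, picks) or None if the word cannot fit.

    picks is a list of (letter, column, activation) in word order; score is
    the mean activation (an assumed measure of association strength).
    """
    letter, col = central
    n_cols = len(next(iter(matrix.values())))
    best = None
    for i, ch in enumerate(word):
        if ch != letter:
            continue  # the central letter must align with this position
        picks = []
        for j, target in enumerate(word):
            if j == i:
                picks.append((target, col, matrix[letter][col]))
                continue
            centre = col + (j - i) * spacing  # expected column for this letter
            region = [c for c in range(centre - tolerance, centre + tolerance + 1)
                      if 0 <= c < n_cols]
            if not region or target not in matrix:
                picks = None  # expectation falls outside the input pattern
                break
            c_best = max(region, key=lambda c: matrix[target][c])
            picks.append((target, c_best, matrix[target][c_best]))
        if picks is not None:
            score = sum(p[2] for p in picks) / len(picks)
            if best is None or score > best[0]:
                best = (score, picks)
    return best


# Invented activations over three columns for an input resembling "act".
matrix = {
    "a": [0.8, 0.1, 0.0], "c": [0.1, 0.9, 0.1], "t": [0.0, 0.1, 0.7],
    "o": [0.6, 0.3, 0.0], "i": [0.2, 0.0, 0.0], "e": [0.0, 0.0, 0.1],
}
dictionary = ["act", "ice", "account"]

central = select_central(matrix)
ranked = sorted(
    ((w, tentative_segmentation(matrix, w, central))
     for w in dictionary if central[0] in w),
    key=lambda ws: ws[1][0] if ws[1] else -1.0,
    reverse=True,
)
best_word, (best_score, best_picks) = ranked[0]
```

In this toy example “account” is associated with the pattern because it contains “c”, but it yields no tentative segmentation because its remaining letters would fall outside the three available columns.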
In order to determine whether the tentative segmentation is correct, the binding network then selects an element from the detection matrix as the next central letter, referred to as the target letter. The target letter is one of the letters from the tentative segmentation, for example the letter “t”. The binding network is then repositioned over the detection matrix based on the position of the target letter within the detection matrix. One of the goals of repositioning the binding network is to verify whether the previously selected letters, from the tentative segmentation, are at their expected locations. If the selected letters are within expected regions, then the next target letter is selected. Otherwise, if the selected letters are not within expected regions, some of the letters from the tentative segmentation are dismissed as candidates and the binding network generates a new tentative segmentation. The binding process is terminated once the location of every selected letter is determined to be within an expected region with respect to all other selected letters.
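The verification step can be sketched as a mutual consistency check in which each selected letter takes a turn as the target letter. This is a minimal sketch under assumptions: expected regions are again modeled with hypothetical `spacing` and `tolerance` parameters, and the function reports inconsistent letter pairs rather than generating a new tentative segmentation.

```python
def verify_segmentation(picks, spacing=1, tolerance=0):
    """Check that every selected letter lies in its expected region with
    respect to every other selected letter.

    picks: list of (letter, column) in word order.
    Returns a list of (target_index, other_index) pairs that violate the
    expectations; an empty list means binding can terminate.
    """
    violations = []
    for i, (_, target_col) in enumerate(picks):  # reposition on each target
        for j, (_, col) in enumerate(picks):
            expected = target_col + (j - i) * spacing
            if abs(col - expected) > tolerance:
                violations.append((i, j))
    return violations


consistent = verify_segmentation([("a", 0), ("c", 1), ("t", 2)])
inconsistent = verify_segmentation([("a", 0), ("c", 1), ("t", 5)])
```

Here the second segmentation is rejected because the “t” lies too far to the right of both “a” and “c”, so a system following the described procedure would dismiss it as a candidate and form a new tentative segmentation.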
In addition to selecting a set of elements from the detection matrix, the disclosed system may selectively modify preprocessing parameters in accordance with feedback reflecting dynamically determined, higher level expectations, such as location or temporal expectations with respect to various components of a recognizable object, such as letters of a recognizable word in a handwriting recognition system. Significantly, these changes to the preprocessing parameter values are performed locally with respect to one or more portions of the detection matrix, meaning that the changed values may be used for “re-preprocessing” certain regions of the input pattern. This is useful, and often necessary, during recognition of complex objects where changing the preprocessing of one section of the input pattern improves recognition of that section but has an adverse effect on recognition of the rest of the object.
Further in the disclosed system, in the beginning of the recognition process, default values for various specific recognition threshold parameters, such as an edge presence threshold for an edge detector, are used for preprocessing of the whole input pattern. If there is confusion among top ranked words (“candidate words”), or if the top ranked word is not recognized with acceptable confidence, the decision module initiates the re-preprocessing of one or more sections of the input pattern. Only one word at a time is selected for re-preprocessing. If a location estimate and/or detection estimate of some of the selected letters is below a predetermined detection threshold, the decision module varies the parameters that were used in generating the estimates in question until the best estimate values, for each of the selected letters, are obtained. It is important to note that such parameters are modified in a controlled way, meaning that: a) the values of the parameters are restricted to vary only within certain predetermined, permissible intervals, and b) the parameters are modified until the best possible recognition is achieved for the given input pattern.
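The controlled variation of a preprocessing parameter for a single section can be sketched as a bounded search. This is an illustration under assumptions rather than the disclosed decision module: the function `reprocess_section`, the uniform grid search, and the toy `estimate` function (whose confidence peaks at a per-section parameter value) are all hypothetical.

```python
def reprocess_section(section, estimate, default, interval, steps, threshold):
    """Vary one preprocessing parameter for one section of the input.

    The parameter is changed only if the estimate obtained with the default
    value is below `threshold`, and only within the permissible `interval`.
    Assumes steps >= 2. Returns (parameter_value, estimate_value).
    """
    base = estimate(section, default)
    if base >= threshold:
        return default, base  # confident enough; keep the default value
    lo, hi = interval
    candidates = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    best = max(candidates, key=lambda p: estimate(section, p))
    best_score = estimate(section, best)
    return (best, best_score) if best_score > base else (default, base)


def estimate(section, param):
    """Toy detection estimate (assumption): confidence falls off linearly
    as the parameter moves away from the section's ideal value."""
    return max(0.0, 1.0 - abs(param - section["peak"]))


faint_letter = {"peak": 0.3}  # needs a lower edge threshold than the default
clear_letter = {"peak": 0.8}  # already well detected with the default

new_param, new_score = reprocess_section(
    faint_letter, estimate, default=0.8, interval=(0.1, 0.5), steps=5, threshold=0.7)
kept_param, kept_score = reprocess_section(
    clear_letter, estimate, default=0.8, interval=(0.1, 0.5), steps=5, threshold=0.7)
```

Only the poorly detected section receives a new parameter value, which mirrors the local character of the re-preprocessing described above.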
The disclosed system may advantageously be embodied within a highly parallel implementation, thus providing relatively fast recognition results. The disclosed system introduces an alternative to dynamic programming based post-processing for finding an optimal segmentation of an input pattern. The disclosed system is a working neural network-based system that employs context information to segment, modify and organize bottom up information in order to achieve accurate recognition results. In addition to providing a global characterization of the input pattern, the disclosed system may also provide one or more local characterizations of the input pattern.