The present invention is directed to methods and systems for pattern analysis using neural networks and, more particularly, to methods and systems for pattern analysis using neural networks having an increased resolution input field with fewer network interconnections.
Various techniques have been applied to the problem of distinguishing between a set of patterns invariant to changes in the position, size or angular orientation of the patterns. These techniques include statistical, symbolic, optical and neural network techniques.
The statistical, symbolic, and optical techniques are based on a two-step process of feature extraction followed by classification. For the feature extraction step, the system designer is required to specify a set of attributes capable of separating a set of training patterns into subgroups containing all distorted (i.e., translated, scaled and/or in-plane rotated) views of each distinct pattern. The system then organizes these features and uses them to classify incoming patterns.
There are at least three major disadvantages of these two-step approaches:
(1) It is not always obvious which features are sufficient for separating the set of training patterns such that all distorted views of a pattern will be classified as belonging to the same group.
(2) These approaches require a fairly large, if not exhaustive, set of training patterns to correctly organize the features such that novel views of the patterns will be correctly classified.
(3) The training time increases as the number of features and the training set size increase. Thus, these systems tend to be very slow.
A different approach to the problem of distortion invariant pattern recognition uses neural networks. Unlike the methods discussed above, in the neural network approach, the system is provided only with a set of distorted views of a set of distinct patterns (i.e., a set of translated, scaled, and/or in-plane rotated views of each distinct pattern) and, through training, learns what the relevant features are as well as how to distinguish between the distinct patterns.
Multi-layer, first-order neural networks using the backward error propagation (backprop) algorithm for training have been shown to be effective for distortion invariant pattern recognition. Using this method, the neural network is provided with a large set of distorted views of a set of patterns. The neural network weights are then adjusted using the back propagation learning rule such that the neural network correctly classifies a specified percentage of the training set patterns. The major disadvantages of this system are:
(1) The training set needs to be large enough and fairly indicative of the expected distortions so that the neural network can generalize rather than memorize what features to look for.
(2) The training time increases with the size of the training set and thus these systems are also fairly slow.
Furthermore, these first-order neural networks achieve only 80%-90% recognition accuracy.
Progress in higher-order neural networks (HONNs) has been more promising. Reid et al. (M. B. Reid, L. Spirkovska, and E. Ochoa, "Simultaneous Position, Scale, and Rotation Invariant Pattern Classification Using Third-Order Neural Networks", Int. J. of Neural Networks, 1, 1989, pp. 154-159; and M. B. Reid, L. Spirkovska, and E. Ochoa, "Rapid Training of Higher-Order Neural Networks for Invariant Pattern Recognition", Proc. of Joint Int. Conf. on Neural Networks, Wash., D.C., Jun. 18-22, 1989, vol. 1, pp. 689-692, the disclosures of which are incorporated herein by reference in their entireties) have demonstrated that a third-order neural network is capable of achieving 100% accuracy in distinguishing between two patterns in a 9×9 pixel input field regardless of position, scale or in-plane rotation changes. The network needed to be trained on only one view of each object, and required only 10 to 20 passes to learn to distinguish between the objects in any in-plane rotational orientation, scale, or translated position. Thus, for pattern recognition, HONNs are superior to multi-layered first-order backprop trained networks in terms of training time, training set size and accuracy.
As an example, the use of a HONN for recognizing two-dimensional views of objects will first be discussed. FIG. 1A is a view of an object 20 (the space shuttle orbiter) in a two-dimensional input field 30. FIG. 1B is a view of object 20 after it has been translated across input field 30. FIG. 1C is a view of object 20 after it has been reduced in size (scaled) in input field 30. FIG. 1D is a view of object 20 after it has been rotated in-plane in input field 30. The output of output node i in a general HONN, denoted by $y_i$, is given by:

$$y_i = \Theta\Big(\sum_j w_{ij} x_j + \sum_j \sum_k w_{ijk} x_j x_k + \sum_j \sum_k \sum_l w_{ijkl} x_j x_k x_l + \cdots\Big) \tag{1}$$

where $\Theta(f)$ is a non-linear threshold function such as, for example, the hard limiting transfer function given by:

$$y_i = \begin{cases} 1, & \text{if } f > 0, \\ 0, & \text{otherwise;} \end{cases} \tag{2}$$
the lower case x's are the excitation values of the input nodes; and the interconnection matrix elements, w, determine the weight that each input is given in the summation.
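By way of illustration only, equations (1) and (2) may be sketched in Python as follows. The sketch truncates the series at third order and stores the weights in dictionaries keyed by input indices, with missing entries treated as zero; these choices, and the names `hard_limit` and `honn_output`, are assumptions made for the example and are not taken from the cited references.

```python
def hard_limit(f):
    """Hard-limiting transfer function of equation (2): 1 if f > 0, else 0."""
    return 1 if f > 0 else 0


def honn_output(x, w1=None, w2=None, w3=None):
    """Output of a single node per equation (1), truncated at third order.

    x  : sequence of input excitation values
    w1 : {j: weight}          first-order terms
    w2 : {(j, k): weight}     second-order terms
    w3 : {(j, k, l): weight}  third-order terms
    Absent entries are treated as zero weights (an assumption of this sketch).
    """
    total = 0.0
    for j, w in (w1 or {}).items():
        total += w * x[j]
    for (j, k), w in (w2 or {}).items():
        total += w * x[j] * x[k]
    for (j, k, l), w in (w3 or {}).items():
        total += w * x[j] * x[k] * x[l]
    return hard_limit(total)
```

For instance, with inputs `x = [1, 0, 1, 1]` and second-order weights `{(0, 2): 0.5, (0, 1): -2.0}`, only the product of inputs 0 and 2 is non-zero, so the weighted sum is 0.5 and the node outputs 1.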
Using information about relationships expected between the input nodes under various distortions, the interconnection weights can be constrained such that invariance to given distortions is built directly into the network architecture. See Giles et al. (G. L. Giles and T. Maxwell, "Learning, Invariances, and Generalization in High-Order Neural Networks", Applied Optics, 26, 1987, pp. 4972-4978; and G. L. Giles, R. D. Griffin and T. Maxwell, "Encoding Geometric Invariances in Higher-Order Neural Networks", Neural Information Processing Systems, American Institute of Physics Conference Proceedings, 1988, pp. 301-309, the disclosures of which are incorporated herein by reference in their entireties) for a discussion of building invariance into HONNs.
As an example, in a second-order neural network 40 as illustrated in FIG. 2, the inputs ($x_1$ through $x_4$) are first combined in pairs at product points 42 (denoted by an X) to determine intermediate values. The intermediate values are weighted and summed at summation point 44, and the output of output node $y_i$ is then determined by applying the threshold function to the value determined at summation point 44. In accordance with equation (1) above, the output for a strictly second-order network is given by the function:

$$y_i = \Theta\Big(\sum_j \sum_k w_{ijk} x_j x_k\Big) \tag{3}$$
The invariances achieved using this architecture depend on the constraints placed on the weights.
In an example, each pair of input pixels combined in a second-order network defines a line with a certain slope. As shown in FIGS. 3A and 3B, when an object 21 is moved (translated) or scaled in an input field 30, the two points in the same relative positions within the object still form the end points of a line having the same slope. Thus, provided that all pairs of points which define the same slope are connected to the output node using the same weight, the network will be invariant to distortions in scale and translation. In particular, for two pairs of pixels (j, k) and (l, m), with coordinates $(x_j, y_j)$, $(x_k, y_k)$, $(x_l, y_l)$, and $(x_m, y_m)$ respectively, the weights are constrained according to:

$$w_{ijk} = w_{ilm}, \quad \text{if } \frac{y_k - y_j}{x_k - x_j} = \frac{y_m - y_l}{x_m - x_l} \tag{4}$$
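The slope constraint of equation (4) may be illustrated, under stated assumptions, by assigning each pair of pixels to an equivalence class identified by its slope. In the sketch below the slope is represented as an integer direction reduced by its greatest common divisor, so that vertical lines need no division; the names `slope_key` and `slope_class_counts`, and the representation of an image as a set of "on" pixel coordinates, are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import gcd


def slope_key(p, q):
    """Canonical key for the slope of the line through distinct pixels p and q.

    Pixels are (x, y) integer tuples. The direction (dx, dy) is reduced by
    its gcd and sign-normalized, so equal slopes always yield equal keys.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    g = gcd(dx, dy)
    if g == 0:
        raise ValueError("p and q must be distinct pixels")
    dx, dy = dx // g, dy // g
    if dx < 0 or (dx == 0 and dy < 0):
        dx, dy = -dx, -dy
    return (dx, dy)


def slope_class_counts(pixels_on):
    """Count the pairs of 'on' pixels that fall in each equal-slope class."""
    return Counter(slope_key(p, q) for p, q in combinations(sorted(pixels_on), 2))


# A three-pixel object, the same object translated, and the same object
# scaled by 2 (coordinates doubled) all give identical class counts.
shape = {(0, 0), (2, 1), (1, 3)}
translated = {(x + 5, y + 4) for x, y in shape}
scaled = {(2 * x, 2 * y) for x, y in shape}
```

Because a network that shares one weight per class sees only these counts, any two inputs with identical counts necessarily produce the same output.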
Alternatively, the pair of points combined in a second-order network may define a distance. As shown in FIGS. 4A and 4B, when an object 22 is moved (translated) across input field 30 or rotated within a plane, the distance between a pair of points in the same relative positions on the object does not change. Thus, as long as all pairs of points which are separated by equal distances are connected to the output with the same weight, the network will be invariant to translation and in-plane rotation distortions. The weights for this set of invariances are constrained according to:

$$w_{ijk} = w_{ilm}, \quad \text{if } \lVert d_{jk} \rVert = \lVert d_{lm} \rVert \tag{5}$$

That is, the magnitude of the vector defined by pixels j and k ($d_{jk}$) is equal to the magnitude of the vector defined by pixels l and m ($d_{lm}$).
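As a minimal sketch of the distance constraint of equation (5), each pair of pixels may be keyed by its squared Euclidean distance, which is an integer on a pixel grid and is equal for two pairs exactly when their distances are equal. The function names, the use of the squared distance as the class key, and the summation over unordered pairs only are assumptions of the example.

```python
from collections import Counter
from itertools import combinations


def distance_key(p, q):
    """Squared Euclidean distance between pixels p and q (exact integer)."""
    return (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2


def distance_class_counts(pixels_on):
    """Count the pairs of 'on' pixels that fall in each equal-distance class."""
    return Counter(distance_key(p, q) for p, q in combinations(sorted(pixels_on), 2))


def second_order_output(pixels_on, shared_w):
    """Equation (3) for binary inputs with one shared weight per distance class.

    Only unordered pairs of distinct 'on' pixels are summed, a simplification
    of the full double sum made for this sketch.
    """
    f = sum(shared_w.get(key, 0.0) * n
            for key, n in distance_class_counts(pixels_on).items())
    return 1 if f > 0 else 0


# An L-shaped object and the same object rotated by 90 degrees and translated.
shape = {(0, 0), (1, 0), (0, 1)}
moved = {(-y + 5, x + 2) for x, y in shape}
```

Both `shape` and `moved` contain two pairs at squared distance 1 and one pair at squared distance 2, so any choice of shared weights gives the same output for both.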
Thus, when invariance to translation and scale (without invariance to rotation) or to translation and rotation (without invariance to scale) is desired, a second-order neural network is appropriate.
To achieve invariance to translation, scale, and in-plane rotation simultaneously, a third-order neural network 60, as shown in FIG. 5, can be used. The third-order neural network 60 illustrated in FIG. 5 includes input nodes $x_1$ through $x_4$, connected in triplets to product points 62 (which are similar to product points 42 in the second-order neural network of FIG. 2 except that the excitation values of three input nodes are multiplied thereat), where intermediate values are determined. The intermediate values determined at product points 62 are weighted and summed at summation point 64, and the summation is supplied to a single output node $y_i$.
The output for the strictly third-order neural network shown in FIG. 5, in accordance with equation (1), is given by the function:

$$y_i = \Theta\Big(\sum_j \sum_k \sum_l w_{ijkl} x_j x_k x_l\Big) \tag{6}$$
That is, when the input field 30 is a matrix of pixels, as is commonly used for object recognition, all sets of input pixel triplets in object 24 are used to form triangles having included angles $(\alpha, \beta, \gamma)$. As shown in FIGS. 6A and 6B, when object 24 is translated, scaled, or rotated in-plane, the three points in the same relative positions on the object 24 still form the included angles $(\alpha, \beta, \gamma)$. In order to achieve invariance to all three distortions, all sets of triplets forming similar triangles are connected to the output node of the neural network with the same weight. That is, the weight for the triplet of inputs (j, k, l) is constrained to be a function of the associated included angles $(\alpha, \beta, \gamma)$ such that the weights for all cyclic orderings of the angles (the elements of the alternating group on three elements) are equal:

$$w_{ijkl} = w_{(i,\alpha,\beta,\gamma)} = w_{(i,\beta,\gamma,\alpha)} = w_{(i,\gamma,\alpha,\beta)} \tag{7}$$
Note that the order of the angles matters, but not which angle is measured first.
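One possible way to realize the constraint of equation (7), offered only as a sketch, is to compute the included angles of each triplet in a fixed (here counterclockwise) traversal order and then select a canonical cyclic rotation of the resulting angle tuple. The rounding of angles to bins of `bin_deg` degrees, the counterclockwise convention, and the name `angle_key` are assumptions of this example and are not specified in the text above.

```python
from math import acos, degrees, hypot


def angle_key(a, b, c, bin_deg=1.0):
    """Canonical key for the included angles of the triangle on pixels a, b, c.

    The vertices (distinct (x, y) tuples) are first put in counterclockwise
    order so that the order of the angles is meaningful. Each angle is
    rounded to a bin of bin_deg degrees (an assumed quantization), and the
    smallest of the three cyclic rotations is returned, so the key does not
    depend on which angle is measured first.
    """
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    if cross < 0:  # clockwise: swap two vertices to make it counterclockwise
        b, c = c, b
    pts = [a, b, c]
    angles = []
    for i in range(3):
        p, q, r = pts[i], pts[(i + 1) % 3], pts[(i + 2) % 3]
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        cos_ang = (v1[0] * v2[0] + v1[1] * v2[1]) / (hypot(*v1) * hypot(*v2))
        cos_ang = max(-1.0, min(1.0, cos_ang))  # guard against rounding error
        angles.append(round(degrees(acos(cos_ang)) / bin_deg))
    return min(tuple(angles[i:] + angles[:i]) for i in range(3))


# A 3-4-5 right triangle, and the same triangle scaled by 2, rotated by
# 90 degrees, translated, and listed starting from a different vertex.
original = angle_key((0, 0), (4, 0), (0, 3))
distorted = angle_key((4, 10), (10, 10), (10, 18))
mirrored = angle_key((0, 0), (-4, 0), (0, 3))
```

Here `original` and `distorted` share the key `(37, 53, 90)`, whereas the mirror image yields `(37, 90, 53)`, which reflects the remark that the order of the angles matters while the starting angle does not.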
Because HONNs are capable of providing non-linear separation using only a single layer, once invariances are incorporated into the architecture, the neural network can be trained (i.e., values assigned to the weights) using a simple rule of the form:

$$\Delta w_{ijk} = (t_i - y_i)\, x_j x_k \tag{8}$$

for a second-order neural network, or

$$\Delta w_{ijkl} = (t_i - y_i)\, x_j x_k x_l \tag{9}$$
for a third-order neural network, where the expected training output, t, the actual output, y, and the inputs, x, are all binary. Prior to training, the weights, w, can be initialized to 0 or to random values.
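The training rule of equations (8) and (9) may be sketched as follows for a single output node with shared weights. When every input combination in an equivalence class shares one weight, summing the rule over the class changes that weight by (t - y) times the number of active combinations in the class. The example therefore represents each training pattern by a dictionary of per-class counts; this representation, the zero initialization, the stopping criterion, and the names `output` and `train` are assumptions of the sketch.

```python
def output(class_counts, w):
    """Thresholded weighted sum over equivalence classes (binary inputs)."""
    f = sum(w.get(key, 0.0) * n for key, n in class_counts.items())
    return 1 if f > 0 else 0


def train(examples, max_passes=20):
    """Apply the rule of equations (8)/(9) until a pass produces no errors.

    examples : list of (class_counts, target) with target in {0, 1}
    Returns the learned shared weights and the number of passes used.
    """
    w = {}  # weights start at 0; absent keys are treated as 0
    for n_pass in range(1, max_passes + 1):
        errors = 0
        for class_counts, target in examples:
            err = target - output(class_counts, w)
            if err:
                errors += 1
                for key, n in class_counts.items():
                    w[key] = w.get(key, 0.0) + err * n
        if errors == 0:
            return w, n_pass
    return w, max_passes


# Two hypothetical patterns described by counts per equal-distance class:
# a 3-pixel bar (two pairs at squared distance 1, one at 4) and a 3-pixel
# L shape (two pairs at squared distance 1, one at 2).
bar = {1: 2, 4: 1}
ell = {1: 2, 2: 1}
weights, passes = train([(bar, 1), (ell, 0)])
```

Because only one view of each pattern is presented and distorted views map to the same counts, the trained weights classify any translated or rotated view of the two hypothetical patterns identically to the training views.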
Second- and third-order neural networks as described above are disclosed in the above incorporated references of Reid et al.
The main advantage of building invariance to geometric distortions directly into the architecture of the HONN is that the network is forced to treat all distorted views of an object as the same object. Distortion invariance is achieved before any input vectors (training patterns) are presented to the network. Thus, the network needs to learn to distinguish between just one view of each object, not numerous distorted views of each object.
While building invariances into the network greatly reduces the number of independent weights which must be learned, some storage must still be used to associate each triplet of inputs to a set of included angles.
A disadvantage of HONNs is that as their order and the number of input nodes increase, the number of interconnections required (i.e., interconnections between the input nodes, $x_1$ through $x_n$, and the product points 42 or 62) becomes excessive. For example, a network with M inputs and one output using rth order terms requires M-choose-r interconnections. For higher orders, this number, which is on the order of $M^r$, is clearly excessive.
In the field of two-dimensional object recognition, for example, wherein an N×N pixel input field is used, combinations of three pixels (i.e., in a third-order neural network) can be chosen in $N^2$-choose-3 ways. Thus, for a 9×9 pixel input field, the number of possible triplet combinations (for a third-order neural network) is 81-choose-3 or 85,320. Increasing the resolution to 128×128 pixels increases the number of possible interconnections to $128^2$-choose-3 or $7.3 \times 10^{11}$, a number too great to store on most machines. For example, on a Sun 3/60 with 30 MB of swap space, a maximum of 5.6 million (integer) interconnections can be stored, limiting the input field size for fully connected third-order neural networks to about 18×18 pixels. Furthermore, the number of interconnections required to fully connect a 128×128 pixel input field (about $10^{12}$) is far too large to allow a parallel implementation in any hardware technology that will be commonly available in the foreseeable future.
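The interconnection counts quoted above can be reproduced with a short calculation. The sketch below uses the standard-library function `math.comb` (Python 3.8 or later); the helper name `third_order_interconnections` is hypothetical.

```python
from math import comb


def third_order_interconnections(n):
    """Number of pixel triplets in a fully connected n x n input field."""
    return comb(n * n, 3)


for n in (9, 18, 19, 128):
    print(f"{n}x{n}: {third_order_interconnections(n):,}")
# 9x9: 85,320
# 18x18: 5,616,324
# 19x19: 7,775,940
# 128x128: 732,873,539,584
```

The 18×18 field requires roughly 5.6 million interconnections, while a 19×19 field already requires about 7.8 million, which is consistent with the stated limit of about 18×18 pixels.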
Spirkovska et al. (L. Spirkovska and M. B. Reid, "Connectivity Strategies for Higher-Order Neural Networks Applied to Pattern Recognition", Int. Joint Conf. on Neural Networks, June, 1990, Vol. 1, pp. 21-26, the disclosure of which is incorporated herein by reference in its entirety) discuss techniques for reducing the number of interconnections in a HONN, so that the number of input nodes can be increased. In particular, regional connectivity was evaluated, in which triplets of pixels are connected to the output node only if the distances between all of the pixels comprising the triplet fall within a set of preselected regions. Using this strategy, the input field size was increased to 64×64 while still retaining many of the advantages shown previously, such as a small number of training passes, training on only one view of each object, and successful recognition invariant to in-plane rotation and translation.
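A regional connectivity test of the kind described may be sketched as follows. The representation of the preselected regions as closed distance intervals, the particular interval used in the example, and the names `in_regions` and `regional_triplets` are assumptions of this sketch; the cited reference should be consulted for the regions actually used.

```python
from itertools import combinations
from math import dist


def in_regions(d, regions):
    """True if distance d lies in any of the (low, high) closed intervals."""
    return any(low <= d <= high for low, high in regions)


def regional_triplets(n, regions):
    """Triplets of pixels in an n x n field whose three pairwise distances
    all fall within the preselected regions."""
    pixels = [(x, y) for x in range(n) for y in range(n)]
    kept = []
    for a, b, c in combinations(pixels, 3):
        if all(in_regions(dist(p, q), regions)
               for p, q in ((a, b), (b, c), (a, c))):
            kept.append((a, b, c))
    return kept


# Hypothetical region: pairwise distances between 1.0 and 1.5 pixels.
kept = regional_triplets(3, [(1.0, 1.5)])
```

In this 3×3 example, 16 of the 84 possible triplets are retained. The loop nevertheless examines every triplet and performs up to three distance comparisons for each, so the selection itself remains costly on a sequential machine.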
However, using regional connectivity, recognition invariant to changes in scale could not be achieved. Also, as the input field size increased, the amount of time for each pass on a sequential machine increased dramatically. The 64×64 pixel input field network required on the order of days on a Sun 3/60 to learn to distinguish between two objects, even though the number of interconnections was greatly reduced from the fully connected version, because the number of logical comparisons required to determine whether the distances between pixels fall within the preselected regions was still huge.