The invention is directed to neural networks, and more particularly to the training of a neural network to recognize a target or a pattern, or to otherwise evaluate data.
Neural networks are parallel processing structures consisting of nonlinear processing elements or nodes interconnected by fixed or variable weights. A node sums n weighted inputs and passes the result through a nonlinear function. A node is characterized by the amount of an internal threshold, and by the type of nonlinearity used. More complex nodes may contain local memory, temporal integrators, or more complex mathematical operators. These topologies can be specified by differential equations that typically consist of memory decay terms, feedforward terms, and/or feedback terms and can be constructed to generate arbitrarily complex decision regions for stimulus-response pairs; hence they are well suited for use as detectors and classifiers.
Classic pattern recognition algorithms (e.g. detection, classification, target recognition) require assumptions concerning the underlying statistics of the environment. Neural networks, on the other hand, are non-parametric and can effectively address a broad class of problems, as is described, for example, in R. P. Lippmann, "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, pages 4-22, April 1987. Further, neural networks have an intrinsic fault tolerance. Some "neurons" may fail and yet the overall network can still perform well because information is distributed across all of the elements of the network (see, for example, Rumelhart and McClelland, "Parallel Distributed Processing," Vol. I, MIT Press, Cambridge, Mass., pages 423-443, 472-486 (1986)). This is not possible in strictly Von Neumann architectures.
Neural network paradigms can be divided into two categories: supervised learning and unsupervised learning. In supervised learning, with which we are concerned here, input data is associated with some output criterion in a one-to-one mapping, with this mapping known a priori. The mapping is then learned by the network in a training phase. Future inputs which are similar to those in the training sample will be classified appropriately.
Multiple layer perceptrons, a type of neural network also known as a feedforward network, are typically used in supervised learning applications. Each computation node sums n weighted inputs, subtracts a threshold value (bias term) and passes the result through a logistic function. An appropriate choice of logistic function provides a basis for global stability of these architectures. Single layer perceptrons (i.e., feedforward networks consisting of a single input layer) define decision regions separated by a hyperplane. If inputs from different data classes are linearly separable, a hyperplane can be defined between the classes by adjusting the values of the weights and bias terms. If the inputs are not linearly separable, containing overlapping distributions, a least mean square (LMS) solution is typically generated to minimize the mean squared error between the calculated output of the network and the actual desired output.
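The computation performed by a single node, as described above, can be illustrated with a minimal Python sketch. The function names and the example weight and threshold values are hypothetical and are chosen only for illustration; the standard logistic 1/(1 + exp(-x)) is assumed as the nonlinearity.

```python
import math


def logistic(x):
    """Standard logistic (sigmoid) nonlinearity; assumed here for illustration."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, threshold):
    """Sum the n weighted inputs, subtract the threshold, apply the logistic."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return logistic(activation)


# Hypothetical two-input node: these weights and this threshold place the
# separating hyperplane so that only the input (1, 1) produces a high output.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, node_output([a, b], [10.0, 10.0], 15.0))
```

With these illustrative values the node approximates a logical AND, which is one example of a linearly separable mapping that a single hyperplane can realize.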
Two layer perceptrons (i.e., neural networks with a single hidden layer of processing elements) can define unbounded, arbitrary convex polytopes in the hyperspace spanned by the inputs. These regions are generated by the intersections of multiple hyperplanes and have at most as many sides as there are nodes in the hidden layer.
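The way intersecting hyperplanes bound a convex region can be sketched as follows. This is a simplified illustration under two assumptions not stated above: each hidden node uses a hard threshold rather than a logistic function, and the output node fires only when every hidden node fires. The names and the triangular example are hypothetical.

```python
def in_half_space(point, weights, threshold):
    """One hidden node with a hard threshold: True on one side of its hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) - threshold > 0


def in_convex_region(point, hidden_nodes):
    """Output node acting as a logical AND over all hidden nodes.

    The accepted region is the intersection of the half-spaces, so it is
    convex and has at most len(hidden_nodes) sides.
    """
    return all(in_half_space(point, w, t) for w, t in hidden_nodes)


# Three hypothetical hidden nodes bounding a triangle in the plane:
# x > 0, y > 0, and x + y < 1 (written as -x - y > -1).
triangle = [
    ((1.0, 0.0), 0.0),
    ((0.0, 1.0), 0.0),
    ((-1.0, -1.0), -1.0),
]

print(in_convex_region((0.2, 0.2), triangle))
print(in_convex_region((0.8, 0.8), triangle))
```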
Three layer perceptrons can form arbitrarily complex decision regions. No more than three layers of elements in perceptron networks are necessary to solve arbitrary classification mapping problems (see A. N. Kolmogorov, "On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition", Dokl. Akad. Nauk SSSR, Vol. 114, pages 953-956, 1957).
Both continuous valued inputs and continuous valued outputs may be implemented, allowing for a wide range of input types and output categories. The inputs received by the respective nodes of the input layer define a vector containing the input feature values to be studied. These may consist of state-space components, frequency components, pixel values, transform coefficients, or any other features considered important and representative of the sample data contents to be learned.
Given a network architecture, a training set of input patterns, and the corresponding target output values (desired output values), every set of weight values and bias values defines the output of the network for each presented pattern. The error between the actual output of the network and the target or desired output value defines a response surface over an N-dimensional hyperspace, where there are N weights and bias terms to be adapted. Training of a multi-layered network can be achieved through a backpropagation algorithm (see, for example, the above-mentioned Rumelhart and McClelland text), which implements a gradient search over the error response surface for the set of weight values which minimizes the sum of the squared error between the actual and target output values. A backpropagation algorithm which purports to accelerate the training relative to the traditional techniques is described in U.S. Pat. Nos. 4,912,649, 4,912,652, 4,912,654, and 4,912,655, issued to Wood, and U.S. Pat. No. 4,912,651 issued to Wood et al. Another backpropagation algorithm for use with a particular neural network is described in U.S. Pat. No. 4,918,618 issued to Tomlinson, Jr.
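The gradient search over the error response surface can be illustrated in its simplest case. The Python sketch below trains a single logistic node by batch gradient descent on the sum of squared errors; it is not a full multi-layer backpropagation implementation and is not the method of any of the cited patents. The learning rate, epoch count, and the logical-OR training set are hypothetical choices, and the bias is written as an added term (the negative of a threshold).

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, xs):
    return logistic(sum(w * x for w, x in zip(weights, xs)) + bias)


def sum_squared_error(weights, bias, patterns):
    """Height of the error response surface at the point (weights, bias)."""
    return sum((t - predict(weights, bias, xs)) ** 2 for xs, t in patterns)


def train(patterns, n_inputs, rate=0.5, epochs=2000):
    """Batch gradient descent on the sum of squared errors (illustrative values)."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for xs, t in patterns:
            y = predict(weights, bias, xs)
            # Derivative of (t - y)^2 with respect to the node's net input.
            delta = -2.0 * (t - y) * y * (1.0 - y)
            for i, x in enumerate(xs):
                grad_w[i] += delta * x
            grad_b += delta
        # Step against the gradient, i.e. downhill on the error surface.
        weights = [w - rate * g for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b
    return weights, bias


# Hypothetical training set: logical OR, which is linearly separable.
or_patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
initial_error = sum_squared_error([0.0, 0.0], 0.0, or_patterns)
weights, bias = train(or_patterns, n_inputs=2)
final_error = sum_squared_error(weights, bias, or_patterns)
print(initial_error, final_error)
```

Because this example has a single node and a separable training set, the search does not encounter the difficulties that arise on more complex error surfaces.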
However, the differential equations and their associated stability functions defining the neural network generate energy surfaces that may contain many local optima, so that the error response surface may contain many corresponding local minima that may be far removed from the globally optimum solution. A gradient technique such as a backpropagation algorithm may lead to entrapment in these suboptimal solutions so that the network inaccurately classifies input patterns. For example, in the approach taught by the Tomlinson patent, single changes in a bit structure representing weight values of a neural network are made after each run with test data, and the process looks at one parameter at a time. As a result, the process solutions (sets of weight values) take directions of steepest descent toward locally best solutions with respect to individual parameters and a globally best solution is unlikely to be identified.
One strategy to avoid the problem of local optima is simply to restart the optimization with a new random set of weight values, in the hope that a different optimum will be found. Of course, there is no guarantee that such a minimal energy well will not also be a local solution. Another technique is to perturb the weight values whenever the algorithm seems to be in a local minimum point and then continue training, but this does not guarantee that the same local solution will not be rediscovered (see, for example, the above-mentioned Rumelhart and McClelland text). Further, should the response surface be pocked with many local optima, the constant modification of the weight values may make the gradient search technique ineffective at finding even "good" locally optimal solutions. If additional nodes are added to the network until the training algorithm discovers a suitable solution, the resulting network may be severely overdefined. Any training data can be correctly classified if the network is given sufficient degrees of freedom. However, such a network is unlikely to perform well on new data taken independently from the training data.
Simulated annealing has been used with some success at overcoming local optima, but the required execution time is high because, among other reasons, only one proposed solution can be considered at a time, making this an unsatisfactory approach to many problems. Better solutions are always kept and worse solutions are retained with a probability which is an exponential function of the degradation D and a "temperature" T which starts at a high level and becomes progressively lower, and may be expressed as exp(-D/T). A difficulty with this approach is that there is no reliable way to select a starting point of the "temperature" and its rate of decline. See, for example, the above-mentioned Rumelhart and McClelland text. An annealing approach to training a neural network, which favors changes in solutions that are in the direction of the most recent improvement, is taught in U.S. Pat. No. 4,933,871 to DeSieno. This annealing approach is also characterized by considering solutions (sets of weight values) one at a time and further characterized by always retaining the "best" solution as a starting point for change until a "better" solution is discovered. As a result, while permitting locally optimal solutions to be overcome, the process is slow to investigate a wide ranging variety of solutions and can easily be delayed in such local solutions.
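The acceptance rule exp(-D/T) can be sketched in Python as follows. This is a generic illustration of simulated annealing and not the DeSieno method. The starting temperature, the geometric cooling rate, the step size, and the one-dimensional quadratic error function are all hypothetical; as noted above, there is no reliable way to choose the temperature schedule.

```python
import math
import random


def accept(degradation, temperature, rng=random):
    """Keep better solutions always; keep worse ones with probability exp(-D/T)."""
    if degradation <= 0:
        return True
    return rng.random() < math.exp(-degradation / temperature)


def anneal(error, start, step=0.5, t_start=1.0, cooling=0.95,
           iterations=500, seed=0):
    """Consider one proposed solution at a time while the temperature declines."""
    rng = random.Random(seed)
    current, current_error = start, error(start)
    temperature = t_start
    for _ in range(iterations):
        candidate = current + rng.gauss(0.0, step)
        candidate_error = error(candidate)
        if accept(candidate_error - current_error, temperature, rng):
            current, current_error = candidate, candidate_error
        temperature *= cooling  # hypothetical geometric cooling schedule
    return current, current_error


# Hypothetical one-parameter error surface with its minimum at x = 2.
solution, solution_error = anneal(lambda x: (x - 2.0) ** 2, start=-5.0)
print(solution, solution_error)
```

Each iteration evaluates exactly one candidate, which reflects the execution-time limitation described above.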
A "genetic" algorithm for training a neural network, which is intended to provide a near global optimum solution, has been described in Montana and Davis, "Training Feedforward Neural Networks Using Genetic Algorithms", Eleventh International Joint Conference on Artificial Intelligence (1989). The algorithm creates new solutions (sets of weight values), normally coded as a string of bits or real numbers, by combining two parents, i.e. selecting bits from one or the other to produce progeny. The relative number of uses of particular parents to produce progeny is an exponential function of their relative accuracy or fitness in classifying training patterns. Thus, there is a rapid convergence on locally optimal solutions. However, because the better of two solutions is always preferred and poorer solutions are generally not retained, i.e. there is no probabilistic search for solutions, convergence on a locally optimal solution is possible, but a global convergence cannot be guaranteed without an additional probability of randomly flipping each individual bit. Also, since the solutions are typically coded with strings of bits, the strings typically contain thousands of bits. Combinations of parents are typically performed by selecting a single cross-over point. This creates large jumps over the response surface, thereby making it difficult to fine tune the solutions.
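The single cross-over point and the random bit flipping mentioned above can be sketched as follows. This is a generic illustration of the two operators on short bit strings, not a reproduction of the Montana and Davis algorithm; the function names, string length, and mutation rate are hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Take bits from parent_a before a random point and from parent_b after it."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    point = rng.randrange(1, len(parent_a))  # ensures both parents contribute
    return parent_a[:point] + parent_b[point:]


def flip_bits(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


rng = random.Random(1)
parent_a = [0] * 8
parent_b = [1] * 8
child = single_point_crossover(parent_a, parent_b, rng)
mutated = flip_bits(child, rate=0.05, rng=rng)
print(child, mutated)
```

Because every bit after the cross-over point changes parentage at once, a single combination can move the coded weight values a long way across the response surface.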
The search for an appropriate set of weights and bias terms for a neural network is a complex, combinatorial optimization problem. No single parameter can be optimized without regard to all other parameters. Evolutionary programming has been used to address other difficult combinatorial optimization problems such as the traveling salesman problem. See, for example, D. B. Fogel, "An Evolutionary Approach to the Traveling Salesman Problem", Biol. Cybern., 60, pgs. 139-144 (1988). Evolutionary programming approaches to solving a problem may be described as (1) taking a collection of solutions having some coding with a measurable worth, (2) perturbing the coding to obtain progeny in such a manner that the mean worth of the progeny is the same as that of the parent, e.g. perturbing the parent by adding values selected from a Gaussian distribution with a mean of zero, (3) comparing solutions and (4) probabilistically selecting which solutions are to be retained. The original evolutionary programming concept (see Fogel et al, Artificial Intelligence Through Simulated Evolution, John Wiley & Sons, 1966) focused on the problem of predicting any stationary or nonstationary time series with respect to an arbitrary payoff function, modeling an unknown transducer on the basis of input-output data, and optimally controlling an unknown system with respect to an arbitrary payoff function.
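Steps (1) through (4) can be sketched in Python as follows. The sketch rests on several assumptions that are not stated above: solutions are coded as lists of real numbers, worth is expressed as an error to be minimized, and the probabilistic selection is carried out as a stochastic tournament in which each solution is compared against a few randomly drawn rivals. The function name, the parameter values, and the quadratic error function are hypothetical.

```python
import random


def evolve(error, population, sigma=0.1, generations=100, rivals=3, seed=0):
    """Evolve a population of real-valued solutions; lower error is better."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        # (2) Each parent yields one progeny by zero-mean Gaussian perturbation.
        progeny = [[v + rng.gauss(0.0, sigma) for v in parent]
                   for parent in population]
        candidates = population + progeny
        scores = [error(c) for c in candidates]
        # (3) Compare each solution with a few randomly drawn rivals.
        wins = []
        for score in scores:
            drawn = [rng.randrange(len(candidates)) for _ in range(rivals)]
            wins.append(sum(1 for r in drawn if score <= scores[r]))
        # (4) Retain the solutions with the most wins (ties broken by error).
        #     Selection is probabilistic because the rivals are drawn at random.
        ranked = sorted(range(len(candidates)),
                        key=lambda i: (-wins[i], scores[i]))
        population = [candidates[i] for i in ranked[:size]]
    return population


def squared_norm(solution):
    """Hypothetical error function with its minimum at the origin."""
    return sum(v * v for v in solution)


# (1) A hypothetical starting collection of ten two-parameter solutions.
start_rng = random.Random(42)
initial = [[start_rng.uniform(-5.0, 5.0) for _ in range(2)] for _ in range(10)]
final = evolve(squared_norm, initial)
print(min(squared_norm(s) for s in initial), min(squared_norm(s) for s in final))
```

Unlike the annealing sketch, many solutions are carried forward at once, and a poorer solution can survive a generation if it happens to draw weak rivals.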
Natural evolution optimizes behavior through iterative mutation and selection within a class of coding structures. The evolutionary process is simulated in the following manner: an original population of "machines" (math logic functions arbitrarily chosen or given as "hints") are measured as to their individual ability to predict each next event in their "experience" with respect to whatever payoff function has been prescribed (e.g. squared error, absolute error, all-none, or another reasonable choice). Progeny are then created through random mutation of the parent machines. The progeny are scored on their predictive ability in a similar manner to their parents. Those "machines" which are most suitable for achieving the task at hand are probabilistically selected to become the new parents. An actual prediction is made when the predictive fit score demonstrates that a sufficient level of credibility has been achieved. The surviving machines generate a prediction, indicate the logic of this prediction, and become the progenitors for the next sequence of progeny, this in preparation for the next prediction. Thus, aspects of randomness are selectively incorporated into the surviving logics. The sequence of predictor machines demonstrates phyletic learning.