A multilayer perceptron trained by backpropagation is a widely used model of artificial and biological neural information processing. FIG. 1 is a diagrammatic representation of such a conventional multilayer perceptron (MLP). It includes an input layer 10 with I neurons 11 respectively having outputs x_1, . . . , x_i, . . . , x_I. A neuron is a processing unit which provides an output according to an activation function in response to an input. The activation function defines a relationship between the net input to the neuron and the output of the neuron. Input layer neurons are commonly buffers, for which the activation function is the identity function, although other functions are also possible. In some cases the input layer neurons are a form of transducer which transforms some physical entity, such as light, into a signal which may be processed by the network, such as an electrical signal. A hidden layer 12 has J neurons 13 with outputs y_1, . . . , y_j, . . . , y_J. Each hidden layer neuron j is connected to an input layer neuron i by a connection, commonly called a synapse 16, which has a connection weight w_ji associated with it. An output layer 14 of K neurons 15 provides outputs z_1, . . . , z_k, . . . , z_K. Each output layer neuron k is connected to a hidden layer neuron j by a synapse 17 which has a connection weight w_kj.
Each neuron is assumed to be bipolar (identified by double circles in FIG. 1), meaning that the connection weight of any posterior synapse (i.e., a synapse connecting an output of the neuron to an input of another neuron) may be either positive or negative. (An anterior synapse of a neuron is connected between an input of the neuron and an output of another neuron.) The activation function f for a neuron is nonlinear and is typically sigmoidal, with saturation ranges at 0 and 1.
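As a concrete illustration, a sigmoidal activation function of this kind can be sketched as the logistic function below. This is a minimal sketch for illustration only; the function names are my own, not from the document.

```python
import math

def sigmoid(net):
    """Logistic activation f(net): approaches 0 for large negative
    net inputs and 1 for large positive net inputs (the two
    saturation ranges)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_slope(net):
    """Slope f'(net) = f(net) * (1 - f(net)); greatest at net = 0
    and near zero in both saturation ranges."""
    s = sigmoid(net)
    return s * (1.0 - s)
```

The slope f' computed here is the quantity that appears in the weight-update equations discussed below.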
Each synapse 16 between an input layer neuron 11 and a hidden layer neuron 13 connects the output of the input layer neuron to an input of the hidden layer neuron. The output of the input layer neuron is multiplied by the connection weight associated with the synapse 16. The hidden layer neuron has multiple inputs which are summed together to provide a net input; the output of the neuron is then determined by applying the activation function to this net input.
A neuron may also have a bias input, which is a constant offset input and which may also have a corresponding weight. If the bias inputs are b_zk and b_yj for the output and hidden layers, respectively, then the net inputs ω_k and ψ_j to an output layer neuron and a hidden layer neuron, respectively, are:

ω_k = Σ_j w_kj y_j + b_zk    (1)

ψ_j = Σ_i w_ji x_i + b_yj    (2)
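Equations 1 and 2 can be sketched directly in code. This is an illustrative sketch only (the function names and the list-of-lists weight layout are my own assumptions, not from the document):

```python
def hidden_net_inputs(x, w_ji, b_y):
    """Equation 2: psi_j = sum_i w_ji * x_i + b_yj, for each
    hidden layer neuron j.  Weights w_ji are indexed [j][i]."""
    return [sum(w_ji[j][i] * x[i] for i in range(len(x))) + b_y[j]
            for j in range(len(b_y))]

def output_net_inputs(y, w_kj, b_z):
    """Equation 1: omega_k = sum_j w_kj * y_j + b_zk, for each
    output layer neuron k.  Weights w_kj are indexed [k][j]."""
    return [sum(w_kj[k][j] * y[j] for j in range(len(y))) + b_z[k]
            for k in range(len(b_z))]
```

Applying the activation function f to each of these net inputs then yields the layer outputs y_j and z_k.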
To train a neural network to process information, a number of pairs of inputs x_1, . . . , x_i, . . . , x_I and corresponding expected outputs (targets) z°_1, . . . , z°_k, . . . , z°_K, one target corresponding to each output neuron, are prepared. The set of pairs of known inputs and target outputs constitutes the training data. The actual output z_k of an output neuron k as a result of an applied input is compared to the target output z°_k corresponding to the applied input by a comparator C_k to obtain an output error e_k. The output error is generally the difference between the target output and the actual output. Thus, the error e_k for an output layer neuron k is simply:

e_k = z°_k - z_k    (3)
The error e_j for a hidden layer neuron j is obtained by backpropagating the output errors along the corresponding feedforward paths and is:

e_j = Σ_k w_kj e_k f'(ω_k)    (4)
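Equation 4 can be sketched as follows, where `slope` stands for the activation-function derivative f'. The function name and weight layout are illustrative assumptions:

```python
def hidden_errors(e, omega, w_kj, slope):
    """Equation 4: e_j = sum_k w_kj * e_k * f'(omega_k) -- the
    output errors, each weighted by the slope at the output neuron
    and by the synapse weight, summed over all K output neurons."""
    K = len(e)
    J = len(w_kj[0])
    return [sum(w_kj[k][j] * e[k] * slope(omega[k]) for k in range(K))
            for j in range(J)]
```

Note that each hidden layer error depends on every output error and on the downstream weights w_kj, which is the source of the scaling and plausibility problems discussed below.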
Based on the errors obtained, the connection weights in the network are adapted so as to minimize the error e_k for all output neurons. In a conventional three-layer perceptron (FIG. 1), connection weights are typically adapted by steepest descent using the generalized delta rule (a well-known method, described, for example, in Rumelhart, D. E., Hinton, G. E., and Williams, R. J., "Learning internal representations by error propagation," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (eds. Rumelhart, D. E., McClelland, J. L. and the PDP Research Group), pp. 318-362, MIT Press, Cambridge, Mass., 1986).
For the output layer neurons, the connection weight w_kj for the synapse connecting the jth hidden layer neuron to the kth output layer neuron is adjusted by adding a value Δw_kj defined by the following:

Δw_kj = η_z e_k y_j f'(ω_k)    (5)
where f' is the slope of the activation function, ω_k is the net input to the output layer neuron as defined above, e_k is the error as defined above, η_z is the learning rate, and y_j is the output of the jth hidden layer neuron.
For the hidden layer neurons, the connection weight w_ji for the synapse connecting the ith input layer neuron to the jth hidden layer neuron is adjusted by adding a value Δw_ji defined by the following:

Δw_ji = η_y e_j x_i f'(ψ_j)    (6)
where ψ_j is the net input to the hidden layer neuron as defined above, η_y is the learning rate, e_j is the error as defined above, and x_i is the output of the ith input layer neuron.
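The two update rules, Equations 5 and 6, are simple products of four factors. A minimal sketch (the function names are my own):

```python
def delta_w_out(eta_z, e_k, y_j, slope_omega_k):
    """Equation 5: weight change for synapse w_kj -- the product
    of the learning rate, the output error, the hidden layer
    output, and the activation slope f'(omega_k)."""
    return eta_z * e_k * y_j * slope_omega_k

def delta_w_hid(eta_y, e_j, x_i, slope_psi_j):
    """Equation 6: likewise for synapse w_ji, using the
    backpropagated error e_j and the input layer output x_i."""
    return eta_y * e_j * x_i * slope_psi_j
```

In both rules the change is proportional to the slope f' at the neuron's net input, which is why adaptation stalls when neurons saturate, as discussed below.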
As is evident from Equations 6 and 4, adaptation of the hidden layer neurons is based on a weighted sum of all output errors, because each hidden layer neuron drives many output neurons. Thus, backpropagation does not scale well for large numbers of output neurons, and may fail or become prohibitively expensive for large problems. Further, the error signal is highly non-specific and contributes to crosstalk interference; that is, the backpropagated errors from different output layer neurons often happen to nullify one another, making it difficult for neurons to adapt optimally.
Another problem caused by the requirement that each hidden layer neuron drive many output layer neurons is that the probability of convergence to a local minimum multiplies as the number of output layer neurons increases. (A local minimum is reached where the connection weights for synapses between the hidden layer neurons and output layer neurons cease to adapt over one training epoch even when there are nonzero output errors.) Let p_k be the probability that the kth output layer neuron becomes trapped in a local minimum during any training session and q_k = 1 - p_k be the corresponding probability that a global minimum can be reached. For a multilayer network with K output layer neurons, the probability that at least one of the output layer neurons is trapped in a local minimum is 1 - Π_(k=1..K) q_k. For large K this probability tends to unity even if a global minimum does exist.
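This scaling behavior is easy to verify numerically. A sketch of the probability expression (the function name is my own):

```python
def prob_any_trapped(p):
    """Probability that at least one of the K output layer neurons
    is trapped in a local minimum: 1 - prod_k q_k, where
    q_k = 1 - p_k and p is the list [p_1, ..., p_K]."""
    q_product = 1.0
    for p_k in p:
        q_product *= (1.0 - p_k)
    return 1.0 - q_product
```

Even with a per-neuron trapping probability as small as p_k = 0.01, a network with K = 500 output neurons is trapped with probability above 0.99, illustrating how the probability tends to unity for large K.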
One way to avoid or to overcome the problem of local minima is to repeat the training process many times with randomly assigned initial values for the connection weights. Another, more elegant, approach is to employ a method called simulated annealing, which introduces some randomness into the adaptation, along with the gradient descent. Both approaches are computationally costly, requiring many repetitions, and in general do not guarantee global convergence.
Even in the absence of local minima, convergence of training using backpropagation may tend to be slow because of stationary points, i.e., conditions in which all neurons operate in a range where the slope f' of the activation function f is close to zero. In this event, adaptation is very slow even in the presence of large output errors.
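The stalling effect follows directly from Equation 5: the update magnitude is proportional to f', which collapses in the saturation ranges of a sigmoid. A self-contained sketch, assuming the logistic sigmoid (function names are my own):

```python
import math

def sigmoid_slope(net):
    """f'(net) for the logistic sigmoid: f(net) * (1 - f(net))."""
    s = 1.0 / (1.0 + math.exp(-net))
    return s * (1.0 - s)

def update_magnitude(eta, error, y, net):
    """|delta w| from Equation 5; it collapses when the neuron
    operates in a saturation range, regardless of the error."""
    return abs(eta * error * y * sigmoid_slope(net))
```

With the same learning rate, error, and hidden output, a neuron with net input 10 (deep in saturation) adapts thousands of times more slowly than one with net input 0.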
Another problem with the backpropagation method, from a biological standpoint, is that it does not represent a plausible neural learning method because it cannot be implemented by realistic neurons. This follows from Equation 4, which requires that the output errors be backpropagated (in retrograde) to the hidden layer neurons, each output error being weighted by the connection weights w_kj along the same connection paths, thus requiring the hidden layer neurons to know the synaptic transmissibility downstream--a biologically implausible requirement. Furthermore, the variations in the backpropagated errors to different hidden layer neurons imply the existence of an extensive recurrent feedback structure that is absent in most nervous systems.
An electronic neuron faces the same physical limitations as does a biological neuron; for example, the connection weights are inaccessible downstream in the feedforward path. Consequently, the backpropagation method cannot be readily implemented, especially in a large-scale network, by electronic means, e.g., using analog or digital very large scale integrated (VLSI) circuits. Although some parallel feedback paths theoretically can be added to relay the connection weights, the resulting electronic architecture is often impractical for large networks. Thus, it is more common to implement the backpropagation method numerically on digital computers.
There are other learning rules which may be used and which are biologically more realistic than backpropagation. For example, there is the adaptive reward-penalty (A_r-p) learning rule of Barto and Jordan (Barto, A. G. and Jordan, M. I., "Gradient following without backpropagation in layered networks," Proc. IEEE First Annual International Conference on Neural Networks, Vol. II: 629-636, San Diego, Calif., 1987). With this method a single reinforcement signal is broadcast to all neurons, but the error is the average of all output errors and is therefore nonspecific for the purpose of training each output layer neuron and hidden layer neuron. In addition, the A_r-p method generally converges more slowly than the backpropagation method. This and other methods may also require highly specialized neural connections for their implementation.
Training a multilayer network with backpropagation and other methods is often a cumbersome task that may not always be practicable. Although one of the most celebrated advantages of neural networks is the parallel distributed processing capability which enables fast computation during performance, this parallelism has yet to be used to enhance training.