A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates to a distributed parallel processing network of the form of a back propagation network having an input layer, an intermediate (or hidden) layer, and an output layer, and in particular, to such a network in which the interconnection weights between the input and hidden layers and between the hidden and the output layers are determined by the steady state solutions of the set of stiff differential equations that define the relationships between these respective layers.
2. Description of the Prior Art
A parallel distributed processing network is a computational circuit which is implemented as a multi-layered arrangement of interconnected processing elements. In a typical arrangement of a parallel distributed processing system, known as the back propagation network, the network architecture includes at least three layers of processing elements: a first, or "input", layer; a second, intermediate or "hidden", layer; and a third, or "output", layer. Each processing element has an input and an output port, and operates on an input applied thereto to transform that input in accordance with a predetermined transfer function. The transfer function usually takes the form of a sigmoid function having a predetermined threshold associated therewith that is operable to produce an output bounded between zero and one.
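By way of illustration only, a transfer function of this kind may be sketched in Python as follows. The sketch assumes the logistic form 1/(1 + exp(-z)) and assumes that the threshold is added to the input before the function is applied; the name `sigmoid_transfer` and its parameters are hypothetical and are not taken from the disclosure.

```python
import math


def sigmoid_transfer(net_input: float, threshold: float = 0.0) -> float:
    """Logistic sigmoid of the input shifted by the element's threshold.

    Assumption: the threshold is added to the input (a sign convention
    chosen for this sketch). The result lies between zero and one.
    """
    z = net_input + threshold
    # Evaluate in a form that avoids overflow in math.exp for large |z|.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Usage: outputs rise monotonically with the input.
for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid_transfer(x))
```

With a zero threshold the function returns one half for a zero input; a nonzero threshold merely shifts the point at which that midpoint occurs.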
The input port of each processing element in the input layer is connectible to at least one input signal while the output port of each of the processing elements in the input layer is connected to at least one, and typically to a plurality, of the processing elements in the intermediate layer. Similarly, the output port of each of the processing elements in the intermediate layer is connectible to at least one, and typically to a plurality, of the processing elements in the output layer.
The connection lines between the elements in the input layer and those in the intermediate layer and the connection lines between the elements in the intermediate layer and those in the output layer each have a "connection weight" associated therewith. Generally speaking, with respect to the elements in the hidden and the output layers, the magnitude of any signal at the input of a processing element in these layers is the sum, over all of the inputs to that element, of the product of the activation strength of each input signal and the connection weight of the corresponding connection line on which it is carried, plus the threshold of the processing element.
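The weighted summation just described may be sketched, under stated assumptions, as the following Python fragment. It assumes a logistic transfer function, a weight layout in which `weights[j][i]` is the connection weight from input i to element j, and fully connected layers; the function names `layer_output` and `forward` are hypothetical.

```python
import math


def logistic(z: float) -> float:
    # Assumed transfer function; valid for moderate values of z.
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weights, thresholds):
    """Outputs of one layer of processing elements.

    For element j: net_j = sum_i weights[j][i] * inputs[i] + thresholds[j].
    """
    outputs = []
    for weight_row, threshold in zip(weights, thresholds):
        net = sum(w * x for w, x in zip(weight_row, inputs)) + threshold
        outputs.append(logistic(net))
    return outputs


def forward(inputs, w_hidden, t_hidden, w_output, t_output):
    """Propagate input activations through the hidden and output layers."""
    hidden = layer_output(inputs, w_hidden, t_hidden)
    return layer_output(hidden, w_output, t_output)


# Usage: two inputs, two hidden elements, one output element.
print(forward([1.0, 2.0],
              [[0.5, -0.25], [0.1, 0.2]], [0.0, -0.5],
              [[1.0, -1.0]], [0.0]))
```

Both hidden elements in the usage example receive a net input of zero, so each produces an output of one half, and the output element, whose two weights cancel, likewise produces one half.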
A network such as that described is trained using a learning rule known as the "Generalized Delta Rule" described in Rumelhart, Hinton and Williams, "Learning Internal Representations by Error Propagation", Parallel Distributed Processing, Volume 1, Foundations, Rumelhart and McClelland, editors, MIT Press, Cambridge, Mass. (1986). This rule is a generalization of the "Delta Rule" given by Widrow, "Generalization and Information Storage in Networks of ADALINE Neurons", Self Organizing Systems, Yovits, editor, Spartan Books, New York (1962).
Using such a training rule, the network is presented sequentially with a number of "training patterns" relating the activation strengths of the input signals at the input elements to the corresponding values at the output ports of the elements in the output layer. All the connection weights are then varied together to minimize the error in the network's predicted outputs. To change the weights in accordance with the method of steepest descent, the gradient of the error with respect to the weights is found and the weights are changed to move the error toward smaller values. The weights are changed iteratively, in accordance with the following relationship:

$$\Delta W^{n} = \eta \, \Delta W[gd] + \alpha \, \Delta W^{n-1}$$

where $\Delta W[gd]$ represents the weight changes for gradient or steepest descent,
$\Delta W^{n}$ represents the weight change applied at iteration number n,
$\Delta W^{n-1}$ represents the weight change from the previous iteration,
$\eta$ (eta) represents the "learning rate" of the network, and
$\alpha$ (alpha) represents the "momentum" term.
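As a non-authoritative sketch of this iterative rule, the Python fragment below assumes that each weight change is the learning rate times the steepest descent direction (taken here to be the negative gradient of the error) plus the momentum term times the previous weight change. The toy error surface, the target values, and the names `momentum_update`, `train`, and `toy_gradient` are hypothetical and serve only to exercise the update.

```python
def momentum_update(gradient, previous_delta, eta, alpha):
    """Weight change for one iteration.

    Assumption: the steepest descent direction is the negative gradient,
    so delta_n = eta * (-gradient) + alpha * delta_(n-1).
    """
    return [eta * (-g) + alpha * d for g, d in zip(gradient, previous_delta)]


def train(weights, gradient_fn, eta=0.1, alpha=0.5, iterations=100):
    """Apply the momentum update repeatedly, starting from zero change."""
    delta = [0.0] * len(weights)
    for _ in range(iterations):
        delta = momentum_update(gradient_fn(weights), delta, eta, alpha)
        weights = [w + d for w, d in zip(weights, delta)]
    return weights


# Hypothetical least-squares error E(w) = 0.5 * sum((w_i - target_i)**2),
# whose gradient with respect to w_i is (w_i - target_i).
target = [1.0, -2.0]


def toy_gradient(weights):
    return [w - t for w, t in zip(weights, target)]


trained = train([0.0, 0.0], toy_gradient)
print(trained)
```

On this simple quadratic surface the weights settle at the target values; a real network would instead obtain the gradient by propagating the output error back through the layers.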
Training occurs in discrete steps or iterations, each requiring one presentation of all the training patterns to the network. To assure convergence of the iterative algorithm, the learning rate eta must be chosen small enough that oscillations and instabilities do not occur. The momentum term alpha is provided to increase the convergence rate and to help avoid the problem of being trapped in local minima of the least-squares surface.
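The sensitivity to the learning rate can be seen in a minimal sketch, assuming plain steepest descent without momentum on a hypothetical one-weight error E(w) = 0.5 * c * w**2. Each step multiplies the weight by (1 - eta * c), so the iteration shrinks toward zero only when eta is below 2 / c; the function name `descend` and the chosen constants are illustrative only.

```python
def descend(eta, curvature=10.0, w0=1.0, iterations=20):
    """Steepest descent on E(w) = 0.5 * curvature * w**2.

    Returns the weight after every iteration, starting with w0.
    """
    w = w0
    history = [w]
    for _ in range(iterations):
        gradient = curvature * w      # dE/dw
        w = w - eta * gradient        # one finite steepest descent step
        history.append(w)
    return history


small = descend(eta=0.05)   # factor per step: 1 - 0.5 = 0.5, converges
large = descend(eta=0.25)   # factor per step: 1 - 2.5 = -1.5, diverges
print(small[-1], large[-1])
```

With the smaller learning rate the weight halves at every step, whereas with the larger one it changes sign and grows at every step, which is the oscillatory instability referred to above.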
It is well known in the art that although the training rule gives a finite-step change in the weights that corresponds approximately to the direction of gradient or steepest descent, it does not always converge. Perhaps more problematic, the training is initially fairly rapid, but eventually training becomes extremely slow, especially as the network becomes more accurate in predicting the outputs. Training to great accuracy can take thousands or even millions of presentations of the training set. The number of training presentations required grows rapidly with the size of the network, usually more rapidly than the square of the number of connections.
During the IEEE International Conference on Neural Networks, San Diego, Calif., on July 24-27, 1988, a number of papers were presented on techniques to increase the training speed of a back propagation network. These papers are available in the proceedings of the conference as published for the IEEE by SOS Printing of San Diego, Calif. The paper by Kung and Howard, at pages 363 to 370, describes a method to predict the optimal number of hidden layer elements needed. By using only the minimum number, the training rate can be optimized. In the paper by Kollias and Anastassiou, at pages 383 to 390, the Marquardt-Levenberg least squares optimization technique is used to improve convergence. The paper by Gelband and Tse, at pages 417 to 424, describes a method of generalizing the threshold logic function to a "selective" function, thereby effectively reducing the multilayer network to a single layer and thus increasing the learning rate. The paper of Hush and Salas, at pages 441 to 447, describes a method called the "Gradient Reuse Algorithm" which speeds the training rate.
Accordingly, in view of the foregoing it is believed to be advantageous to provide a back propagation network which is more expeditiously trained.