1. Field of the Invention
The present invention relates to a learning method and apparatus for learning weights of a neural network by using learning patterns, to be used for pattern recognition, data compression and the like. The present invention relates also to a simulator using a neural network.
2. Description of the Related Art
A neural network is a simplified network modeled after the human brain, in which a great number of nerve cells (neurons) are coupled by synapses that pass signals uni-directionally. Signals are transmitted to neurons via synapses. Various kinds of information processing can be performed by properly adjusting the resistances of the synapses, i.e., the synaptic weights. Each neuron receives outputs from preceding neurons via weighted synapses, transforms the sum of the received outputs with a non-linear response function, and outputs the result to succeeding neurons.
The structure of a neural network includes two typical types, i.e., a totally interconnected type and a multi-layer type. The former type is suited for the solution of optimization problems, and the latter is suited for the solution of recognition problems such as pattern recognition problems.
In using a multi-layer type neural network for pattern recognition or data compression, it is necessary to provide learning patterns composed of pairs of input and output patterns and to learn the weights of the network before using it, so that the network can output correct signals. For this purpose, an optimization method called back-propagation is generally used. This method is discussed, for example, in "Parallel Distributed Processing", pp. 322 to 331, D. E. Rumelhart, 1988.
The characteristic features of this method are that a continuous non-linear function such as a sigmoid function is used as the non-linear neuron response function, and that weights are renewed by using the following expression (1) to reduce an output error of the network:

$$\frac{dw}{dt} = -\frac{\partial E}{\partial w} \qquad (1)$$
where
w: weight,
E: output error of network, and
t: time.

The remaining symbols used in the expressions of this section are defined as follows.

For expression (2):
$net_{pm}^{k}$: net input value to m-th neuron at k-th layer with input pattern P,
$W_{mj}^{k}$: coupling weight between m-th neuron at k-th layer and j-th neuron at k-1-th layer,
$n_k$: number of neurons at k-th layer,
$n_{k-1}$: number of neurons at k-1-th layer, and
f: sigmoid function.

For expression (4):
$O_{pj}^{l}$: output value of j-th neuron at output layer (l-th layer) with input pattern P,
$t_{pj}$: desired output signal at j-th neuron at output layer (l-th layer) with input pattern P, and
$n_l$: number of neurons at output layer (l-th layer).

For expression (6):
$\alpha$: moment coefficient, and
i: number of learning times.

For expressions (9) and (10):
$w_{mj}^{k}$: coupling weight between m-th neuron at k-th layer and j-th neuron at k-1-th layer,
$n_k$: number of neurons at k-th layer, and
l: number of neuron layers.
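Expression (1) describes a continuous-time descent on the error. A minimal sketch of how it might be discretized, assuming a forward-Euler step of size `dt` and a hypothetical one-weight error E(w) = (w - 3)^2 that is chosen only for illustration and does not come from the source:

```python
def euler_descent(grad, w0, dt=0.1, steps=100):
    """Integrate dw/dt = -dE/dw with forward-Euler steps (illustrative only)."""
    w = w0
    for _ in range(steps):
        w -= dt * grad(w)  # w(t + dt) = w(t) - dt * dE/dw
    return w


# Hypothetical error E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
w_final = euler_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With these assumed values the weight approaches the minimizer w = 3; a step size that is too large would make the same loop diverge, which is one face of the parameter dependency discussed in the text.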
It has been pointed out that the back-propagation learning method takes a long time to learn. The method depends strongly on its parameters, so the learning time is sometimes ten times to several tens of thousands of times longer than when optimum parameters are used.
As a means for solving this problem, non-linear optimization methods have been used for learning. One such method widely reported in the literature is the conjugate gradient method, which requires relatively little calculation because it uses only first-order derivatives, and which provides high-speed learning.
Prior to describing the conjugate gradient method, the procedure of back-propagation, on which the conjugate gradient method relies, will first be described with reference to FIG. 1.
First, learning data 10 shown in FIG. 1 will be described. The learning data 10 is composed of input signals and desired output signals such as shown in FIGS. 2A and 2B. The number of input signals is (the number of input layer neurons) multiplied by (the number of learning patterns), and the number of desired output signals is (the number of output layer neurons) multiplied by (the number of learning patterns). Each input signal pattern is related to each corresponding desired output signal pattern. An input signal pattern 1 for example is inputted to the input layer, and an error of an output pattern at the output layer relative to a desired output signal pattern is calculated and used for learning.
At step 35, the pattern of input signals of the learning data 10 is inputted to input layer neurons.
At step 36, the input signals are sequentially propagated toward the output neurons, and the final output values of the output layer neurons are obtained by using the following expression (2): ##EQU1## where $O_{pm}^{k}$ is the output value of the m-th neuron at the k-th layer with input pattern P.
As seen from the expression (2), the net input to each neuron is the inner product of the weights of all synapses coupled to the neuron and the outputs supplied to those synapses. In the following, the set of weights of all synapses coupled to a neuron is called a weight vector, and the magnitude of the weight vector is called a norm. The norm is defined by the following expression (3): ##EQU2## The expression (3) gives the norm of the i-th neuron at the k-th layer.
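Expressions (2) and (3) are not reproduced in this text. The sketch below shows one form consistent with the description: the net input is taken as the inner product of a neuron's weight vector with the outputs of the preceding layer, the response function is assumed to be the logistic sigmoid, and the norm is assumed to be Euclidean. The nested-list layout `weights[k][m][j]`, the function names, and the absence of bias terms are illustrative choices, not taken from the source.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, one common choice for the response function f."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, inputs):
    """Propagate one input pattern through the layers.

    weights[k][m][j] couples neuron j of the preceding layer to neuron m
    of the next layer. Returns the outputs of every layer, input first.
    """
    outputs = [list(inputs)]
    for layer in weights:
        prev = outputs[-1]
        # Net input: inner product of each weight vector with the inputs.
        nets = [sum(w * o for w, o in zip(row, prev)) for row in layer]
        outputs.append([sigmoid(net) for net in nets])
    return outputs


def weight_norm(row):
    """Euclidean norm of one neuron's weight vector (assumed form of (3))."""
    return math.sqrt(sum(w * w for w in row))
```

For example, `forward([[[0.0, 0.0]]], [1.0, 2.0])[-1]` evaluates a single neuron whose net input is zero, and `weight_norm([3.0, 4.0])` measures the length of a two-weight vector.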
At step 37, an output error Ep of the outputs of the output layer neurons calculated at the step 36, relative to the desired output signals of the learning data 10, is calculated by using the following expression (4): ##EQU3## where l is the number of layers of the neural network.
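Expression (4) is not reproduced in this text. A common sum-of-squares form that is consistent with the symbols $O_{pj}^{l}$, $t_{pj}$ and $n_l$ defined for it would be the following; the factor 1/2 is an assumption made for convenience when differentiating and is not taken from the source:

```latex
E_p = \frac{1}{2} \sum_{j=1}^{n_l} \left( t_{pj} - O_{pj}^{l} \right)^{2}
```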
At step 38, for the input pattern P, the steepest descent gradient (hereinafter simply called a gradient) relative to the neuron weights is calculated using the following expression (5): ##EQU4## where $f'(net_{pm}^{k})$ is the derivative of the sigmoid function.
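Expression (5) is not reproduced in this text. As a rough illustration of what step 38 computes, the sketch below derives the per-pattern gradient by the usual chain rule, assuming a logistic sigmoid (so that f'(net) = O(1 - O)), a sum-of-squares error with a factor 1/2, and no bias terms. All function names and the `weights[k][m][j]` layout are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, inputs):
    """Return the outputs of every layer, input layer first."""
    outputs = [list(inputs)]
    for layer in weights:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev)))
                        for row in layer])
    return outputs


def pattern_error(weights, inputs, targets):
    """Assumed sum-of-squares error E_p for one pattern."""
    out = forward(weights, inputs)[-1]
    return 0.5 * sum((t - o) ** 2 for o, t in zip(out, targets))


def backprop_gradient(weights, inputs, targets):
    """Return dE_p/dw with the same nesting as `weights`."""
    outputs = forward(weights, inputs)
    # Output-layer sensitivities dE_p/dnet, using f'(net) = O * (1 - O).
    deltas = [(o - t) * o * (1.0 - o) for o, t in zip(outputs[-1], targets)]
    grads = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        prev = outputs[k]
        grads[k] = [[d * o for o in prev] for d in deltas]
        if k > 0:
            # Propagate the sensitivities back through the weights of layer k.
            deltas = [
                sum(weights[k][m][j] * deltas[m] for m in range(len(deltas)))
                * prev[j] * (1.0 - prev[j])
                for j in range(len(prev))
            ]
    return grads
```

A finite-difference comparison against `pattern_error` is a simple way to check such a gradient routine before using it for learning.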
At step 39, a new (present) weight correction quantity $\Delta w_{mj}^{k}(i)$ is obtained from the calculated gradient and the previous correction quantity $\Delta w_{mj}^{k}(i-1)$, using the following expression (6): ##EQU5## where $\eta$ is the learning coefficient.
The second term at the right side of the expression (6) is called a moment term which is empirically added to the expression (6) to speed up learning.
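Expression (6) is not reproduced in this text. A minimal sketch of a correction with a moment term, assuming the commonly used form in which the new correction is the negative gradient scaled by the learning coefficient plus the previous correction scaled by the moment coefficient; the function name and default values are illustrative only:

```python
def momentum_step(grad, prev_delta, eta=0.5, alpha=0.9):
    """Return the new correction for one weight.

    First term: descent along the gradient, scaled by eta.
    Second term: the moment term, alpha times the previous correction.
    """
    return -eta * grad + alpha * prev_delta


# Example with assumed values: gradient 2.0, previous correction 1.0.
delta = momentum_step(2.0, 1.0, eta=0.5, alpha=0.5)
```

When successive gradients point the same way, the second term accumulates and lengthens the step, which is the empirical speed-up the text refers to.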
At step 40, the processes from steps 35 to 39 are repeated for all input patterns.
At step 41, the step 40 is repeated until the output error Ep falls below a predetermined upper error limit used as the criterion for completion of learning.
The principle of learning through back-propagation has been described above.
Next, the conjugate gradient method will be described. The conjugate gradient method is a hill-climbing method based on line searches, under the condition that the error surface in the weight space can be locally approximated by a second-order surface.
A conjugate direction will first be explained with reference to FIG. 3A, which shows ellipses each representing a contour line $f(x) = x^{T} Q x = d$ on the error surface of an objective function in the weight space expressed by a two-dimensional, second-order function. A point at which an arbitrary line I meets a contour line $f(x) = c_1$ is indicated by a. Similarly, a point at which a line L parallel with the line I meets a contour line $f(x) = c_2$ is indicated by b. The minimum point m of the objective function lies somewhere on the line J passing through the points a and b. This follows from the nature of the ellipses representing contour lines on the error surface expressed by the second-order objective function. The lines I and J are said to be in conjugate directions. It has been proved that the minimum point of an n-dimensional, second-order function can be located when line search is performed n times from the initial point in conjugate directions.
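The geometric statement above can be checked numerically. The sketch below assumes that a and b are the points where the parallel lines touch their contours, i.e. the minimizers of f along each line, and uses the standard algebraic definition that directions u and v are conjugate with respect to Q when $u^{T} Q v = 0$. The matrix Q, the direction u and the starting points are arbitrary illustrative values.

```python
def line_minimizer(Q, x0, u):
    """Minimize f(x) = x^T Q x along x0 + s*u for a symmetric 2x2 Q."""
    Qu = [Q[0][0] * u[0] + Q[0][1] * u[1], Q[1][0] * u[0] + Q[1][1] * u[1]]
    # f(x0 + s*u) is quadratic in s; its derivative vanishes at this s.
    s = -(x0[0] * Qu[0] + x0[1] * Qu[1]) / (u[0] * Qu[0] + u[1] * Qu[1])
    return [x0[0] + s * u[0], x0[1] + s * u[1]]


Q = [[2.0, 1.0], [1.0, 3.0]]              # assumed positive definite matrix
u = [1.0, 0.0]                            # common direction of lines I and L
a = line_minimizer(Q, [1.0, 2.0], u)      # point a on line I
b = line_minimizer(Q, [3.0, -1.0], u)     # point b on the parallel line L
v = [b[0] - a[0], b[1] - a[1]]            # direction of line J through a and b

# Conjugacy check: u^T Q v should vanish.
conjugacy = u[0] * (Q[0][0] * v[0] + Q[0][1] * v[1]) \
    + u[1] * (Q[1][0] * v[0] + Q[1][1] * v[1])
# Collinearity check: the minimum (the origin) lies on line J if a x b = 0.
cross = a[0] * b[1] - a[1] * b[0]
```

Both `conjugacy` and `cross` come out as zero for these values, matching the claim that the minimum point lies on the line J and that I and J are conjugate.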
The conjugate gradient method is a method of searching for a minimum point of the objective function while generating a conjugate vector (direction) from the gradient direction of the objective function. From the previous conjugate direction $\mathbf{d}(i-1)$ obtained in line searching and the gradient direction $-f_x(x(i))$ of the objective function, the next conjugate vector (direction) $\mathbf{d}(i)$ for line searching is obtained by using the following expressions (7) and (8):

$$\mathbf{d}(i) = -f_x(x(i)) + \beta(i) \cdot \mathbf{d}(i-1) \qquad (7)$$

##EQU6##
The search starts in the steepest descent direction from the search start point, which is the initial starting point of the line search. This procedure is discussed, for example, in "Theory and Computational Methods in Optimization", Hideaki Kanoh, pp. 77 to 85, 1987, Korona Publishing Co., Ltd.
The steepest descent direction (gradient) is a direction perpendicular to a tangent line to a contour line at a search start point in a weight space, such as shown in FIG. 4.
The above-described minimum point search method can be used not only for a second-order function, but also for other general functions. In the latter case, searching n times no longer ensures convergence to a minimum point, and so further searches are required. However, if searches are performed more than n times, improper search directions occur because a second-order function is assumed in generating the conjugate direction, resulting in poor convergence. It is therefore necessary to resume searching in the steepest descent direction after every n calculations.
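The procedure of expression (7), including the restart in the steepest descent direction, can be sketched for a second-order function as follows. Expression (8) is not reproduced in this text, so the ratio used here is the Fletcher-Reeves formula, which is an assumption and may differ from the formula in the source. The exact line search shown is valid only for the quadratic $f(x) = x^{T} Q x$ with symmetric positive definite Q; all names are illustrative.

```python
def matvec(Q, x):
    return [sum(q * xi for q, xi in zip(row, x)) for row in Q]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient_quadratic(Q, x, iterations):
    """Minimize f(x) = x^T Q x (Q symmetric positive definite)."""
    n = len(x)
    d_prev, g_prev = None, None
    for i in range(iterations):
        g = [2.0 * v for v in matvec(Q, x)]            # gradient f_x(x(i))
        if dot(g, g) == 0.0:
            break                                      # already at the minimum
        if i % n == 0:
            beta = 0.0                                 # restart: steepest descent
        else:
            beta = dot(g, g) / dot(g_prev, g_prev)     # Fletcher-Reeves (assumed)
        d = [-gi + beta * di for gi, di in zip(g, d_prev or [0.0] * n)]
        # Exact line search for this quadratic: step = -(g.d) / (2 d^T Q d).
        step = -dot(g, d) / (2.0 * dot(d, matvec(Q, d)))
        x = [xi + step * di for xi, di in zip(x, d)]
        d_prev, g_prev = d, g
    return x


x_min = conjugate_gradient_quadratic([[2.0, 1.0], [1.0, 3.0]], [1.0, 2.0], 2)
```

For this two-dimensional example, two line searches bring `x_min` to the minimum at the origin up to rounding error, in line with the n-step property stated earlier.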
If such an approach is applied to learning for a neural network, e.g., a multi-layer (l-layer) type neural network, generation of the conjugate direction d is given by the following expressions (9) and (10): ##EQU7## where $d_{mj}^{k}$ is the coupling weight component, in the conjugate direction, between the m-th neuron at the k-th layer and the j-th neuron at the k-1-th layer.
The error E is a sum of errors Ep for all learning patterns P as given by the following expression (11), different from the case of the expression (4): ##EQU8##
The learning algorithm for a neural network using a conventional conjugate gradient method will be described with reference to FIG. 5.
At step 55 shown in FIG. 5, initial values are assigned to the synaptic weights using a random number in order to start learning.
At step 56, an error gradient at a search start point in a weight space (coordinate system) is obtained from the following expression (12): ##EQU9##
The details of the step 56 are shown in FIG. 6.
At step 70 in FIG. 6, an input signal pattern of the learning data 10 is inputted to input layer neurons.
At step 71, the input signal pattern is propagated toward neurons on the output layer side, to obtain final outputs of the output layer neurons, by using the expression (2).
At step 72, an output error Ep is calculated using the expression (4), from the desired outputs of the learning data 10 and the outputs of the output layer neurons calculated by the expression (2).
At step 73, for the input learning pattern P, a gradient relative to the neuron weights is calculated using expression (5).
At step 74, for the respective input learning patterns P, the gradients of neuron weights calculated at the step 73 are summed.
At step 75, the processes at the steps 70 to 74 are repeated for all learning patterns. In this manner, there is obtained a sum of all gradients of neuron weights for all learning patterns.
At the step 56 described above, the gradient given by the expression (12) can be obtained.
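Steps 70 to 75 amount to accumulating the per-pattern gradients over the whole learning set. A minimal sketch, assuming the weights are held in a flat list and that `pattern_gradient` is a hypothetical callable returning the gradient of Ep for one pattern (for example, a back-propagation routine):

```python
def total_gradient(pattern_gradient, weights, patterns):
    """Sum per-pattern gradients over all learning patterns.

    `pattern_gradient(weights, inputs, targets)` is a hypothetical callable
    returning a flat list of dE_p/dw values, one per weight.
    """
    total = [0.0] * len(weights)
    for inputs, targets in patterns:                     # step 75: loop over patterns
        grad_p = pattern_gradient(weights, inputs, targets)   # steps 70 to 73
        total = [acc + g for acc, g in zip(total, grad_p)]    # step 74: sum
    return total
```

Unlike the per-pattern correction of back-propagation, the weights stay fixed while this sum is formed, so the result is the gradient of the total error E at a single point in the weight space.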
At steps 57, 58 and 59, there is calculated a ratio $\beta(i)$ at which the conjugate vector (direction) of the previous line search is added to the gradient at the search start point of the present line search in order to generate the conjugate vector (direction) of the present line search. Here i is the number of calculations for obtaining the conjugate vector (direction), and n is the number of all weights in the network, i.e., the dimension n of the weight space described above; both are integers. If i mod n = 0, i.e., if the remainder of i divided by n is 0, $\beta(i)$ is set to 0 at step 58. Otherwise, $\beta(i)$ is calculated at step 59 using the expression (10). $\beta(i)$ is reset to 0 every n-th time because the line search must resume in the steepest descent direction every n-th time in order to prevent convergence from worsening.
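The branch at steps 57 to 59 can be sketched as a small function. Expression (10) is not reproduced in this text, so the ratio computed at step 59 below uses the Fletcher-Reeves formula as a stand-in; that choice, the function name and the flat-list gradient layout are assumptions.

```python
def beta_for_search(i, n, grad, prev_grad):
    """Return beta(i): 0 on every n-th search, otherwise a ratio of gradients."""
    if i % n == 0:
        return 0.0  # step 58: restart in the steepest descent direction
    # Step 59: Fletcher-Reeves ratio, assumed here in place of expression (10).
    return sum(g * g for g in grad) / sum(g * g for g in prev_grad)
```

With n = 3, for instance, searches 0, 3, 6 and so on start afresh from the gradient alone, while the searches in between blend in the previous conjugate direction.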
At step 60, from the gradient ##EQU10## and the previous conjugate vector (direction) $\mathbf{d}(i-1)$, the next conjugate vector (direction) $\mathbf{d}(i)$ is obtained using the expression (9).
At step 61, line searches are performed from the search start point in the conjugate direction to obtain a step position $\eta(i)$ that is considered to minimize the error E along that direction.
At step 62, the weights $\mathbf{w}$ are renewed using the step position $\eta(i)$ obtained at the step 61.
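The text does not say how the step position of step 61 is found. The sketch below uses a golden-section search over an assumed bracket [0, `eta_max`] as one possible line search, and then renews the weights as w + eta * d, which is also an assumed form of the update at step 62. The error function, bracket and names are hypothetical.

```python
import math


def golden_section(phi, lo=0.0, hi=1.0, tol=1e-8):
    """Locate a minimizer of a unimodal function phi on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if phi(c) < phi(d):
            b = d   # the minimum lies in [a, d]
        else:
            a = c   # the minimum lies in [c, b]
    return 0.5 * (a + b)


def line_search_update(error, weights, direction, eta_max=1.0):
    """Steps 61 and 62: pick eta along `direction`, then renew the weights."""
    def phi(eta):
        return error([w + eta * d for w, d in zip(weights, direction)])
    eta = golden_section(phi, 0.0, eta_max)
    return [w + eta * d for w, d in zip(weights, direction)], eta
```

Each evaluation of `phi` requires a full pass over the learning patterns when `error` is the total network error, so the number of evaluations per line search largely determines the cost of one iteration.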
The steps 56 to 62 are repeated until the output error falls below a predetermined upper error limit used as the criterion for completion of learning.
As shown in FIG. 4, in the above description, line search is first performed in the steepest descent direction from the search start point $\eta_0$ to obtain the error minimum point $\eta_{(0)}$ in the first learning (i=0). Then, in the second learning, line search is performed in the conjugate direction $\mathbf{d}(i+1)$ by using the point $\eta_{(0)}$ as the new search start point $\eta_0$. In this case, the previous conjugate vector (direction) $\mathbf{d}(i)$ is the steepest descent direction, i.e., the gradient. Thereafter, line searches in the conjugate direction are repeated to obtain a global minimum error point.
The principle of learning by a conventional conjugate gradient method has been described above.
The principles of the conjugate gradient method and back-propagation are compared with reference to FIGS. 4 and 7. FIG. 7 is a diagram illustrating a search for the minimum value of a two-dimensional, second-order function by using back-propagation. This method is not efficient because searches are performed in the steepest descent direction at predetermined step lengths $l_1, l_2, \ldots$ In contrast, in the case of the conjugate gradient method shown in FIG. 4, once the conjugate direction is found, the minimum error lies on a line in the conjugate direction. Therefore, by properly setting step lengths and performing line searches, it is possible to search for the minimum point efficiently.
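The inefficiency attributed to FIG. 7 can be illustrated by counting how many fixed-length steepest descent steps a two-dimensional, second-order function needs. The sketch below assumes $f(x) = x^{T} Q x$ with an arbitrary positive definite Q, a fixed step coefficient of 0.1 and a gradient-norm stopping tolerance; none of these values come from the source.

```python
def grad(Q, x):
    """Gradient of f(x) = x^T Q x for symmetric Q."""
    return [2.0 * sum(q * xi for q, xi in zip(row, x)) for row in Q]


def fixed_step_descent(Q, x, step, tol=1e-6, max_iter=10000):
    """Count fixed-step steepest descent updates until |grad| < tol."""
    for count in range(max_iter):
        g = grad(Q, x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            return count
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return max_iter


steps_needed = fixed_step_descent([[2.0, 1.0], [1.0, 3.0]], [1.0, 2.0], 0.1)
```

For these assumed values `steps_needed` is far larger than the two line searches that conjugate directions require for a two-dimensional, second-order function, although each line search itself costs several error evaluations.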