1. Field of the Invention
The present invention relates to a method of and a system for controlling learning of learning pattern data in a neural network used in pattern recognition, data compression, and the like.
2. Description of Related Art
A neural network is a simplified model of the structure of the human brain, in which neurons are coupled with each other via synapses, each of which transmits a signal in a single direction. By adjusting the resistance of each synapse, namely its weight, various kinds of processing can be accomplished. Each output supplied from other neurons to a given neuron is weighted in accordance with the weight assigned to the corresponding synapse, and the neuron applies a non-linear response function to the sum of the weighted inputs and delivers the result of the transformation to the next neurons as its output. That is, learning in a neural network means adjusting the weights of its synapses.
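The operation of a single neuron described above can be sketched as follows. This is only an illustrative sketch, not the claimed system; the sigmoid is chosen here as one example of a non-decreasing continuous response function.

```python
import math

def sigmoid(x):
    # A non-decreasing continuous non-linear response function.
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights):
    # Weight each incoming output by the synapse weight assigned to it,
    # sum the weighted inputs, and apply the non-linear response function.
    net = sum(w * o for w, o in zip(weights, inputs))
    return sigmoid(net)
```

With a zero sum of inputs the neuron outputs 0.5, the midpoint of the sigmoid; larger weighted sums push the output toward 1.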
Neural networks are structurally classified into two types, i.e., an interconnecting type and a multilayer type, which are suitable for optimization problems and for processing such as pattern recognition, respectively.
When a multilayer type of neural network is employed in pattern recognition and/or data compression, in order for the network to produce an appropriate answer to an unknown problem, it is necessary to cause the network to learn past cases in accordance with input and teacher signal patterns of learning pattern data. An optimization method for this purpose is called the back propagation method. This method is discussed, for example, in Rumelhart D. E., Hinton G. E., Williams R. J., "Parallel Distributed Processing: Explorations in the Microstructure of Cognition," Volume 1: Foundations, Chapter 8, pp. 322-328, The MIT Press, Cambridge, Mass. (1986). In this method, a non-decreasing continuous function such as a sigmoid function is utilized as the non-linear response function of a neuron, and the weights of the synapses are updated in accordance with the following equation (1), whereby the output error of the network is gradually reduced.

dw/dt = −∂E/∂w   (1)
w: Weight
E: Error of network output pattern
t: Time
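Equation (1) can be simulated by a discrete gradient descent step. The following minimal sketch uses a hypothetical quadratic error E(w) = (w − 3)², not an error of an actual network, to show how the weight slides toward the value minimizing the error.

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    # Discrete form of dw/dt = -dE/dw: repeatedly step
    # against the gradient of the error.
    for _ in range(steps):
        w -= eta * grad(w)
    return w

# For E(w) = (w - 3)**2 the gradient is dE/dw = 2 * (w - 3),
# so the iteration converges toward the minimum at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), 0.0)
```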
The method will now be described in detail with reference to FIG. 6. Learning pattern data 10 generally includes input signal data of (number of input-layer neurons) × (number of patterns) items and teacher signal data of (number of output-layer neurons) × (number of patterns) items. When an input signal pattern is inputted to the input layer of the neural network, an error between the pattern outputted from the output layer and the teacher signal pattern corresponding to the input signal pattern is calculated and used for the learning in the neural network.
In a step 37, an input signal pattern p of the learning pattern data 10 is supplied to the neurons of the input layer. In a step 38, according to the following equations (2) and (3), the input signal pattern is sequentially propagated toward the neurons of the output layer to finally attain an output pattern from the neurons of the output layer.

net_j = Σ_i w_ji · o_i   (2)

o_j = f(net_j)   (3)

As can be understood from the above equations, in order to obtain the sum of inputs to a neuron, an inner product between the outputs of the previous-layer neurons and the weights of the synapses connecting those neurons to the neuron is computed. In the following description, the vector having, as its elements, all the weights assigned to the synapses connected to a neuron will be called a weight vector, and the magnitude of the weight vector will be called its norm.
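The forward propagation of steps 37 and 38 can be sketched as below; the layer sizes and weight values are made-up illustrative assumptions, not those of the disclosed system.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(pattern, layers):
    # layers[k][j] is the weight vector of neuron j in layer k;
    # each layer computes the inner product (2) and the response (3).
    outputs = pattern
    for weight_matrix in layers:
        outputs = [sigmoid(sum(w * o for w, o in zip(row, outputs)))
                   for row in weight_matrix]
    return outputs

# A 2-input, 2-hidden-neuron, 1-output network with example weights.
hidden = [[0.5, -0.3], [0.2, 0.8]]
output = [[1.0, -1.0]]
result = forward([1.0, 0.0], [hidden, output])
```

Each layer's outputs serve as the inputs of the next layer, so the pattern propagates from the input layer to the output layer in one pass.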
In a step 39, an output error Ep for the pattern p is calculated in accordance with the following equation (4) from the teacher signal pattern p for the input signal pattern p and the output pattern of the output-layer neurons calculated in the step 38.

Ep = (1/2) Σ_j (t_pj − o_pj)²   (4)

In a step 40, the gradient of the error E with respect to each weight w of each neuron is calculated according to the following equations (5) and (6).

∂E/∂w_ji = δ_j · o_i, where δ_j = −(t_j − o_j) · f'(net_j)   (5)

for an output-layer neuron, and

δ_j = f'(net_j) · Σ_k δ_k · w_kj   (6)

for a hidden-layer neuron. A current correction of a concerned weight is determined from the gradient of the error and the previous correction thereof by use of the following equation (7).

Δw(t) = −η · ∂E/∂w + α · Δw(t−1)   (7)
η: Learning coefficient
α: Momentum coefficient
Here, the second term on the right side of equation (7) is called the momentum term and is added empirically to accelerate the learning.
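Steps 39 and 40 can be sketched for a single output neuron as follows; this simplified example omits hidden layers, and the coefficients η and α are arbitrary illustrative values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_step(weights, prev_delta, inputs, teacher, eta=0.5, alpha=0.5):
    # Forward pass (equations (2) and (3)).
    net = sum(w * x for w, x in zip(weights, inputs))
    o = sigmoid(net)
    # Output error for this pattern (equation (4), single output).
    error = 0.5 * (teacher - o) ** 2
    # Gradient dE/dw with f'(net) = o * (1 - o) (equation (5)).
    grad = [-(teacher - o) * o * (1.0 - o) * x for x in inputs]
    # Correction with momentum term (equation (7)).
    delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
    weights = [w + dw for w, dw in zip(weights, delta)]
    return weights, delta, error
```

Repeating `update_step` on the same pattern drives the output toward the teacher signal and the error downward.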
In a step 41, the learning pattern is updated in preparation for the subsequent learning. In a step 42, a check is made to determine whether or not the learning has been completed for all the patterns of the learning pattern data 10. If all the patterns have been completed, control is passed to a step 43; otherwise, control returns to the step 37 to repeat the learning. In the step 43, the number of iterations of the learning is incremented by one.
The above steps are repeatedly carried out until the output error becomes equal to or less than a predetermined upper limit preset as the criterion of completion of the learning.
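The overall loop of steps 37 to 43 can be sketched as follows for a single-neuron network; the pattern data, coefficients, and error limit are illustrative assumptions, with the last input of each pattern fixed at 1.0 to serve as a bias.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(patterns, weights, eta=0.5, alpha=0.5, e_max=0.01, max_iter=10000):
    # patterns: list of (input_signal, teacher_signal) pairs.
    delta = [0.0] * len(weights)
    for iteration in range(1, max_iter + 1):
        total_error = 0.0
        for inputs, teacher in patterns:                    # steps 37-41
            net = sum(w * x for w, x in zip(weights, inputs))
            o = sigmoid(net)
            total_error += 0.5 * (teacher - o) ** 2        # equation (4)
            grad = [-(teacher - o) * o * (1.0 - o) * x for x in inputs]
            delta = [-eta * g + alpha * d for g, d in zip(grad, delta)]
            weights = [w + dw for w, dw in zip(weights, delta)]
        if total_error <= e_max:   # completion criterion, steps 42-43
            return weights, iteration
    return weights, max_iter

# Two linearly separable patterns; the last input (1.0) acts as a bias.
data = [([1.0, 0.0, 1.0], 0.9), ([0.0, 1.0, 1.0], 0.1)]
w, iterations = train(data, [0.0, 0.0, 0.0])
```

For this separable problem the loop terminates well before the iteration limit, with the trained neuron outputting a high value for the first pattern and a low value for the second.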
The principle of learning by the back propagation method has been described above. However, this learning suffers from the following two problems: (1) protraction of the learning due to an excessive increase of the weights, and (2) trapping in a local minimum.
First, the problem (1) will be described. In the step 40, in order to obtain the gradient of the error with respect to a weight, −∂E/∂w, the derivative f' of the sigmoid function f must be multiplied by the output o. When the sum of inputs to a neuron becomes excessively large, the derivative f' enters a flat region of the function and hence becomes very small. As a result, the gradient of the error becomes very small and the learning hardly advances. The increase in the sum of inputs, which causes the protraction of the learning, takes place because the norms of the weight vectors of the neurons increase excessively in the course of the learning. This phenomenon will be described with reference to FIGS. 15 and 16.
FIG. 15 shows the dependency of the sigmoid function upon the weight parameter w. The following equations (8) and (9) indicate that the greater the weight w is, namely, the greater the norm of the weight vector is, the more steeply the function value rises in the proximity of o = 0.

f(net) = 1/(1 + exp(−net))   (8)

net = w · o   (9)
On the other hand, the value of df/d(net) employed in the learning is quickly reduced as the sum of inputs, net, increases, as shown in FIG. 16. Since the sum of inputs, net, is proportional to the weight w, df/d(net) is reduced when the weight w becomes greater.
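The collapse of df/d(net) for large weights can be checked numerically; the following is a minimal sketch with arbitrary weight values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(net):
    # df/d(net) = f(net) * (1 - f(net)); maximal (0.25) at net = 0.
    s = sigmoid(net)
    return s * (1.0 - s)

# With a fixed input o = 1.0, net = w * o grows with the weight,
# so the gradient factor df/d(net) rapidly approaches zero.
o = 1.0
gradients = [dsigmoid(w * o) for w in (1.0, 5.0, 20.0)]
```

Even for a moderate weight the derivative falls by orders of magnitude, which is exactly the multiplicative factor that stalls the learning.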
This phenomenon will be further described in detail with reference to FIGS. 17 and 18, which show the states of a neuron having two-dimensional inputs before and after the learning, respectively, when a linear separation of two signals, a and b, is to be performed. Namely, when the signals a and b are inputted to the neuron after completion of the learning, the neuron outputs the values 0.9 and 0.1 for the signals a and b, respectively. The following equations (10) and (11) are used to calculate the output O3 of the neuron.

O3 = f(net) = 1/(1 + exp(−net))   (10)

net = w1·o1 + w2·o2 + θ   (11)
FIG. 17 shows the state of the neuron prior to the learning, in which U represents the straight line for net = 0; the plane associated therewith will be called a separating plane hereinbelow. In this state, the neuron outputs the values 0.6 and 0.4 for the signals a and b, respectively. In order to establish the neuron state shown in FIG. 18 through the learning, it is necessary to rotate the straight line U so as to separate it from the signals a and b as far as possible, and it is also necessary to increase the norm of the weight vector to steeply raise the sigmoid function. However, if the norm becomes great before the straight line U is fully rotated, the value of df/d(net), which enters the gradient of the error as a multiplicative factor, is decreased, resulting in substantial protraction of the learning.
Next, description will be given of the problem (2), i.e., the local minimum. The back propagation method is an optimization method using a gradient descent and is also called a hill-climbing method. Consequently, when an optimal solution is sought according to the error and its gradient, the neural network may be trapped in a local minimum, and hence the learning cannot be completed in some cases.
As a result, in a learning system adopting the conventional back propagation method, which suffers from the above problems inherent in its principle, the overall period of time consumed for the learning is likely to be prolonged in many cases for the following reasons:
When it is to be determined whether or not the learning is in protraction, the number of iterations of the learning and the output error are generally used. However, the rate at which the output error decreases through the back propagation is not uniform and varies considerably as the number of iterations increases. It is therefore difficult to appropriately determine the protraction of the learning. As a result, an unsatisfactory learning result is obtained in some cases despite an excessively long learning period.
To escape from the learning protraction, several restoring methods are known. However, there is insufficient information for determining which of these methods should be used, so that an appropriate method cannot be selected in many cases. As a result, the learning cannot be correctly restored from the protraction, or the restoration is delayed.