1. Field of the Invention
The present invention relates to a neural network, and particularly to a neural network which can learn with high precision and at high speed.
2. Description of the Prior Art
Neural networks which perform signal operations in parallel using neurons that process signals in a similar manner to neural elements can be provided with many functions such as
(1) a pattern recognition function,
(2) an interpolation function, and
(3) predicting and optimizing multivariable functions,
and have attracted considerable attention as new information processing means not only in the manufacturing industry but also in a variety of fields such as medical science and economic forecasting.
In a neural network of a hierarchical structure consisting of an input layer, a hidden layer and an output layer, a learning algorithm of back propagation using training data has been known.
The learning algorithm based on the conventional back propagation operation is described below.
FIG. 14 shows an example of the neural network of a three-layer structure, wherein the input layer consists of P neurons 100(p) (p=1~P), the hidden layer consists of Q neurons 200(q) (q=1~Q) and the output layer consists of R neurons 300(r) (r=1~R). In this example, P=4, Q=6 and R=2. The numbers of neurons included in the input and output layers depend on the number of input and output data, and the number of neurons of the hidden layer can arbitrarily be set.
FIG. 15 is a conceptual diagram showing a neuron (unit) which is an element of a neural network. Each neuron takes as input data the values obtained by multiplying a plurality of supplied data X(1)~X(n) by the respective weights W(1)~W(n), and calculates a value "x" by subtracting a threshold value θ from their total sum ΣW(n)X(n). A function which is differentiable with respect to "x" is used as the transfer function f(x) having this "x" as a variable. A sigmoid function, a typical example of a transfer function, is given by equation (1) in FIG. 24A.
The relationships between "x" and f(x) when coefficient "k" of the sigmoid function (hereinafter referred to as gradient "k") is 0.1, 1.0 and 10 are shown in FIG. 16, and the derivative of the sigmoid function is given by equation (2) in FIG. 24A.
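Equations (1) and (2) appear only in FIG. 24A and are not reproduced in this text. The following is a minimal sketch of the neuron of FIG. 15, assuming that equation (1) has the common logistic form f(x) = 1/(1 + exp(-kx)), in which case its derivative is k·f(x)·(1 - f(x)). The function names are illustrative and do not come from the specification.

```python
import math

def sigmoid(x, k=1.0):
    """Assumed form of equation (1): f(x) = 1 / (1 + exp(-k*x))."""
    return 1.0 / (1.0 + math.exp(-k * x))

def sigmoid_derivative(x, k=1.0):
    """Derivative of the assumed sigmoid: f'(x) = k * f(x) * (1 - f(x))."""
    fx = sigmoid(x, k)
    return k * fx * (1.0 - fx)

def neuron_output(inputs, weights, theta, k=1.0):
    """FIG. 15 neuron: x = sum of W(n)*X(n) minus theta, output = f(x)."""
    x = sum(w * xi for w, xi in zip(weights, inputs)) - theta
    return sigmoid(x, k)

if __name__ == "__main__":
    # Compare the three gradients plotted in FIG. 16 at a single point x = 1.
    for k in (0.1, 1.0, 10.0):
        print(k, sigmoid(1.0, k), sigmoid_derivative(1.0, k))
```

Under this assumed form, a larger "k" makes the curve steeper around x = 0, where the slope equals k/4.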
With reference to FIG. 17, a learning algorithm by the conventional back propagation is now described. XO1(1), XO1(2), . . . XO1(P) are output data of the individual neurons 100(p) in the input layer, XO2(1), XO2(2), . . . XO2(Q) are output data of the individual neurons 200(q) in the hidden layer, XO3(1), XO3(2), . . . XO3(R) are output data of the individual neurons 300(r) in the output layer, XI2(1), XI2(2), . . . XI2(Q) are input data of the individual neurons 200(q) in the hidden layer, XI3(1), XI3(2), . . . XI3(R) are input data of the individual neurons 300(r) in the output layer, and T(1), T(2), . . . T(R) are training data.
W12 is weighting data for obtaining input data of the hidden layer from output data of the input layer. W12(1,1) is weighting data when the output XO1(1) is given to XI2(1), and W12(2,2) is weighting data when the output XO1(2) is given to XI2(2). In general, W12(P,Q) is weighting data when XO1(P) is given to XI2(Q). Accordingly, although not shown, for instance, the weighting data when XO1(2) is given to XI2(1) is W12(2,1), and that when XO1(P) is given to XI2(1) is W12(P,1).
Similarly, W23 is weighting data for obtaining input data of the output layer from output data of the hidden layer, and in general, W23(Q,R) is weighting data when XO2(Q) is given to XI3(R). θ2(1), θ2(2), . . . θ2(Q) are threshold values of the hidden layer, and θ3(1), θ3(2), . . . θ3(R) are threshold values of the output layer.
XI2(1), input data of the hidden layer, is expressed by equation (3).
XI2(1) = XO1(1)×W12(1,1) + XO1(2)×W12(2,1) + . . . + XO1(P)×W12(P,1)   (3)
Similarly, XI2(Q) is expressed by equation (4).
XI2(Q) = XO1(1)×W12(1,Q) + XO1(2)×W12(2,Q) + . . . + XO1(P)×W12(P,Q)   (4)
XI3(1), input data of the output layer, is expressed by equation (5).
XI3(1) = XO2(1)×W23(1,1) + XO2(2)×W23(2,1) + . . . + XO2(Q)×W23(Q,1)   (5)
Similarly, XI3(R) is expressed by equation (6).
XI3(R) = XO2(1)×W23(1,R) + XO2(2)×W23(2,R) + . . . + XO2(Q)×W23(Q,R)   (6)
In the hidden layer, output data XO2(1) is obtained from input data XI2(1) by equation (7).
XO2(1) = f{XI2(1) - θ2(1)}   (7)
Similarly, the calculation for obtaining XO2(Q) from XI2(Q) is based on equation (8).
XO2(Q) = f{XI2(Q) - θ2(Q)}   (8)
The calculation for obtaining XO3(1) from input data XI3(1) and the calculation for obtaining XO3(R) from input data XI3(R) are based on equations (9) and (10), respectively.
XO3(1) = f{XI3(1) - θ3(1)}   (9)
XO3(R) = f{XI3(R) - θ3(R)}   (10)
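The forward calculation of equations (3) to (10) can be sketched as follows. This is an illustrative sketch only: the logistic form of f(x) is an assumption (equation (1) is given only in FIG. 24A), the function names are hypothetical, and weights[i][j] stands for W12(p,q) or W23(q,r) with zero-based indices.

```python
import math

def sigmoid(x, k=1.0):
    # Assumed logistic form of the transfer function f(x).
    return 1.0 / (1.0 + math.exp(-k * x))

def layer_inputs(outputs_prev, weights):
    """Equations (3)-(6): XI(j) = sum over i of XO(i) * W(i, j)."""
    n_next = len(weights[0])
    return [sum(outputs_prev[i] * weights[i][j] for i in range(len(outputs_prev)))
            for j in range(n_next)]

def layer_outputs(inputs, thetas, k=1.0):
    """Equations (7)-(10): XO(j) = f(XI(j) - theta(j))."""
    return [sigmoid(xi - th, k) for xi, th in zip(inputs, thetas)]

def forward(xo1, w12, theta2, w23, theta3, k=1.0):
    """Propagate input-layer outputs XO1 through the hidden and output layers."""
    xi2 = layer_inputs(xo1, w12)          # hidden-layer inputs XI2(q)
    xo2 = layer_outputs(xi2, theta2, k)   # hidden-layer outputs XO2(q)
    xi3 = layer_inputs(xo2, w23)          # output-layer inputs XI3(r)
    xo3 = layer_outputs(xi3, theta3, k)   # output-layer outputs XO3(r)
    return xi2, xo2, xi3, xo3

if __name__ == "__main__":
    # Network of FIG. 14 (P=4, Q=6, R=2) with all weights and thresholds zero.
    P, Q, R = 4, 6, 2
    w12 = [[0.0] * Q for _ in range(P)]
    w23 = [[0.0] * R for _ in range(Q)]
    print(forward([1.0, 0.0, 1.0, 0.0], w12, [0.0] * Q, w23, [0.0] * R)[3])
```

With every weight and threshold at zero, each neuron receives x = 0, so each output equals f(0) under the assumed transfer function.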
When output data XO3(r) (r=1~R) have been obtained from the R neurons constituting the output layer for data XO1(p) (p=1~P) given to the P neurons constituting the input layer, as described above, error data E are calculated on the basis of a predetermined cost function using the output data XO3(r) (r=1~R) and training data T(r) (r=1~R). When training data are used as in this example, such an error function as shown in equation (11) of FIG. 24A can be used as a cost function. Then, equations (12) to (15) are calculated using the error or cost function E.
Equations (12) to (15) of FIG. 24A represent partial differentiations of the error or cost function E by weighting variables W23(q,r), threshold variables θ3(r), weighting variables W12(p,q) and threshold variables θ2(q), respectively. That is, in equation (12), E is partially differentiated by all the combinations of W23(q,r) (all the combinations of q=1~Q and r=1~R, namely, W23(1,1), W23(2,1), . . . , W23(Q,1), W23(1,2), W23(2,2), . . . , W23(Q,2), . . . , W23(1,R), W23(2,R), . . . , W23(Q,R)). Similarly for equations (13) to (15), E is partially differentiated using all of the threshold variables θ3(r), weighting variables W12(p,q) and threshold variables θ2(q).
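The partial differentiations of equations (12) to (15) can be sketched by the chain rule. Equations (11) to (15) are given only in FIG. 24A, so this sketch assumes that E is half the sum of squared differences between XO3(r) and T(r), and that f(x) is the logistic sigmoid with gradient "k". Both are assumptions, and all function names are hypothetical.

```python
import math

def f(x, k):
    # Assumed logistic transfer function.
    return 1.0 / (1.0 + math.exp(-k * x))

def forward(xo1, w12, th2, w23, th3, k):
    """Equations (3)-(10): returns hidden outputs XO2 and output-layer outputs XO3."""
    xo2 = [f(sum(xo1[p] * w12[p][q] for p in range(len(xo1))) - th2[q], k)
           for q in range(len(th2))]
    xo3 = [f(sum(xo2[q] * w23[q][r] for q in range(len(xo2))) - th3[r], k)
           for r in range(len(th3))]
    return xo2, xo3

def error(xo3, t):
    # Assumed form of equation (11): half the sum of squared output errors.
    return 0.5 * sum((o - ti) ** 2 for o, ti in zip(xo3, t))

def gradients(xo1, t, w12, th2, w23, th3, k=1.0):
    """Partial derivatives of E by W12(p,q), theta2(q), W23(q,r) and theta3(r)."""
    xo2, xo3 = forward(xo1, w12, th2, w23, th3, k)
    P, Q, R = len(xo1), len(xo2), len(xo3)
    # dE/dx at each output neuron, where x = XI3(r) - theta3(r).
    d3 = [(xo3[r] - t[r]) * k * xo3[r] * (1.0 - xo3[r]) for r in range(R)]
    # dE/dx at each hidden neuron, propagated back through W23.
    d2 = [k * xo2[q] * (1.0 - xo2[q]) * sum(d3[r] * w23[q][r] for r in range(R))
          for q in range(Q)]
    g_w23 = [[d3[r] * xo2[q] for r in range(R)] for q in range(Q)]
    g_th3 = [-d3[r] for r in range(R)]   # x decreases as theta increases
    g_w12 = [[d2[q] * xo1[p] for q in range(Q)] for p in range(P)]
    g_th2 = [-d2[q] for q in range(Q)]
    return g_w12, g_th2, g_w23, g_th3
```

The sign of the threshold derivatives is negative because the threshold is subtracted inside f, as in equations (7) to (10).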
Then, the changing amounts in the current processing of each weighting data and threshold value, ΔW12(p,q)0, Δθ2(q)0, ΔW23(q,r)0 and Δθ3(r)0, are determined from equations (16) to (19).
The various changing amounts in the processings one time before (previous), two times before, . . . , and N times before are distinguished by replacing the suffix 0 for the changing amount in the current processing shown in equations (16) to (19) with 1~N, as shown in equation (20). Further, the various changing coefficients for the current, one time before, two times before, . . . , and N times before processings are determined by equation (21).
These changing coefficients α0~αN and β0~βN may be preset to any values. The accumulated changing amounts ΔW12(p,q), Δθ2(q), ΔW23(q,r) and Δθ3(r) used in current processing are calculated from equations (22) to (25).
The accumulated changing amounts ΔW12(p,q), Δθ2(q), ΔW23(q,r) and Δθ3(r) calculated in this way are added to the weighting data and threshold value data W12(p,q), θ2(q), W23(q,r) and θ3(r), respectively, to correct them.
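Equations (16) to (25) are not reproduced in this text, so the following is only one plausible reading of the accumulation described above. It assumes that the changing amount in the current processing is the negative of the corresponding partial derivative, and that the accumulated changing amount is the sum of the current and N previous changing amounts, each multiplied by its preset changing coefficient. The class name and its interface are hypothetical.

```python
from collections import deque

class AccumulatedChange:
    """Combines the current and N previous changing amounts of one parameter.

    coefficients = [c0, c1, ..., cN] are the preset changing coefficients for
    the current, previous, ..., N-times-before processings (assumed usage).
    """

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)
        # history[0] is the current changing amount, history[n] the one n steps back.
        self.history = deque(maxlen=len(self.coefficients))

    def update(self, gradient):
        # Assumption: the current changing amount is the negative gradient.
        self.history.appendleft(-gradient)
        # Assumed accumulation: weighted sum over the stored changing amounts.
        return sum(c * d for c, d in zip(self.coefficients, self.history))

if __name__ == "__main__":
    acc = AccumulatedChange([0.5, 0.25])   # N = 1: current and previous step
    for g in (2.0, -4.0, 0.0):
        print(acc.update(g))
```

With N = 1 this reduces to a gradient step plus a term carried over from the previous step; the returned value would then be added to the weighting data or threshold value it belongs to.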
If there are "A" sets of combinations of input data and training data, this processing is repeated "A" times, once for each set. This completes one learning operation. Such learning operations are then executed a predetermined number of times.
For example, for the learning of the exclusive-OR (XOR) problem shown in Table 1 described at the end of the specification, the numbers of neurons are two for the input layer and one for the output layer, and the hidden layer may have any number of neurons. In this example, the number of sets of combinations of input data and training data is four.
As mentioned above, in the conventional back propagation operation, only weighting data and threshold values were subject to change by learning. The above described learning process is further described with reference to FIG. 18.
FIG. 18 is a flowchart showing an example of the learning algorithm according to the traditional back propagation operation.
In step S1, an input data pattern is selected (in the example of Table 1, one of four sets of input data is selected), and the selected data is supplied to the input layer.
In step S2, predetermined calculations are performed in the input, hidden and output layers using the input data. As a result, data (the calculation result) is output from the output layer.
In step S3, the output data is compared with the training data corresponding to the selected input data, and error data "E" is calculated by equation (11).
In step S4, accumulated changing amounts ΔW12(p,q), Δθ2(q), ΔW23(q,r) and Δθ3(r) are calculated by equations (22) to (25).
In step S5, the calculated accumulated changing amounts are added to W12(p,q), θ2(q), W23(q,r) and θ3(r), which are weighting data and threshold value data, respectively, thereby to correct them.
In step S6, it is determined whether or not all of the input patterns have been selected. If any input pattern has not yet been selected, the process returns to step S1; if all have been selected, the process moves to step S7.
In step S7, it is determined that one learning operation has been completed.
In step S8, it is determined whether or not the learning has been completed a predetermined number of times, and the process returns to step S1 if it has not and terminates if it has. Alternatively, in step S8, it may be determined whether or not output data can be obtained with precision higher than a predetermined value, that is, whether or not the error data "E" has become smaller than a predetermined value.
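Steps S1 to S8 can be sketched as a training loop for the exclusive-OR problem. This is a simplified sketch under stated assumptions: f(x) is taken to be the logistic sigmoid, E is taken to be half the squared output error, and the accumulated changing amount is reduced to the current changing amount multiplied by a single coefficient alpha (that is, N = 0). The function name, parameter names, initial weight range and default values are all hypothetical.

```python
import math
import random

# The four XOR input/training pairs (the standard XOR truth table).
XOR_PATTERNS = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
                ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]

def f(x, k):
    # Assumed logistic transfer function with gradient k.
    return 1.0 / (1.0 + math.exp(-k * x))

def train(patterns, n_hidden=3, k=1.0, alpha=0.5, max_epochs=1000,
          target_error=None, seed=0):
    """Returns the mean error E of each learning operation (one pass over all patterns)."""
    rng = random.Random(seed)
    P, Q, R = len(patterns[0][0]), n_hidden, len(patterns[0][1])
    w12 = [[rng.uniform(-1, 1) for _ in range(Q)] for _ in range(P)]
    th2 = [rng.uniform(-1, 1) for _ in range(Q)]
    w23 = [[rng.uniform(-1, 1) for _ in range(R)] for _ in range(Q)]
    th3 = [rng.uniform(-1, 1) for _ in range(R)]
    history = []
    for _ in range(max_epochs):                       # S8: repeat learning
        total = 0.0
        for xo1, t in patterns:                       # S1/S6: every pattern once
            # S2: forward calculation through hidden and output layers.
            xo2 = [f(sum(xo1[p] * w12[p][q] for p in range(P)) - th2[q], k)
                   for q in range(Q)]
            xo3 = [f(sum(xo2[q] * w23[q][r] for q in range(Q)) - th3[r], k)
                   for r in range(R)]
            # S3: error data E for this pattern (assumed squared-error form).
            total += 0.5 * sum((xo3[r] - t[r]) ** 2 for r in range(R))
            # S4: changing amounts from the partial derivatives of E.
            d3 = [(xo3[r] - t[r]) * k * xo3[r] * (1.0 - xo3[r]) for r in range(R)]
            d2 = [k * xo2[q] * (1.0 - xo2[q]) * sum(d3[r] * w23[q][r] for r in range(R))
                  for q in range(Q)]
            # S5: correct the weighting data and threshold values.
            for q in range(Q):
                for r in range(R):
                    w23[q][r] -= alpha * d3[r] * xo2[q]
            for r in range(R):
                th3[r] += alpha * d3[r]               # dE/dtheta3 = -d3
            for p in range(P):
                for q in range(Q):
                    w12[p][q] -= alpha * d2[q] * xo1[p]
            for q in range(Q):
                th2[q] += alpha * d2[q]               # dE/dtheta2 = -d2
        history.append(total / len(patterns))         # S7: one learning operation done
        if target_error is not None and history[-1] < target_error:
            break                                     # S8 alternative: precision reached
    return history

if __name__ == "__main__":
    errors = train(XOR_PATTERNS, k=1.0, max_epochs=2000)
    print(errors[0], errors[-1])
```

Calling train with different values of k gives a simple way to observe how the fixed gradient affects convergence on a given problem; no particular outcome is claimed here.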
The functions of the above described neural network are described more specifically using FIG. 19, in which input data and training data as shown in Table 1 are stored in an input/training data memory means 11, for instance. A selection means 12 sequentially transfers input data from the memory means 11 to neurons 100(p) of the input layer, and simultaneously outputs the training data corresponding to the transferred input data to error calculation means 13.
Weighting calculation means 51 multiplies each output signal XO1(p) of neurons 100(p) by weighting data W12(p,q) stored in weighting data memory means 21 and outputs the result to each neuron 200(q) of the hidden layer. Similarly, weighting calculation means 52 multiplies each output signal XO2(q) of neurons 200(q) by weighting data W23(q,r) stored in weighting data memory means 22 and outputs the result to each neuron 300(r) of the output layer.
Transfer function calculation means 61 performs the calculations shown in equations (7), (8), etc. in each neuron 200(q), using the input data and the threshold values θ2(q) stored in threshold value memory means 31. Similarly, transfer function calculation means 62 performs the calculations shown in equations (9), (10), etc. within each neuron 300(r), using the input data and the threshold values θ3(r) stored in threshold value memory means 32.
The error calculation means 13 calculates error data "E" by performing the calculation of equation (11) using the training data and the data output from the output layer.
Accumulated changing amount calculation means 20 uses error data "E" and weighting data W12(p,q) to calculate accumulated changing amount ΔW12(p,q). In addition, using the error data "E" and weighting data W23(q,r), it calculates accumulated changing amount ΔW23(q,r).
Accumulated changing amount calculation means 30 uses error data "E" and threshold values θ2(q) to calculate accumulated changing amount Δθ2(q). Also, using the error data "E" and threshold values θ3(r), it calculates accumulated changing amount Δθ3(r).
The accumulated changing amounts ΔW12(p,q), Δθ2(q), ΔW23(q,r) and Δθ3(r) calculated in the individual accumulated changing amount calculation means 20 and 30 are added, in the adding means 23, 33, 24 and 34, respectively, to W12(p,q), θ2(q), W23(q,r) and θ3(r) which are the weighting data and threshold values respectively stored in the memory means 21, 31, 22 and 32. The weighting data and threshold values after the additions or change are stored again in the memory means 21, 31, 22 and 32.
In such conventional back propagation, the learning speed and precision of a neural network varied greatly depending on the characteristics of the transfer function of the individual neurons (for the sigmoid function used in the above example, the value of gradient "k"), and sometimes the learning did not advance or the error could not converge.
Table 2 shows the result obtained by the learning of the exclusive-OR problem shown in Table 1 by a neural network which uses a sigmoid function as a transfer function according to the prior art. The neural network used in the learning comprises two neurons for the input layer, three neurons for the hidden layer and one neuron for the output layer. Table 2 shows the error sum of squares after the learning was performed ten thousand times with the gradient "k" of the sigmoid function fixed at each of 0.1, 1.0, 2.0, 3.0 and 10.0. The error sum of squares is the error data for one-time learning, namely, the mean of the four values of E calculated when all four input patterns are supplied, each E being defined by equation (11).
As mentioned above, the learning speed and precision of the neural network change greatly when the gradient "k" of the sigmoid function varies. Accordingly, learning with high precision can be performed in a short time if the gradient "k" is set to an optimum value.
However, the studies made so far have revealed only that the optimum value of the gradient "k" depends on the problem and on the number of neurons constituting the neural network. The optimum value of "k" must therefore be found by trial and error for each problem, which is cumbersome and time-consuming. In addition, sufficient precision cannot be obtained unless "k" is set to a different value for each neuron.