I. Field of Invention
This invention relates in general to an improved method of performing Back Propagation in the training of Neural Networks, and more specifically to a method utilizing pulse trains as the information transmission mechanism within a Neural Network.
II. Background Art
As used herein, a neural network performs a mapping from input data to output data, i.e., it receives a set of inputs, in some form, from an external source and from them produces a set of outputs, in some form. A normal computer is a typical device which performs just such a function. A Back Propagation Neural Network is supposed to learn or adapt to perform the mapping by being given stereotypical examples of what it is supposed to do. This is in contrast to a normal computer, which must be told how to perform a mapping function.
Implementing Neural Networks in Integrated Circuits (IC) is desirable.
One of the major difficulties in implementing a neural network on an IC is that, in the theoretical ideal, a neural network is inherently an analog process which necessitates a great number of analog components, including analog multipliers. Conventional Neural Networks, and Back Propagation Neural Networks in particular, which use analog methods, are quite sensitive to Offset Errors. Offset Errors are a particular kind of inaccuracy inherent in analog multipliers (and analog amplifiers in general). Analog multipliers, instead of giving the desired Z=A.times.B, give instead Z=(A+Offset.sub.A).times.(B+Offset.sub.B)+Offset.sub.Z. Because of the offset errors, a Network often is unable to converge to a reasonable answer, and therefore cannot learn to perform the desired mapping function.
Also, the Back Propagation algorithm utilizes an almost arbitrary, nonlinear function and its derivative. These two functions must be fairly accurate, or the algorithm will not be able to learn. Implementing these two functions with the required accuracy can be difficult.
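The effect of the offset terms can be illustrated numerically. The following Python sketch compares an ideal multiplier with one obeying the offset formula given above. The function names and the offset values are hypothetical and chosen only for illustration; they are not part of the disclosure.

```python
def ideal_multiply(a, b):
    """Ideal multiplier: Z = A x B."""
    return a * b


def offset_multiply(a, b, offset_a=0.02, offset_b=-0.01, offset_z=0.005):
    """Multiplier with offsets: Z = (A + Offset_A) x (B + Offset_B) + Offset_Z.

    The default offsets are hypothetical values used only for illustration.
    """
    return (a + offset_a) * (b + offset_b) + offset_z


if __name__ == "__main__":
    # With a zero input the ideal product is zero, but the offsets leak a
    # nonzero result that a learning rule would treat as a real signal.
    for a, b in [(0.0, 0.0), (0.0, 1.0), (0.5, 0.5)]:
        print(a, b, ideal_multiply(a, b), offset_multiply(a, b))
```

In a learning circuit such a spurious product would be accumulated into a weight on every update, which is one plausible reading of why offsets are said to prevent convergence.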
This invention discloses a family of physical devices that overcome these problems.
Original theoretical approaches towards neural networks are based upon the idea that when two neurons in the brain are active there is a correlation between them. One early rule developed by D. O. Hebb is described in his book "The Organization of Behaviour", Wiley, 1949. The Hebbian rule states that when two neurons are firing simultaneously an association link between them is strengthened. Accordingly, the next time either of the two neurons fires, the other one is more probable to fire also. However, the Hebbian rule is not a sufficient model to explain the learning process. Under the Hebbian rule, the connection strengths between neurons grow without bound. If maximums are placed on the connection strengths, these maximums are always reached.
Subsequently, the Perceptron Model was developed by Frank Rosenblatt, and is discussed in his book "Principles of Neurodynamics", Spartan, 1962. The Perceptron Model was originally believed powerful enough to enable a machine to learn in a human-like manner.
The Perceptron Model includes input, hidden and output layers; each comprised of one or more processing elements. In response to input stimuli, the input layer provides information to the hidden layer. Similarly, the hidden layer provides information to the output layer. Connections between the input and hidden processing elements are fixed; connections between the hidden and output processing elements are adjustable.
In the Perceptron Model, if the inputs are boolean (i.e. either zero or one), then the intended purpose of the hidden layer is to extract some kind of features from the input data. However, if the inputs to the Model are continuous numbers (i.e. having more than two distinct values, rather than just two boolean values), then the hidden layer is not used. Instead, the outputs of the input layer are connected directly to the inputs of the output layer.
In the Perceptron Model, all learning takes place in the output layer. Under the Perceptron Model many problems have been experimentally and mathematically shown to be representable by connection strengths between layers. Rosenblatt's Perceptron Learning Algorithm enables a neural network to find a solution if there exists a representation for that problem by some set of connection strengths. Rosenblatt's Perceptron Convergence Proof is a well known mathematical proof that a Perceptron System will find a solution if it exists.
In operation, the Perceptron Model modifies the strengths of the weighted connections between the processing elements, to learn an appropriate output response corresponding to a particular input stimulus vector. The modification of the connection weights occurs when an incorrect output response is given. This modification of the weights changes the transfer of information from the input to output processing elements so that eventually the appropriate output response will be provided. However, through experimentation, it was discovered that the Perceptron Model was unable to learn all possible functions. It was hoped that these unlearnable functions were only pathological cases, analogous to certain problems that humans cannot solve. This is not the case. Perceptron Systems cannot represent and learn some very simple problems that humans are able to learn and represent.
An example of a problem that the Perceptron Model is unable to represent (without 2.sup.N hidden processing elements, where N is the number of input nodes), and therefore cannot learn, is the parity or "exclusive-or" boolean function. To perform such a problem (with fewer than 2.sup.N hidden processing elements) a system would require two layers of modifiable weights. The Perceptron System cannot properly adjust more than one layer of modifiable weights. It was speculated that no learning mechanism for a system with multiple layers of modifiable weights would ever be discovered because none existed (Minsky & Papert, 1969, in "Perceptrons").
(The problem with using 2.sup.N hidden units is three-fold. First, since the hidden units, in the Perceptron Model, do not adapt, all the units must be present, regardless of the function which needs to be learned, so that all functions can be learned. Second, the number of units required grows phenomenally; for example, 2.sup.34 is approximately 17 billion, more neurons than in a human brain; this means that the largest parity problem the human brain could solve, if wired in this manner, would have at most 32 inputs. Third, the system would not generalize; given two input/output vector pairs near one another, one trained and the other not, the system should be able to interpolate the answer from the first; with a large number of hidden units, it has been experimentally shown that this is not the case.)
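The representational limit discussed above can be demonstrated with a small experiment. The Python sketch below trains a single threshold unit with the perceptron rule, which is a simplification of the Perceptron Model (no fixed hidden layer is included). The function name, the training constants and the sample sets are assumptions made for illustration.

```python
def train_perceptron(samples, epochs=100):
    """Train one threshold unit with the perceptron rule.

    Returns (weights, bias, converged); converged is True when a full
    epoch passes with no misclassified sample.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            out = 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
            delta = target - out
            if delta != 0:
                # Weights change only when the response is incorrect.
                errors += 1
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
        if errors == 0:
            return w, b, True
    return w, b, False


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

if __name__ == "__main__":
    print("AND converged:", train_perceptron(AND_SAMPLES)[2])
    print("XOR converged:", train_perceptron(XOR_SAMPLES)[2])
```

The AND function is linearly separable, so the rule settles on a solution; exclusive-or is not, so no setting of a single layer of weights classifies all four patterns and the loop never completes an error-free epoch.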
Almost all adaptive neural systems share several features in common. Typically the processing elements of all systems have an output which is a function of the sum of the weighted inputs of the processing element. Almost all systems have a single layer of modifiable weights which affect the data transferred from the input to the output of the system.
The evolution of adaptive neural systems took a dramatic step forward with the development of an algorithm called "Back Propagation". This algorithm is fully described in the reference text "Parallel Distributed Processing, the Microstructure of Cognition", Rumelhart, Hinton, & Williams, MIT Press, 1986.
A back propagation system typically consists of three or more layers, each layer consisting of one or more processing elements. In one basic example, the system is comprised of an input layer, at least one hidden layer and an output layer. Each layer contains arbitrary, directed connections from the processing elements in the input layer to the hidden layer, and from the hidden layer to the output layer. There are no connections from processing elements to processing elements in the same layer nor connections from the output to the hidden layer nor from the hidden to the input layer; i.e. there are no cycles (loops) in the connection graph. (There are hypothesized mechanisms for networks with cycles in them, but they are not being scrutinized herein.)
In the Perceptron Model the idea of error was introduced. In a back propagation system, at each output processing element of the network, the error is quite easily realized. The error is typically the difference between an expected value and the output value. This error is used to modify the strength of the connection between a processing element and the output processing element. Ideally, this reduces the error between the expected output and the value output by the processing element in response to the input. The Perceptron Model lacks the ability to allocate an error value to the hidden processing elements and therefore cannot adjust the weights of any connections not coupled to an output processing element. In a system utilizing the Back Propagation algorithm, an error is assigned to the processing elements in hidden layers and the weights of the connections coupled to these hidden processing elements can be adjusted.
An acyclic Neural Network is comprised of only three layers of processing elements: the input, the hidden and the output layers. Each layer consists of one or more processing elements. There may be connections from the input to the hidden layer (input matrix elements), from the hidden to the output layer (output matrix elements), from the input to the output layer (direct matrix elements), and from hidden processing elements to other hidden processing elements (hidden matrix elements). In an acyclic network, a large constraint is placed on hidden matrix elements: if the hidden processing elements are numbered 1 to N, a matrix element may only connect from a lower numbered hidden processing element to a higher numbered processing element; remember that the directionality of a matrix element is important.
This three-layer description actually produces all possible layered environments; it describes an acyclic graph.
An acyclic Back Propagation Neural Network consists of the following.
A neural system utilizing backwards error propagation can be represented by two kinds of elements: processing elements and matrix elements.
A matrix element connects two processing elements and its primary function is to store the connection strength.
A processing element receives a net data and a net error signal, and produces a data and an error signal, which are functions of the two received signals. The functions can be mathematically expressed as:
Output.sub.i = f(NetInput.sub.i) (1)
Error.sub.j = f'(NetInput.sub.j) .times. NetError.sub.j (2)
A matrix element receives a data and an error signal and produces a net data and a net error signal which are a function of the two received signals. The functions can be mathematically expressed as: ##EQU1##
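The two kinds of elements can be sketched in software. In the Python sketch below, the processing element follows equations (1) and (2). The matrix element equations are not reproduced in this text, so the forms used here (weighted data, weighted error, and a weight change proportional to data times error) are assumptions based on the standard Back Propagation formulation, as are the logistic squash function, the names and the learn rate.

```python
import math


def squash(x):
    """Assumed logistic squash function f."""
    return 1.0 / (1.0 + math.exp(-x))


def squash_prime(x):
    """Derivative f' of the assumed logistic squash function."""
    s = squash(x)
    return s * (1.0 - s)


def processing_element(net_input, net_error):
    """Return (output, error) per equations (1) and (2)."""
    output = squash(net_input)
    error = squash_prime(net_input) * net_error
    return output, error


class MatrixElement:
    """Stores one connection strength between two processing elements."""

    def __init__(self, weight, learn_rate=0.1):
        self.weight = weight
        self.learn_rate = learn_rate  # hypothetical constant

    def outputs(self, data, error):
        # Contributions to the shared Net Data and Net Error lines
        # (assumed form: each signal scaled by the stored weight).
        return self.weight * data, self.weight * error

    def update(self, data, error):
        # Assumed weight change: proportional to data times error.
        self.weight += self.learn_rate * data * error
```

For example, `processing_element(0.0, 2.0)` yields an output of 0.5 and an error of 0.5 under the logistic assumption, since f(0) = 0.5 and f'(0) = 0.25.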
The derivation of the discrete time Back Propagation Algorithm is described in Chapter 8 of Parallel Distributed Processing, by Rumelhart et al, and is recounted here.
A weight associated with a connection is referred to as w.sub.ji. The subscripts are used in the form w.sub.to,from. Hence, in the variable w.sub.ji, i refers to the processing element from which data information is being received, and j refers to the processing element to which data information is sent. In the back propagation algorithm, a particular input stimulus vector is referred to collectively by the variable p (for pattern). The elements of a particular output vector and particular target vector are referred to respectively as o.sub.pj and T.sub.pj, where j varies over the output processing elements. The Total Error of a system is represented by the variable E. The portion of the Error contributed by a single input vector (one input pattern) is represented by the variable E.sub.p.
The output of a processing element o.sub.pj, in response to an input pattern p, is calculated by the following equation (which also defines the value net.sub.pj): ##EQU2##
o.sub.pj = f(net.sub.pj) (7)
The technique used by Back Propagation to minimize the Total Error is a variant of Least Mean Squared. The technique states that the total error is the square of the difference between the target vector and the output vector. Furthermore, it is assumed that the total error for the system is the linear summation of the error for any individual pattern. ##EQU3## In a Back Propagation network, error is minimized by adjusting the weights within the network. What is desired is to determine by what amount to adjust a weight so that the error will be reduced. The following equation expresses that desire: ##EQU4## The above expression can be expanded by the chain rule to get: ##EQU5## We can expand the second component, .differential.net.sub.pk /.differential.w.sub.ji, by noting that ##EQU6## to get the following ##EQU7## It is easy to see that except when m=i and k=j, the above is zero. Putting this back into equation (11) we get: ##EQU8## The first portion of the equation, .differential.E.sub.p /.differential.net.sub.pj, by expansion using the chain rule, gives: ##EQU9## and .differential.O.sub.pl /.differential.net.sub.pj can be simplified by recognizing O.sub.pl =f(net.sub.pl). By substituting this in, the expression becomes: ##EQU10## It can now be noted that .differential.f(net.sub.pl)/.differential.net.sub.pj is zero, except when l=j; this gives us finally: ##EQU11## and this can be substituted back in to get: ##EQU12## If we are examining an output node, the value of .differential.E.sub.p /.differential.O.sub.pj is readily apparent from the definition of E.sub.p, as in: ##EQU13## Partial differentiation of this expression with respect to O.sub.pj gives the following expression for output processing elements: ##EQU14## Thus the error equation for an output value is: ##EQU15## The remaining problem is to determine the error value for the hidden processing elements.
To determine this, let the definition of .delta..sub.pj be: ##EQU16## From the expansion from above, we see: ##EQU17## Expanding by the chain rule on o.sub.pj, we get: ##EQU18## Expanding .differential.net.sub.pk /.differential.o.sub.pj, by the definition ##EQU19## we get: ##EQU20## It is easy to see that the above is zero, except when l=j, so that we can state: ##EQU21## Substituting this back into the above equation, we get: ##EQU22## By the definition of .delta..sub.pj, we can then state: ##EQU23## Therefore, .delta..sub.pj for a hidden node can be expressed as: ##EQU24## Combining all the above elements together, we get: ##EQU25## and from this, the Total Error equation can be formulated: ##EQU26## For an output processing element, .delta..sub.pj is:
.delta..sub.pj = f'(net.sub.pj)(T.sub.pj - O.sub.pj) (31)
For a hidden processing element, .delta..sub.pj is: ##EQU27## Now, the change of the weight is set proportional to the above partial differentiation. This is given by the following equation:
.DELTA.w.sub.ji = .mu..delta..sub.pj o.sub.pi (33)
The constant of proportionality (.mu.) is the Learn Rate. Experimentally, useful values of this constant have been found to range from about 0.5 down to very small values, depending on the number of weights, processing elements and patterns which are to be presented.
Note that there is no guarantee that any one particular weight change for a particular pattern will decrease the total error; it is actually quite probable that during one of the patterns the total error will increase. It is only over all the patterns taken together that the total error should decrease.
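Because the equation images are not reproduced in this text, the key results of the derivation are collected below in LaTeX as a reading aid. This is a reconstruction that assumes the standard formulation of Rumelhart et al.; in particular the factor of one half in the pattern error and the sign in the definition of the delta term are assumptions, chosen so that the result agrees with equation (31).

```latex
\begin{align}
net_{pj} &= \sum_i w_{ji}\, o_{pi}, \qquad o_{pj} = f(net_{pj}) \\
E_p &= \tfrac{1}{2} \sum_j \left( T_{pj} - o_{pj} \right)^2, \qquad E = \sum_p E_p \\
\delta_{pj} &\equiv -\frac{\partial E_p}{\partial net_{pj}} \\
\delta_{pj} &= f'(net_{pj}) \left( T_{pj} - o_{pj} \right) && \text{(output element)} \\
\delta_{pj} &= f'(net_{pj}) \sum_k \delta_{pk}\, w_{kj} && \text{(hidden element)} \\
\Delta w_{ji} &= \mu\, \delta_{pj}\, o_{pi}
\end{align}
```

Under these assumptions the hidden-element expression is the backward counterpart of the forward summation: errors from the elements above are weighted by the same connection strengths that carried the data forward.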
In summary, back propagation may be described as follows. On a forward pass of information through the network, all of the processing element outputs are calculated by propagating the information input forward through the network, i.e. from the input layer to each hidden layer in turn and finally to the output layer. On a backward pass, i.e. from the output layer to the hidden layers, each in reverse order from before and finally to the input layer, all the errors are calculated by propagating the associated error backwards through the network. Finally, all the weights are changed according to the errors in the processing elements above and the outputs of the processing elements below.
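The summary above can be sketched as a short program. The following Python sketch trains a network with two inputs, two hidden elements and one output on the exclusive-or patterns, accumulating the weight changes over all patterns before applying them. The logistic squash function, the network size, the learn rate, the bias weights and all names are assumptions made for illustration, not the disclosed circuit.

```python
import math
import random


def squash(x):
    """Assumed logistic squash function."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass; each weight row ends with a bias weight."""
    hidden = [squash(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = squash(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output


def total_error(patterns, w_hidden, w_out):
    return sum(0.5 * (t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in patterns)


def train_step(patterns, w_hidden, w_out, learn_rate):
    """Forward pass, backward pass and weight change over all patterns."""
    grad_hidden = [[0.0] * len(row) for row in w_hidden]
    grad_out = [0.0] * len(w_out)
    for x, t in patterns:
        hidden, o = forward(x, w_hidden, w_out)
        # Backward pass: output delta first, then the hidden deltas.
        delta_out = o * (1.0 - o) * (t - o)
        delta_hidden = [h * (1.0 - h) * delta_out * w_out[j] for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            grad_out[j] += delta_out * v
        for j, d in enumerate(delta_hidden):
            for i, v in enumerate(x + [1.0]):
                grad_hidden[j][i] += d * v
    # Weight change: error of the element above times output of the element below.
    for j in range(len(w_out)):
        w_out[j] += learn_rate * grad_out[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += learn_rate * grad_hidden[j][i]


def demo(epochs=200, learn_rate=0.1, seed=0):
    rng = random.Random(seed)
    patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
    w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    w_out = [rng.uniform(-1, 1) for _ in range(3)]
    before = total_error(patterns, w_hidden, w_out)
    for _ in range(epochs):
        train_step(patterns, w_hidden, w_out, learn_rate)
    return before, total_error(patterns, w_hidden, w_out)
```

Calling `demo()` returns the total error before and after training. The sketch is only meant to show the order of operations; solving exclusive-or to a small error typically needs many more epochs than the default used here.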
The Back Propagation Algorithm as originally developed and as described to this point is a discrete time algorithm, in that there is a forward pass, a backwards pass and modification to the weights, and then a recycling. However, this is not an optimal implementation of the system. There is an implicit assumption of linearity during these discrete time intervals. This is generally not a good assumption.
A better implementation of the system is for the network to run continuously, performing each of the operations simultaneously; this is the basis of what is called herein a continuous time system. The following is a derivation of the continuous time model of the Back Propagation Algorithm, as developed by this inventor. Instead of taking the derivative of the Error with respect to any particular weight, the derivative is taken with respect to time. It is desired to have the error monotonically decreasing, and it is shown that this is done in a straightforward manner. The chain rule can be applied taking the partial differentiation with respect to w.sub.ji. ##EQU28## Repeating equation (30), as derived in the discrete time algorithm: ##EQU29## This can then be substituted into equation (34) to give: ##EQU30## To ensure that the Error is monotonically decreasing, the sign of dE/dt must be negative. One way to guarantee this is to ensure that, for every weight, the sign of dw.sub.ji /dt is the opposite sign of ##EQU31##
By arbitrarily setting ##EQU32## this constraint is satisfied, by giving us: ##EQU33## Since the Error is monotonically decreasing, the system will converge at some final error value. As derived, a system is not guaranteed to converge to a zero Error. Experimental results show that a system generally will converge to zero error if the problem to be solved is representable by the network. It is not known at this time how to determine if a problem to be solved is representable by a particular network. If the system does not converge to a small error, or does not reliably converge, adding a small number of additional processing elements and connections will lead to convergence.
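Since the equation images for the continuous time derivation are not reproduced in this text, the argument can be restated in LaTeX as follows. This is a sketch of the usual gradient-flow reasoning under the assumption that the weight derivative is set proportional to the negative error gradient with a positive constant; the exact form of the original equations may differ.

```latex
\begin{align}
\frac{dE}{dt} &= \sum_{j,i} \frac{\partial E}{\partial w_{ji}}\, \frac{dw_{ji}}{dt} \\
\frac{dw_{ji}}{dt} &= -\mu\, \frac{\partial E}{\partial w_{ji}}, \qquad \mu > 0 \\
\frac{dE}{dt} &= -\mu \sum_{j,i} \left( \frac{\partial E}{\partial w_{ji}} \right)^{2} \le 0
\end{align}
```

Each term of the final sum is a square, so the Error cannot increase, and because the Error is bounded below by zero it must settle at some final value. That value need not be zero, which matches the caveat stated above.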
FIG. 1 is a diagrammatic representation of an acyclic Back Propagation Neural Network, having six processing elements: two input processing elements, two hidden processing elements and two output processing elements. This, of course, is a very small exemplary network, but from the drawings and description of this representative network it can be seen that a similar network can be constructed comprising thousands (or millions) of processing elements.
In FIG. 1, processing elements are denoted generally as 20's with input layer processing elements 20a and 20b, hidden layer processing elements 20c and 20d, and output processing elements 20e and 20f. As shown on processing element 20e, processing elements may have four lines: two output signal lines, Data 11 and Error 14; and two input signal lines, Net Data 13 and Net Error 12. The suffix "e" designates that the lines are associated with processing element 20"e".
Matrix elements are denoted generally as 10's. Matrix elements 10 have data, error, net data and net error lines which are connected to the similarly named lines of the connected processing elements, as diagrammed.
The two elements of the input vector are transferred respectively to the system via input stimulus 1 line 31 and input stimulus 2 line 32, which are connected to the Net Data lines of input processing elements 20a and 20b, respectively. The two elements of the output vector are available on Output Value 1 line 35 and Output Value 2 line 36, respectively, and are generated by the data lines of output processing elements 20e and 20f, respectively. The two elements of the error vector are transferred respectively to the system via Error Stimulus 1 line 33 and Error Stimulus 2 line 34, which are connected to the Net Error lines of output processing elements 20e and 20f, respectively.
FIG. 2 is a schematic block diagram, a matrix representation, of the layout of the system diagrammatically represented in FIG. 1.
Processing elements 20a' through 20f' correlate with processing elements 20a through 20f, respectively, of FIG. 1. Matrix elements 10a' through 10m' correlate with matrix elements 10a through 10m, respectively, of FIG. 1. All signal lines 11a' through 14f' correlate with signal lines 11a through 14f, respectively, of FIG. 1. The input, output and error lines 31' through 36' correlate with the input, output and error lines 31 through 36 of FIG. 1.
An input stimulus vector, comprised of input stimuli Input 1 on line 31' and Input 2 on line 32', is connected to processing elements 20a' and 20b', respectively, as is done in FIG. 1. The output of processing element 20a' is connected to matrix elements 10a' through 10d' via Data line 11a'. Similarly, the output of processing element 20b' is connected to matrix elements 10e' through 10h' via Data line 11b'. Matrix Elements 10a' and 10e' sum their Net Data outputs on Net Data line 13c'. This summation on 13c' is provided as the Net Data input to processing element 20c'. Processing Element 20c' provides its Data output signal on Data line 11c', to the Data input line of Matrix Elements 10i' through 10k'. Matrix Elements 10b', 10f' and 10i' sum their Net Data output signals on Net Data line 13d', which is provided as the Net Data input signal to Processing Element 20d'. Processing Element 20d' provides its Data output signal on Data line 11d', to the Data input line of Matrix Elements 10l' and 10m'. Matrix Elements 10c', 10g', 10j' and 10l' sum their Net Data output signals on Net Data line 13e', which is provided as the Net Data input signal to Processing Element 20e'. Matrix Elements 10d', 10h', 10k' and 10m' sum their Net Data output signals on Net Data line 13f', which is provided as the Net Data input signal to Processing Element 20f'.
Processing elements 20e' and 20f' provide output signals Output 1 and Output 2, respectively, on lines 35' and 36', respectively. These outputs form the output vector.
An error stimulus vector, composed of error stimuli Error 1 on line 33' and Error 2 on line 34', is received by the Net Error lines of Processing Elements 20e' and 20f', respectively. The Error output signal of Processing Element 20f' is provided on Error line 14f' to Matrix Elements 10m', 10k', 10h' and 10d'. The Error output signal of Processing Element 20e' is provided on Error line 14e' to Matrix Elements 10l', 10j', 10g' and 10c'. The Net Error outputs of Matrix Elements 10l' and 10m' are summed on Net Error line 12d' and are provided to the Net Error input line of Processing Element 20d'. The Error output signal of Processing Element 20d' is provided on Error line 14d' to Matrix Elements 10i', 10f' and 10b'. The Net Error outputs of Matrix Elements 10i' through 10k' are summed on Net Error line 12c' and are provided to the Net Error input line of Processing Element 20c'. The Error output signal of Processing Element 20c' is provided on Error line 14c' to Matrix Elements 10e' and 10a'. The Net Error outputs of Matrix Elements 10e' through 10h' are summed on Net Error line 12b' and are provided to the Net Error input line of Processing Element 20b'. The Net Error outputs of Matrix Elements 10a' through 10d' are summed on Net Error line 12a' and are provided to the Net Error input line of Processing Element 20a'.
In the example the Error output signals of Processing Elements 20a' and 20b' are not used; often this will be the case, and as such a minimal system does not include the functional parts necessary to provide the Error output signal of input processing elements, such as 20a' and 20b', nor the functional parts to provide a Net Error output for the matrix elements connected to the input processing elements. The example is provided with the Error output signals of Processing Elements 20a' and 20b' and the Net Error output signals for Matrix Elements 10a' through 10h' for clarity and uniformity. A system can be built in this manner with no loss of generality.
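The connectivity described for FIG. 2 can be written down as a small table and checked mechanically. The Python sketch below lists each matrix element with its source and destination processing element, as read from the description above (the prime marks are dropped for brevity), and verifies the acyclic ordering constraint. The data structure and function names are illustrative assumptions.

```python
# Processing elements of FIG. 2, in the order input, hidden, output.
ORDER = ["20a", "20b", "20c", "20d", "20e", "20f"]

# Matrix element -> (source processing element, destination processing element).
MATRIX_ELEMENTS = {
    "10a": ("20a", "20c"), "10b": ("20a", "20d"),
    "10c": ("20a", "20e"), "10d": ("20a", "20f"),
    "10e": ("20b", "20c"), "10f": ("20b", "20d"),
    "10g": ("20b", "20e"), "10h": ("20b", "20f"),
    "10i": ("20c", "20d"), "10j": ("20c", "20e"), "10k": ("20c", "20f"),
    "10l": ("20d", "20e"), "10m": ("20d", "20f"),
}


def is_acyclic(elements, order):
    """True if every connection runs from an earlier to a later element."""
    rank = {name: i for i, name in enumerate(order)}
    return all(rank[src] < rank[dst] for src, dst in elements.values())


def fan_in(elements, target):
    """Matrix elements whose Net Data outputs are summed at `target`."""
    return sorted(name for name, (_, dst) in elements.items() if dst == target)


def fan_out(elements, source):
    """Matrix elements driven by the Data line of `source`."""
    return sorted(name for name, (src, _) in elements.items() if src == source)


if __name__ == "__main__":
    print(is_acyclic(MATRIX_ELEMENTS, ORDER))
    print(fan_in(MATRIX_ELEMENTS, "20e"))
```

The same table read in the reverse direction gives the error paths: the matrix elements in the fan-in of a processing element are the ones that receive its Error output.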
Most Neural Networks sum the data inputs on a line and then provide a "squash" of the resultant summation, i.e., a non-linear function which reduces the range of the summation from the possible minus infinity to positive infinity range of the net input to some smaller dynamic range, such as from zero to one.
FIG. 3a illustrates a typical squash function used often in Back Propagation. Its mathematical formula is: ##EQU34##
In a Back Propagation Neural Network the derivative of the squash function that is used in the forward propagation of the data is required to modify the backwards propagation of error.
FIG. 3b illustrates the derivative of the function illustrated in FIG. 3a. Its mathematical formula is: ##EQU35## Producing hardware implementations of these functions with the required accuracy is difficult.
Prior art neural networks using the Back Propagation Algorithm have been frequently implemented on computer systems and are now being designed and built as fully analog VLSI circuits. These fully analog VLSI instantiations suffer from the design limitations of analog circuitry, in particular because of the offset errors of the analog circuits. While it has been shown that small networks of these fully analog circuits can be built, it has not been demonstrated that larger networks can be built utilizing fully analog mechanisms, and it is believed by this inventor that without significant circuitry or significant circuitry advances to alleviate these offset errors, a fully analog circuit will not be able to scale up.
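The formulas for the squash function and its derivative are not reproduced in this text. A common choice in Back Propagation, and the one assumed in the Python sketch below, is the logistic function, whose derivative can be written in terms of the function itself. The function names are illustrative, and the actual formulas of the figures may differ.

```python
import math


def squash(net):
    """Assumed logistic squash: maps any real net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


def squash_derivative(net):
    """Derivative of the assumed logistic squash: f(net) * (1 - f(net))."""
    s = squash(net)
    return s * (1.0 - s)


if __name__ == "__main__":
    for net in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(net, squash(net), squash_derivative(net))
```

Under this assumption the derivative peaks at a net input of zero and falls toward zero on both sides, so errors are passed back most strongly through elements that are operating near the middle of their range.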
Therefore there exists a need for: a new and improved adaptive neural network circuit design which will enable the system to overcome the difficulties associated with analog multiplier offset errors, and
a new and improved method for the calculation of a "squash" function and its derivative, and
a new and improved method for the transmission of information along neural network pathways which enhances the networks' immunity to random noise interference.
The invention described herein is a family of circuits which are an instantiation of The Spike Model, so named after the main mechanism whereby information is transferred. The Spike Model is a mathematical model, derived by the inventor.
It is generally believed that a neuron transmits (at least part of) the forward flow of data information by representing the information as the frequency of the firing of spikes along its axon. Most abstract neural models represent the firing frequency as a real number, rather than attempting to simulate each and every individual spike. Most neural network models sum the inputs and then provide a "squash" of the resultant sum when processing the incoming information for a single processing element. This is equivalent to summing the input frequencies, then squashing the resultant summation.
The invention utilizes a spike train as the primary method of forward data transmission and derives several major advantages from it. In this model, instead of summing the inputs, the inputs are (in essence) logically OR'ed together. If two spikes occur simultaneously on the inputs, only a single spike gets through. The resultant "squash" function under this mechanism is (the assumptions and derivations are detailed later in this disclosure):
Q.sup.+ = 1 - e.sup.-net.spsp.+ (41)
where Q.sup.+ is the probability that any output of a unit is a one and net.sup.+ is (essentially) the total number of spikes being generated by the units.
This function is approximately the upper "half" of the stereotypical sigmoid "squash" functions currently used in most Back Propagation networks, where net.sup.+ is the weighted number of excitatory spikes.
Since the total number of spikes is exactly the summation of the input frequencies, OR'ing the inputs yields not a plain summation of frequencies but a "squash" of that summation, at no additional cost.
One of the next major requirements of the Back Propagation Algorithm is the backwards flow of error. Specifically, there is a linear summation of the backwards flowing error and a multiplication of it by the net input (number of spikes) run through the derivative of the squash function, i.e.:
backwards error .times. squash'(number of spikes) (42)
By examining the total of all the times between the pulses, one can find an interesting quantity. The amount of this time corresponds to:
OffTime .ident. 1 - OnTime (43)
or
OffTime = e.sup.-(number of spikes) (44)
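Equation (41) can be checked against a simple probabilistic picture. The Python sketch below assumes that the inputs are many weak, statistically independent spike sources whose per-time-slot spike probabilities add up to the net input; the detailed assumptions of the disclosure are given later and may differ. Function names and the number of inputs are illustrative.

```python
import math


def or_probability(spike_probs):
    """Probability that at least one independent input spikes in a time slot."""
    none = 1.0
    for p in spike_probs:
        none *= (1.0 - p)
    return 1.0 - none


def spike_squash(net):
    """Equation (41): Q+ = 1 - exp(-net+)."""
    return 1.0 - math.exp(-net)


def compare(net, n_inputs=10000):
    """Spread a total spike count `net` over many weak independent inputs."""
    exact = or_probability([net / n_inputs] * n_inputs)
    return exact, spike_squash(net)


if __name__ == "__main__":
    for net in (0.1, 1.0, 3.0):
        print(net, compare(net))
```

With many weak inputs the exact OR probability and the exponential form agree closely, which illustrates how coincident spikes, which pass through the OR as a single spike, produce the saturation of the squash function.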
This is exactly the derivative of the squash function (from equation 41). Therefore, if the error is propagated backwards when no spikes are present (or equivalently, only examined when there are no spikes present), the time averaged multiplication is exactly a multiplication of the error by the derivative of the squash function with respect to the net input--precisely what is desired!
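The relation between the off time and the derivative of the squash function can be confirmed numerically. The Python sketch below takes the on time to be the squash function of equation (41) and compares a finite-difference derivative with the off time of equations (43) and (44). The function names and the step size are illustrative choices.

```python
import math


def on_time(net):
    """Fraction of time a spike is present, per equation (41)."""
    return 1.0 - math.exp(-net)


def off_time(net):
    """Fraction of time no spike is present, per equation (43)."""
    return 1.0 - on_time(net)


def numerical_derivative(func, x, h=1e-6):
    """Central finite-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for net in (0.0, 0.5, 2.0):
        print(net, off_time(net), numerical_derivative(on_time, net))
```

The two columns printed agree to within numerical precision, which is the property that lets the circuit gate the backward error by the absence of spikes instead of computing a derivative explicitly.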
The third component of Back Propagation is the way the weights are updated. This is (essentially): ##EQU36##
If the error in a node above is represented in a spike method, the calculation for the change of the weights is simple. It is (essentially) the logical AND of the forward flowing spiked data signal and the spiked error signal.
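The claim that a logical AND of two spike trains performs the multiplication needed for the weight change can be illustrated by simulation. The Python sketch below assumes that the data and error spike trains are statistically independent, with a fixed spike probability per time slot, and it ignores the sign of the error; these are simplifying assumptions for illustration, and all names and rates are hypothetical.

```python
import random


def spike_train(rate, length, rng):
    """Random train of 0/1 spikes with the given per-slot spike probability."""
    return [1 if rng.random() < rate else 0 for _ in range(length)]


def and_average(train_a, train_b):
    """Time average of the logical AND of two spike trains."""
    return sum(a & b for a, b in zip(train_a, train_b)) / len(train_a)


def demo(data_rate=0.3, error_rate=0.5, length=100000, seed=0):
    rng = random.Random(seed)
    data = spike_train(data_rate, length, rng)
    error = spike_train(error_rate, length, rng)
    return and_average(data, error), data_rate * error_rate


if __name__ == "__main__":
    measured, expected = demo()
    print(measured, expected)
```

Over a long enough window the averaged AND approaches the product of the two rates, so a weight that integrates the AND output changes in proportion to data times error.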
If all signals in the network are represented as spikes, many of the noise problems associated with pure analog signals are alleviated.
This disclosure derives various mathematical components of the model, and details various circuits which can be utilized in the instantiation of this model. It should be noted that this model retains essentially all the speed of a full analog implementation, in that all circuits can still be implemented in the pseudo-neural network. The only speed lost is that the frequencies of the spikes, which remain the primary information carrying portion of the signal, can only be actually detected with substantially more than two spikes present. It appears that this time will need to be a factor of ten longer than theoretically anticipated with a pure analog instantiation.