An artificial neural network is a type of information processing system whose architecture is inspired by the biologically-evolved neural systems found in animals. The interest of computer scientists in the study of biological neural systems, like the human brain, began when it was discerned that traditional programming and engineering techniques could not create pattern recognition systems to mimic functions that people do so easily every day, such as understanding language, recognizing a face in a crowd or associating an abstract concept with an event.
One goal that is central to the study of biological neural systems is to create an artificial neural system that can be dynamically modified in response to external input. The hope is to create systems that can "learn". The present invention, which will be described below, presents a method and apparatus to speed the "learning" process for specific types of neural networks called multilayered, feed-forward neural networks, which use logistic activation functions.
There are various architectures and approaches to "learning" in neural network design which are well-known in the art and generally described in textbooks, such as Rumelhart, et al., Parallel Distributed Processing, MIT Press 1986. However, a brief description of neural network architecture and the various approaches to "learning" provides important background for the present invention.
A neural network comprises a highly interconnected set of simple processing units. The network is designed to accept a set of inputs, called an input pattern, process the input pattern data, and return a set of outputs called an output pattern. The theory is that a pattern can be recognized by mapping the input through a large set of interconnected, simple processing units. Although dependent on other units in the network for input, each unit in a neural network can be configured to process its input independently of, and in parallel with, the other units. In this sense, a neural network can be thought of as one form of a parallel distributed processing system.
The architecture of each simple processing unit is roughly based on the structure of biological neuron cells found in animals. The basic processing unit of an artificial network is called an artificial neuron unit (hereinafter used interchangeably with the term "unit") and is designed to replicate the basic anatomy of a biological neuron's dendrites, cell body, axon and synapse. Generally, an artificial neuron unit is configured to receive a large number of inputs, either from data input sources or from other artificial neuron units to replicate the way a biological neuron receives input signals from a plurality of attached dendrites. An artificial neuron unit mimics the activity of the cell body of the biological neuron through the use of threshold and output functions. A threshold function accepts all input and performs a function to determine whether the sum of the input plus any previously existing activation input surpasses a threshold value. If so, the neuron will process the input according to an output function and send an output signal to the plurality of other similarly configured neurons that are connected to it. Generally, the threshold function and output functions are combined into one function, collectively called an activation function, which accepts all inputs and maps them to an output value in one step.
The connections between the individual processing units in an artificial neural network are also modeled after biological processes. Each input to an artificial neuron unit is weighted by multiplying it by a weight value in a process that is analogous to the biological synapse function. In biological systems, a synapse acts as a connector between one neuron and another, generally between the axon (output) end of one neuron and the dendrite (input) end of another cell. Synaptic junctions have the ability to enhance or inhibit (i.e., weight) the output of one neuron as it is input to another. Artificial neural networks model this synaptic function by weighting the inputs to each artificial neuron.
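By way of illustration only, the weighted-input and threshold behavior described above can be sketched as follows. The function name `threshold_unit`, the weight values, and the threshold value are hypothetical and chosen only for this example; a negative weight stands in for an inhibitory connection.

```python
def threshold_unit(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    # Each input is multiplied by the weight of its connection (the "synapse").
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0


# Two excitatory connections and one inhibitory (negative-weight) connection.
weights = [0.6, 0.6, -1.0]

print(threshold_unit([1, 1, 0], weights, threshold=1.0))  # prints 1: unit fires
print(threshold_unit([1, 1, 1], weights, threshold=1.0))  # prints 0: inhibited
```

In this sketch the threshold function and the output function are already merged into a single step function, in the manner of the combined activation function described above.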
The individual units can be organized into an artificial neural network using one of many different architectures. A "single level" architecture is defined to have no hierarchical structure - any unit can communicate with any other unit and units can even feed back inputs to themselves. Of general relevance to the present invention is another type of architecture called a layered hierarchical architecture. In such an architecture, the units of the artificial neural network are grouped into layers and the network of interconnections is dictated by the layering scheme. Networks are commonly configured into two-layer and multilayer schemes. A two-layer scheme comprises an input layer and an output layer, each layer comprising neural network units. This architecture is commonly referred to as a "one-step" system. A multilayer neural network comprises an input layer of units and an output layer of units connected to one or more levels of middle layers, comprising units, that are often called "hidden" layers.
A particular type of multilayer neural network is called a feed-forward neural network, which facilitates "bottom up" processing. As stated in Rumelhart et al., the primary characteristic of a feed-forward network is that the units at any layer may not affect the activity of units at any layer "lower" than it. Thus, as inputs are first fed to units at the lower layers, processed, and input to succeeding layers, processing in those networks is performed from the "bottom up". Additionally, in a pure feed-forward neural network, there are no units that accept inputs from more than one layer. The learning method and apparatus of the present invention, which will be described below, are particularly suited for use in multilayered, feed-forward neural networks.
Much of the study of "learning" in neural networks has focused on the use of multilayered architectures, because of the inherent use limitations found in two-layered architectures. Studies have shown that a network without internal representations (i.e., without hidden units) is unable to perform mappings where an input pattern of one configuration is mapped to an output pattern that is dissimilar. However, studies have also shown that if there is a large layer of hidden units, properly connected, any input pattern can always be mapped to any output pattern, even where the input and the output patterns are dissimilar. In fact, multilayered, feed-forward neural network architectures are among the most common in the neural network field. Thus, the applicability of the improved learning method of the present invention to multilayered, feed-forward neural network architectures is of great relevance to the field.
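By way of illustration only, the limitation described above can be sketched with the exclusive-or (XOR) mapping, which is a standard textbook example and is not named in the text above. The sketch assumes simple threshold units with hand-chosen, hypothetical weights. It first searches a coarse grid of weights and thresholds for a single output unit connected directly to the two inputs and finds no setting that computes XOR, and then shows a pure feed-forward network with two hidden units that does.

```python
def step(net, threshold):
    """Threshold activation: 1 if the net input exceeds the threshold, else 0."""
    return 1 if net > threshold else 0


# XOR: similar inputs (0,1) and (1,1) must map to dissimilar outputs.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Two-layer case: one output unit fed directly by the inputs.
# Search a coarse grid of weights and thresholds (illustrative, not a proof).
grid = [k * 0.5 for k in range(-4, 5)]
single_unit_hits = [
    (w1, w2, th)
    for w1 in grid for w2 in grid for th in grid
    if all(step(w1 * a + w2 * b, th) == t for (a, b), t in patterns)
]


def xor_net(x1, x2):
    """Multilayer case: two hidden units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2, 0.5)    # active if at least one input is active
    h_and = step(x1 + x2, 1.5)   # active only if both inputs are active
    return step(h_or - h_and, 0.5)  # OR but not AND


print(len(single_unit_hits))                      # prints 0
print([xor_net(a, b) for (a, b), _ in patterns])  # prints [0, 1, 1, 0]
```

The hidden units supply the internal representation (here, "at least one input active" and "both inputs active") that the two-layer arrangement lacks.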
Neural networks are not programmed to recognize patterns - they "learn." Learning here is defined as any self-directed change in a knowledge structure that improves performance. Neural network systems do not access a set of expert rules that are stored in a knowledge base, as expert systems do. Moreover, the previously used input patterns are not maintained or saved in neural networks for later matching against new input. Rather, what is stored are connection strengths (i.e., the weight values) between the artificial neuron units. The weight value set, comprising a set of values associated with each connection in the neural network, is used to map an input pattern to an output pattern. In contrast to the expert rules explicitly stored in expert system architectures, the set of weight values used between unit connections in a neural network is the knowledge structure. "Learning" in a neural network means modifying the weight values associated with the interconnecting paths of the network so that an input pattern maps to a pre-determined or "desired" output pattern. In the study of neural network behavior, learning models have evolved that consist of rules and procedures to adjust the synaptic weights assigned to each input in response to a set of "learning" or "teaching" inputs. Most neural network systems provide learning procedures that modify only the weights--there are generally no rules to modify the activation function or to change the connections between units. Thus, if an artificial neural network has any ability to alter its response to an input stimulus (i.e., "learn" as it has been defined), it can only do so by altering its set of "synaptic" weights.
Of general relevance to the present invention is a group of learning techniques classified as pattern association. The goal of pattern association systems is to create a map between an input pattern defined over one subset of the units (i.e., the input layer) and an output pattern as it is defined over a second set of units (i.e., the output layer). The process attempts to specify a set of connection weights so that whenever a particular input pattern reappears on the input layer, the associated output pattern will appear on the second set. Generally in pattern association systems, there is a "teaching" or "learning" phase of operation during which an input pattern called a teaching pattern is input to the neural network. The teaching input comprises a set of known inputs and has associated with it a set of known or "desired" outputs. If, during a training phase, the actual output pattern does not match the desired output pattern, a learning rule is invoked by the neural network system to adjust the weight value associated with each connection of the network so that the training input pattern will map to the desired output pattern.
Virtually all of the currently used learning procedures for weight adjustment have been derived from the learning rule of psychologist D. O. Hebb, which states that if a unit, $u_j$, receives an input from another unit, $u_i$, and both are highly active, the weight, $w_{ji}$, in the connection from $u_i$ to $u_j$ should be strengthened. D. O. Hebb, The Organization of Behavior, (New York, Wiley, 1949).
The Hebbian learning rule has been translated into a mathematical formula: $\Delta w_{ji} = g(a_j(t), t_j(t))\, h(o_i(t), w_{ji})$ (1)
The equation states that the change in the weight connection $w_{ji}$ from unit $u_i$ to $u_j$ is the product of two functions: $g()$, with arguments comprising the activation function of $u_j$, $a_j(t)$, and the teaching input to unit $u_j$, $t_j(t)$, multiplied by the result of another function, $h()$, whose arguments comprise the output of $u_i$ from the training example, $o_i(t)$, and the weight associated with the connection between unit $u_i$ and $u_j$, $w_{ji}$.
This general statement of the Hebbian learning rule is implemented differently in different kinds of neural network systems, depending on the type of neural network architecture and the different variations of the Hebbian learning rule chosen. In one common variation of the rule, it has been observed that: $h(o_i(t), w_{ji}) = i_i$ (2)
and $g(a_j(t), t_j(t)) = \eta\,(t_j(t) - a_j(t))$ (3)
where $i_i$ equals the ith element of the output of unit $u_i$ (or the input to $u_j$), and $\eta$ represents a constant of proportionality. Thus, for any input pattern p the rule can be written: $\Delta_p w_{ji} = \eta\,(t_{pj} - o_{pj})\, i_{pi} = \eta\, \delta_{pj}\, i_{pi}$ (4)
where $t_{pj}$ is the desired output (i.e., the teaching pattern) for the jth element of the output pattern for p, $o_{pj}$ is the jth element of the actual output pattern produced by the input pattern p, and $i_{pi}$ is the value of the ith element of the input pattern. $\delta_{pj}$ is the "delta" value and is equivalent to $t_{pj} - o_{pj}$; this difference represents the desired output pattern value for the jth output unit minus the actual output value for the jth component of the output pattern. $\Delta_p w_{ji}$ is the change to be made to the weight of the connection between the ith and jth unit following the presentation of pattern p.
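By way of illustration only, the delta rule (4) can be sketched for a layer of linear output units as follows. The function name `delta_rule_step`, the learning constant of 0.5, and the sample pattern are hypothetical; the assumption that the output units are linear is made for this example only.

```python
def delta_rule_step(weights, inputs, targets, eta):
    """Return new weights after one presentation of pattern p under rule (4).

    weights[j][i] is w_ji, the weight on the connection from input i to unit j.
    """
    # Actual output o_pj of each (linear) output unit.
    outputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    # delta_pj = t_pj - o_pj
    deltas = [t - o for t, o in zip(targets, outputs)]
    # w_ji + eta * delta_pj * i_pi
    return [[w + eta * d * x for w, x in zip(row, inputs)]
            for row, d in zip(weights, deltas)]


weights = [[0.0, 0.0]]               # one output unit, two inputs
inputs, targets = [1.0, 0.5], [1.0]  # teaching pattern and desired output

first = delta_rule_step(weights, inputs, targets, eta=0.5)
print(first)  # prints [[0.5, 0.25]]

# Repeated presentation of the same pattern drives the output to the target.
for _ in range(20):
    weights = delta_rule_step(weights, inputs, targets, eta=0.5)
final_output = sum(w * x for w, x in zip(weights[0], inputs))
```

Each weight moves in proportion to both the error at its output unit and the input it carried, so connections from inactive inputs are left unchanged.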
The solution for the values of $\Delta_p w_{ji}$ has been shown to be the inverse of a common type of optimization problem known as a "hill climbing" problem. A "hill climbing" problem can be characterized as the problem of finding the most efficient way to reach the "peak of a hill", which in mathematical terms represents a maximum value of a function. The inverse problem is to descend the hill and find a minimum value for that function. One common method for finding the $\Delta_p w_{ji}$ values is to show that the partial derivative of the error measure with respect to each weight is proportional to the weight change dictated by the delta rule (4), multiplied by a negative constant of proportionality, and to solve that analogous derivative problem. The solution for the derivative problem corresponds to performing steepest descent on the surface of a terrain in a weight space (i.e., descending the hill), where the height at any point is equal to the error measure corresponding to the weights. Thus, the weight adjustment problem can be thought of as an attempt to find the minimum error E in the equation: $E = F(w_0, \ldots, w_n)$ (5)
for a given input pattern. The function F can be graphed to show a terrain of weight space points mapping the E value to the corresponding set of weights for a given input pattern in the neural network - this is the "hill". E represents the sum of the squared differences between the values of the actual output pattern and the desired output pattern. FIG. 1 graphs an example weight space for a neural network having only two weights, $w_1$ and $w_2$. To find the lowest value of E in the graph, the process is to look for the lowest point in the weight space terrain (i.e., the bottom of the hill). The negative of the gradient at any given point on the weight space terrain gives the direction of steepest descent. The gradient descent method of solving the inverse hill climbing problem is to find that steepest descending slope and follow it to a low point of the terrain. Because the gradient descent method provides a minimum solution of the derivative problem, the method also provides the proper weight change for the weights in a neural network. The derivative described above is proportional to the weight change dictated by the delta rule (4). Thus, where: $E_p = \frac{1}{2} \sum_j (t_{pj} - o_{pj})^2$ (6)
it can be shown that: ##EQU1##
This second statement is proportional to the equation for $\Delta_p w_{ji}$, as stated by the delta rule (4) above.
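By way of illustration only, the stated proportionality can be checked numerically. The sketch below assumes linear output units, and further assumes that the relation whose equation is not reproduced in this text takes the standard form given in Rumelhart et al., namely $-\partial E_p / \partial w_{ji} = \delta_{pj}\, i_{pi}$. The function names and sample values are hypothetical.

```python
def pattern_error(weights, inputs, targets):
    """E_p = 1/2 * sum_j (t_pj - o_pj)^2 for linear output units, as in (6)."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))


def numeric_neg_gradient(weights, inputs, targets, j, i, h=1e-6):
    """Central-difference estimate of -dE_p/dw_ji."""
    up = [row[:] for row in weights]
    down = [row[:] for row in weights]
    up[j][i] += h
    down[j][i] -= h
    return -(pattern_error(up, inputs, targets)
             - pattern_error(down, inputs, targets)) / (2 * h)


weights = [[0.2, -0.4], [0.7, 0.1]]  # w_ji for two output units, two inputs
inputs = [1.0, 2.0]
targets = [0.5, -1.0]

outputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
max_gap = 0.0
for j in range(2):
    for i in range(2):
        analytic = (targets[j] - outputs[j]) * inputs[i]  # delta_pj * i_pi
        numeric = numeric_neg_gradient(weights, inputs, targets, j, i)
        max_gap = max(max_gap, abs(analytic - numeric))
print(max_gap < 1e-6)  # prints True
```

Multiplying the checked quantity by the constant $\eta$ gives the weight change of the delta rule (4), which is why following the delta rule amounts to gradient descent on $E_p$ under these assumptions.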
Of general relevance to the learning method and apparatus presented by the present invention is the fact that the difficulty in solving the gradient descent problems varies between neural networks and depends upon the type of network architecture and activation function used. For example, where the neural network is arranged in the form of a two-layer network and the activation function for the units is linear (i.e., one that is capable of being represented by a straight line on a graph), the surface of the weight space terrain will be parabolic. The solution to the gradient descent problem for a parabolic surface terrain is easily found and gradient descent techniques are guaranteed to find the best set of weights for a given training input set, because it is easy to find the minimum for a parabolic surface.
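By way of illustration only, the parabolic case can be sketched with a single linear unit having one weight, so that the error summed over the training set is a parabola in that weight. The training pairs, the step size of 0.05, and the variable names are hypothetical. The sketch compares the weight reached by gradient descent with the minimum of the parabola computed in closed form.

```python
# Training set of (input, desired output) pairs for one linear unit, o = w * i.
patterns = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]


def total_error(w):
    """E(w) = 1/2 * sum_p (t_p - w * i_p)^2, a parabola in w."""
    return 0.5 * sum((t - w * i) ** 2 for i, t in patterns)


# Closed-form minimum of the parabola: dE/dw = 0 gives w* = sum(i*t) / sum(i*i).
w_star = sum(i * t for i, t in patterns) / sum(i * i for i, _ in patterns)

# Gradient descent with a fixed step, starting from w = 0.
w, eta = 0.0, 0.05
for _ in range(100):
    neg_gradient = sum((t - w * i) * i for i, t in patterns)  # -dE/dw
    w += eta * neg_gradient

print(abs(w - w_star) < 1e-9)  # prints True
```

Because the parabola has a single minimum, the descent cannot be trapped elsewhere, provided the step is small enough for the iteration to remain stable.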
When the architecture of the neural network is multilayered (i.e., including layers of hidden units), on the other hand, the terrain of the error space is not consistently parabolic. It has been shown that the graph of the weight space terrain for a multilayered network usually has a complex terrain surface with many minima. The lowest minimum values on the terrain represent solutions in which the neural network reaches a minimum error state at a value called a global minimum. The less deep minimum values are called local minima. FIG. 2 depicts a two-dimensional view of a weight space terrain with global and local minima. In such cases, gradient descent techniques may not find the best solution if the slope of steepest descent leads only to a local minimum. However, it has been shown that in most cases it is not critical that a learning method using gradient descent techniques find a global minimum, so long as some minimum value is reached. As will be described below, however, it has been a particular problem to find any minimum value.
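By way of illustration only, the dependence of gradient descent on its starting point can be sketched with a one-dimensional curve having two minima. The curve $E(w) = w^4 - 3w^2 + w$ is a hypothetical stand-in chosen for this example and is not the error surface of any actual network; the step size and starting points are likewise assumptions.

```python
def error(w):
    """Hypothetical error curve with one global and one local minimum."""
    return w ** 4 - 3 * w ** 2 + w


def slope(w):
    """Derivative dE/dw of the curve above."""
    return 4 * w ** 3 - 6 * w + 1


def descend(w, eta=0.01, steps=5000):
    """Follow the negative slope with a fixed step size."""
    for _ in range(steps):
        w -= eta * slope(w)
    return w


left = descend(-2.0)   # settles in the deeper (global) minimum, w < 0
right = descend(2.0)   # settles in the shallower (local) minimum, w > 0

print(left < 0 < right)             # prints True
print(error(left) < error(right))   # prints True
```

Both runs stop at a point of zero slope, but only the run started on the left reaches the lowest error; the descent has no information that would let the other run leave its local minimum.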
Additionally, in applying the delta rule to a multilayered network, it is often difficult to find the gradient that will enable the descent technique to operate when using certain activation functions. It has been shown that where the activation function used in the units of a multilayered neural network is a semilinear function, it is possible to find the partial derivative for the gradient descent method and to solve for $w_{ji}$ according to a form of the delta rule. A semilinear function is defined as one in which the output of a unit is a nondecreasing and differentiable function of the net total input to that unit (i.e., the weighted sum of the outputs of the units feeding it). One commonly used semilinear activation function is the logistic function: ##EQU2## where $\theta_j$ is a bias that performs a function similar to the threshold function described above. A logistic function is defined to be one divided by the sum of one and the natural number e raised to a negative power. The use of this activation function in a multilayered, feed-forward neural network is of general relevance because the present invention is particularly suited for such network architectures.
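By way of illustration only, the logistic activation function can be sketched as follows. The sketch assumes that the logistic equation referenced above takes the standard form from Rumelhart et al., $o_{pj} = 1 / (1 + e^{-(\sum_i w_{ji} o_{pi} + \theta_j)})$. The function names are hypothetical, and the two-branch form of `logistic` is an implementation choice made here to avoid floating-point overflow for large negative net inputs.

```python
import math


def logistic(net):
    """One divided by (one plus e raised to the negative net input)."""
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    z = math.exp(net)  # algebraically identical form, safe for very negative net
    return z / (1.0 + z)


def logistic_derivative(net):
    """Derivative of the logistic function, expressed through its own output."""
    o = logistic(net)
    return o * (1.0 - o)


def unit_output(inputs, weights, bias):
    """Output of one unit: logistic of the weighted input sum plus the bias."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(net)


print(logistic(0.0))             # prints 0.5
print(logistic_derivative(0.0))  # prints 0.25
```

The function is nondecreasing and differentiable everywhere, which is the semilinear property that allows the partial derivatives for gradient descent to be computed. Its derivative is largest at a net input of zero and approaches zero for large positive or negative net inputs.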
Presently, the commonly available techniques for performing gradient descent-type weight (i.e., "learning") adjustments have been limited to forms of a technique called backpropagation. Backpropagation is the process of taking, for a given training input pattern, the collective error (found by comparing the actual output pattern with a desired output pattern), propagating that error back through the neural network by apportioning a part of it to each unit, and adjusting the weight value of each connection by the "delta" values found through application of the generalized form of the delta rule (7) mentioned above, i.e.: ##EQU3##
The backpropagation technique has two phases. During an input phase, an input pattern is presented and propagated in a forward pass through the network to compute the output value for each unit in the network. An actual output pattern is then compared to a predetermined desired output pattern, resulting in a delta ($\delta$) error term for each output unit. For output units, the error is computed by the equation: $\delta_{pj} = (t_{pj} - o_{pj})\, f'_j(\mathrm{net}_{pj})$ (10)
where $f'_j(\mathrm{net}_{pj})$ is the derivative of the activation function of unit $u_j$, evaluated at that unit's net input $\mathrm{net}_{pj}$ for pattern p.
The second phase consists of a backward pass through the network during which the delta error terms are passed to each unit in the network and a computation is performed to estimate the portion of the total error attributable to a particular unit. For units in the hidden layers, it has been shown that the calculation of the delta value is: ##EQU4##
After computing the delta values for the output units as indicated above, the backpropagation technique then feeds the computed error terms back to all of the units that feed the output layer, computing a delta value for each of those units, using the formula above. This propagates the errors back one layer, and the same process is repeated for each layer with new delta values at each unit used to adjust the connection weights.
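By way of illustration only, the two phases described above can be sketched for a network with one hidden layer of logistic units. The equations for the weight change and for the hidden-unit delta values are not reproduced in this text, so the sketch assumes the standard forms from Rumelhart et al.: $\Delta_p w_{ji} = \eta\, \delta_{pj}\, o_{pi}$ for the weight change, and $\delta_{pj} = f'_j(\mathrm{net}_{pj}) \sum_k \delta_{pk} w_{kj}$ for hidden units. All function names, the network size, the random seed, and the learning constant are hypothetical. Each bias is treated as a weight on a constant input of 1.0.

```python
import math
import random


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w_hidden, w_out):
    """Forward pass; a constant 1.0 is appended to each layer's input as a bias."""
    h = [logistic(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    o = [logistic(sum(w * v for w, v in zip(row, h + [1.0]))) for row in w_out]
    return h, o


def backprop_step(x, t, w_hidden, w_out, eta):
    """One forward pass and one backward pass; weights are updated in place."""
    h, o = forward(x, w_hidden, w_out)
    # Output deltas, equation (10); for the logistic function f' = o * (1 - o).
    d_out = [(tj - oj) * oj * (1.0 - oj) for tj, oj in zip(t, o)]
    # Hidden deltas (assumed form), computed before the output weights change.
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * w_out[k][j] for k in range(len(w_out)))
             for j, hj in enumerate(h)]
    # Weight changes (assumed form): eta * delta of receiving unit * sender output.
    for k, row in enumerate(w_out):
        for j, v in enumerate(h + [1.0]):
            row[j] += eta * d_out[k] * v
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] += eta * d_hid[j] * v


def pattern_error(x, t, w_hidden, w_out):
    _, o = forward(x, w_hidden, w_out)
    return 0.5 * sum((tj - oj) ** 2 for tj, oj in zip(t, o))


random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
x, t = [1.0, 0.0], [1.0]  # one training pattern and its desired output

before = pattern_error(x, t, w_hidden, w_out)
for _ in range(200):
    backprop_step(x, t, w_hidden, w_out, eta=0.5)
after = pattern_error(x, t, w_hidden, w_out)
print(after < before)  # prints True
```

The sketch trains on a single pattern only to show the error decreasing; it makes no claim about how quickly, or whether, a full training set would converge.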
Studies have shown that the use of backpropagation techniques to solve the gradient descent problems presents a number of inherent difficulties. First, there is a slow rate of convergence, that is, of finding a local minimum. Studies have also shown that the rate of convergence tends to slow down as a local or global minimum is approached. Moreover, following the gradient vector does not always lead to a global minimum (or even a local minimum), because it is possible that a backpropagation method could get "lost" on a plateau in the terrain.
Additionally, the current techniques of backpropagation have difficulty adjusting the learning rates. The learning rate is defined to be the step taken along the path of steepest descent (i.e., the gradient vector) or other path of convergence to arrive at a local minimum. The currently available backpropagation methods create only uniform steps toward the minimum.
One particular problem occurs in identifying the likelihood that the path of convergence along the gradient is closely tracking a "ravine" in the weight space terrain. On the graph of the weight space terrain, it is common for the path of convergence to a minimum to roughly follow a long and narrow ravine in the terrain (such as an elongated quadratic surface). When the path of finding a local minimum follows the center line of a ravine instead of the gradient vector, the path of convergence can proceed at an extremely fast learning rate. However, the currently available gradient descent techniques are not equipped to identify the likelihood that the center line of a ravine is nearby, because the center line is not always the path of steepest descent at a given point. Those systems are blind, because they have no ability to recognize an identifiable approach to a minimum along a ravine and use that knowledge to take a greater learning step. Current techniques tend to zig-zag around the ravine, finding the local minimum only after much needless searching. If the currently available techniques could identify a ravine, the vector path toward the minimum could be adjusted to follow the ravine. Thus, there exists a current need for an improved gradient descent learning method that can more quickly find a global or local minimum for a given gradient vector and additionally adjust the gradient vector after identifying a ravine.
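By way of illustration only, the zig-zag behavior described above can be sketched on a hypothetical elongated quadratic surface, $E = \frac{1}{2}(w_1^2 + 50\,w_2^2)$, whose ravine runs along the $w_1$ axis. The surface, the step sizes, the starting points, and the function names are assumptions made for this example. The sketch illustrates only the problem; it does not implement any ravine-following method.

```python
def gradient(w1, w2):
    """Gradient of the hypothetical surface E = 0.5 * (w1**2 + 50 * w2**2)."""
    return w1, 50.0 * w2


def steepest_descent(w1, w2, eta, tol=1e-3, max_steps=10000):
    """Fixed-step descent along the negative gradient; returns the visited points."""
    path = [(w1, w2)]
    while max(abs(w1), abs(w2)) > tol and len(path) <= max_steps:
        g1, g2 = gradient(w1, w2)
        w1, w2 = w1 - eta * g1, w2 - eta * g2
        path.append((w1, w2))
    return path


# Start on the ravine wall. The step must stay small (below 2/50 here) or the
# descent diverges across the ravine, so progress along the ravine is slow.
path = steepest_descent(10.0, 1.0, eta=0.035)
steps = len(path) - 1
# Count how often w2 changes sign, i.e. how often the path crosses the center line.
sign_flips = sum(1 for (_, a), (_, b) in zip(path, path[1:]) if a * b < 0)

# Start exactly on the center line (w2 = 0): a much larger step is safe.
floor_path = steepest_descent(10.0, 0.0, eta=0.9)
floor_steps = len(floor_path) - 1

print(steps > 200)             # prints True
print(sign_flips == steps)     # prints True: every step crosses the center line
print(floor_steps < 10)        # prints True
```

In this example the uniform step is limited by the steep walls of the ravine, so the path crosses the center line on every step while creeping along the floor, whereas a path that stays on the center line tolerates a far larger step and reaches the minimum in a handful of steps.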