This invention pertains to computer systems and more particularly is concerned with artificial neural networks. Artificial neural networks (also called connectionist networks) are powerful information processors. One advantage of neural networks is their general applicability to many types of diagnosis and classification problems. Such problems involve a large number and wide variety of inputs (e.g., waveforms, spectra, images, and environmental measurements) which must be correlated to produce an output. The concept is based on research over the last 30 years in neurobiology, with efforts to simulate biological neural network functions (such as pattern recognition) in computers.
A neural network includes a plurality of processing elements called neural units arranged in layers, as shown schematically in FIG. 1. Interconnections are made between units of successive layers. A network has an input layer, an output layer, and one or more "hidden" layers in between. The hidden layer is necessary to allow solutions of nonlinear problems. Each unit functions in some ways analogously to a biological neuron; a unit is capable of generating an output signal which is determined by the weighted sum of the input signals it receives and a threshold specific to that unit. A unit is provided with inputs (either from outside the network or from other units) and uses these to compute a linear or non-linear output. The unit's output goes either to other units in subsequent layers or to outside the network. The input signals to each unit are weighted, either positively or negatively, by factors derived in a learning process.
When the weight and threshold factors have been set to correct levels, a complex stimulus pattern at the input layer successively propagates between hidden layers, to result in a simpler output pattern, such as only one output layer unit having a significantly strong output. The network is "taught" by feeding it a succession of input patterns and corresponding expected output patterns; the network "learns" by measuring the difference (at each output unit) between the expected output pattern and the pattern that it just produced. The internal weights and thresholds are then modified by a learning algorithm to provide an output pattern which more closely approximates the expected output pattern, while minimizing the error over the spectrum of input patterns. Neural network learning is an iterative process, involving multiple "lessons". Neural networks have the ability to process information in the presence of noisy or incomplete data and yet still generalize to the correct solution.
In contrast, some other approaches to artificial intelligence, e.g., expert systems, use a tree of decision rules to produce the desired outputs. These decision rules, and the tree that the set of rules constitutes, must be devised for the particular application. Expert systems are programmed, and generally cannot be trained. Because it is easier to construct examples than to devise rules, a neural network is simpler and faster to apply to new tasks than an expert system.
FIG. 2 is a schematic representation of a neural unit. Physically a unit may be, for example, one of numerous computer processors or a location in a computer memory. Conceptually, a unit works by weighting all of its inputs and subtracting a threshold from the sum of the weighted inputs, resulting in a single output which may be a binary decision, 1 or 0. However, in practice a network operates more stably when each unit's output is smoothed, so that a "soft binary decision" in the range of values between 1 and 0 is delivered.
A unit multiplies each of its inputs with a corresponding weight. This process follows the so-called propagation rule and results in a net input which is the weighted sum of all the inputs. The unit's threshold is subtracted from the weighted sum, to provide a so-called linear decision variable. When the linear decision variable is positive (when the weighted sum exceeds the threshold), a binary decision would be 1; otherwise, the binary decision would be 0. The linear decision variable, however, is passed through a nonlinear mapping so that a range of values between binary 1 and 0 is obtained. The resulting nonlinear output is the smoothed decision variable for the unit which provides a more stable operation. An output near either end of the range indicates a high confidence decision.
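The computation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `unit_output` and the example numbers are assumptions, and the logistic function is used as the nonlinear mapping because the text later names it as the activation function.

```python
import math

def unit_output(inputs, weights, threshold):
    """Return (linear decision variable, smoothed output) for one unit.

    Hypothetical helper for illustration; not taken from the figures.
    """
    # Propagation rule: the net input is the weighted sum of all inputs.
    net_input = sum(w * x for w, x in zip(weights, inputs))
    # Subtract the threshold to obtain the linear decision variable.
    linear = net_input - threshold
    # Logistic mapping gives a "soft" decision between 0 and 1.
    smoothed = 1.0 / (1.0 + math.exp(-linear))
    return linear, smoothed

# Example values are arbitrary.
linear, smoothed = unit_output([1.0, 0.0, 1.0], [0.8, -0.4, 0.3], threshold=0.5)
hard_decision = 1 if linear > 0 else 0  # the unsmoothed binary decision
```

Here the weighted sum (1.1) exceeds the threshold (0.5), so the hard decision is 1 and the smoothed output lies above the center of the range.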
The nonlinear mapping is an S-shaped curve output function. When the linear decision variable is very negative, the curve is nearly flat, giving values near the lower end of the range (binary 0 decisions with high confidence). When the linear decision variable is very positive, the curve is again nearly flat, giving values near the upper end of the range (binary 1 decisions with high confidence). Changes in the linear decision variable in the flat regions make very little difference in the unit output. When a unit makes a high confidence decision, it is quite insensitive to moderate changes in its inputs. This means the unit is robust. When the linear decision variable is zero, the mapping gives a unit output at the center of the range. This is interpreted as no decision, a low confidence condition in which 1 and 0 are deemed equally likely. This result is obtained when the weighted sum of inputs is equal to the threshold.
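The insensitivity in the flat regions can be checked numerically. The sketch below assumes the logistic function as the S-shaped mapping (the text later identifies it as such); the helper name `logistic` and the sample points are illustrative choices.

```python
import math

def logistic(x):
    """S-shaped mapping with output between 0 and 1 (assumed form)."""
    return 1.0 / (1.0 + math.exp(-x))

# At zero the output sits at the center of the range: "no decision".
center = logistic(0.0)

# The same unit-sized change in the linear decision variable...
change_near_center = logistic(0.5) - logistic(-0.5)
# ...moves the output far less when the unit is already confident.
change_in_flat_region = logistic(6.5) - logistic(5.5)
```

The output shifts by roughly a quarter of the range near the center but by well under one percent of the range in the flat region, which is the robustness the paragraph describes.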
FIG. 3 is the curve of the mapping function $f(x) = 1/(1 + e^{-x})$, where x is the linear decision variable and the output range is [0,1]; in this case, an output of 1/2 is interpreted as no decision.
Because the linear decision variable is the weighted sum of all inputs to the unit minus the threshold, the threshold can be considered as just another weight. This way of looking at a unit's operation is useful in considering learning algorithms to adjust weights and thresholds, because it allows a unified treatment of the two.
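The equivalence can be made concrete with a small sketch. The function names are hypothetical; the dummy input fixed at +1 with a weight equal to minus the threshold is one way to fold the threshold into the weight list.

```python
def decision_with_threshold(inputs, weights, threshold):
    """Weighted sum of inputs minus an explicit threshold."""
    return sum(w * x for w, x in zip(weights, inputs)) - threshold

def decision_with_extra_weight(inputs, weights, threshold):
    """Same quantity, with the threshold folded into the weights.

    A dummy input fixed at +1 is appended, and its weight is minus
    the threshold, so only a plain weighted sum remains.
    """
    extended_inputs = list(inputs) + [1.0]
    extended_weights = list(weights) + [-threshold]
    return sum(w * x for w, x in zip(extended_weights, extended_inputs))

a = decision_with_threshold([0.2, 0.9], [1.5, -0.7], threshold=0.25)
b = decision_with_extra_weight([0.2, 0.9], [1.5, -0.7], threshold=0.25)
```

Both calls give the same linear decision variable, so a learning rule that adjusts weights can adjust the threshold with no special case.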
As previously noted, units are connected together to form layered networks (which are also called manifolds). The inputs to units in the first layer are from outside the network; these inputs represent the external data to the network and are called collectively the network input. The outputs from units in the last layer are the binary decisions that the network is supposed to make and are called collectively the network output. All other units in the network are in so-called hidden layers. All inputs to units, except those in the first layer, are outputs of other units. All outputs from units in the first and hidden layers are inputs to other units. Although a unit has only a single output, that output may fan out and serve as an input to many other units, each of which may weigh that input differently.
Because the nonlinear mapping makes the units robust, neural networks are fault tolerant. If a unit fails, its output no longer reaches other units in the network. If those affected units were previously making high confidence decisions, and if this was based on many inputs (i.e., the network has a high degree of connectivity), this change in inputs will have only a small effect on the output. The result is that the damaged network provides almost the same performance as when undamaged. Any further performance degradation is gradual if units progressively fail.
The performance of a neural network depends on the set (matrix) of weights and offsets, where a unit's offset is minus its threshold. Learning algorithms are intended to compute weights and offsets that agree with a set of test cases (lessons), consisting of known inputs and desired outputs. An adjustment pass is performed for each test case in turn, until some adjustment has been made for all of the cases. This entire process is repeated (iterated) until the weights and thresholds converge to a solution, i.e., a set of values with which the network will give high-confidence outputs for various network inputs.
For each test case, the mean square error between desired and actual output is calculated as a function of the weights and thresholds. This error is a minimum at some point in parameter space. If training were done with one test case alone, the weights and thresholds would converge to a particular set of values. However, any other test case, whether the desired output is the same or different, will have a different error at each point in parameter space. The appropriate learning is achieved by minimizing the average error, where the average is taken over all test cases. This is why it is desirable to make one adjustment for each test case in turn; the sequence of adjustments is very much like an ensemble average over the test cases. The preferred class of techniques to do this is called back propagation.
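As a rough illustration of the quantity being minimized, the sketch below computes the squared error of a single logistic unit for each test case and then averages over the cases. The single-unit setup, the function names, and the two test cases are assumptions chosen to keep the example small.

```python
import math

def unit_out(inputs, weights, threshold):
    """Smoothed output of one logistic unit (assumed mapping)."""
    linear = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return 1.0 / (1.0 + math.exp(-linear))

def case_error(case, weights, threshold):
    """Squared error between desired and actual output for one test case."""
    inputs, desired = case
    return (desired - unit_out(inputs, weights, threshold)) ** 2

def average_error(cases, weights, threshold):
    """Error averaged over all test cases: the quantity learning minimizes."""
    return sum(case_error(c, weights, threshold) for c in cases) / len(cases)

# Two hypothetical test cases: (inputs, desired output).
cases = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]

untrained = average_error(cases, weights=[0.0, 0.0], threshold=0.0)
better = average_error(cases, weights=[2.0, -2.0], threshold=0.0)
```

With all weights at zero the unit outputs 1/2 for both cases, giving an average error of 0.25; the second weight setting lowers the error on both cases at once.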
An example of a known neural network learning algorithm for a single step of back propagation is shown in FIG. 4. In the algorithm, which is written in the Pascal language, the index k is the unit number, running from 0 to uc-1. The parameter uc is the unit count. The index numbers mi to uc-1 correspond to the units in the network. The index numbers 0 to mi-1 correspond to the mi external inputs, i.e., the data, to the neural net. The units with these index numbers can be thought of as dummy units: they have a fixed output (which is the neural net input with that index number), and no inputs, no weights, and no threshold. This artifice is used solely for consistency, so that for the purposes of the algorithm all unit inputs, including those of the first layer, are outputs of other units.
The pass in the forward direction, from input layer to output layer, shows how the algorithm produces outputs for each unit. For the unit numbered k, the inputs are summed with appropriate weights (the variable "ws") and an offset (the variable "ot," equal to minus the unit's threshold) is added. Note that if this code is to calculate the output of a net with instantaneous propagation, all connections must be made from a unit numbered i to a unit numbered k such that k is greater than i. That is, the connection matrix ws must be strictly upper triangular. Inputs from units with higher numbers are delayed by one sampling interval.
The variable "on" is the linear decision variable of the unit, the weighted sum of unit inputs offset by minus the threshold. The variable "out" is a nonlinear mapping of "on".
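The forward pass can be sketched in Python using the variable names from the text (uc, mi, ws, ot, on, out). This is not a transcription of the Pascal code in FIG. 4. It assumes that ws[i][k] holds the weight on the connection from unit i to unit k, that the mapping is the logistic function, and that only instantaneous (strictly upper triangular) connections exist, so the delayed inputs from higher-numbered units are omitted.

```python
import math

def forward(inputs, ws, ot, mi, uc):
    """Compute on[k] and out[k] for every unit, in index order."""
    on = [0.0] * uc
    out = [0.0] * uc
    # Units 0..mi-1 are dummy units whose output is the external input.
    for k in range(mi):
        out[k] = inputs[k]
    # Real units only see lower-numbered units (k > i), so one sweep suffices.
    for k in range(mi, uc):
        on[k] = ot[k] + sum(ws[i][k] * out[i] for i in range(k))
        out[k] = 1.0 / (1.0 + math.exp(-on[k]))
    return on, out

# Hypothetical net: 2 external inputs, one hidden unit (2), one output unit (3).
mi, uc = 2, 4
ws = [[0.0] * uc for _ in range(uc)]
ws[0][2], ws[1][2] = 1.0, -1.0   # inputs -> hidden unit
ws[0][3], ws[2][3] = 0.5, 2.0    # input 0 and hidden unit -> output unit
ot = [0.0, 0.0, 0.0, -1.0]       # offsets (minus the thresholds)

on, out = forward([1.0, 1.0], ws, ot, mi, uc)
```

In this example the hidden unit receives cancelling inputs, so its output is exactly 1/2, and the output unit's linear decision variable is -1 + 0.5 + 2(0.5) = 0.5.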
The pass in the backward direction, from output layer to input layer, implements the back propagation algorithm. It utilizes the steepest descent gradient for minimizing the mean square error between network output and desired output. The mean square error is $E = \sum_k (\mathrm{rf}_k - \mathrm{out}_k)^2$, where the sum is over those units k that supply network outputs, and "rf" is the reference output desired. The quantity "pf" (the primary feedback, so called for its role in other learning algorithms) is the partial derivative of this error with respect to "on." (There is an overall factor of -2, which is discussed below.) For those units in the output layer which supply network outputs, the derivative can be calculated directly. For units that supply inputs to other units, the calculation requires repeated application of the chain rule. (The derivation is given, for example, in Learning Internal Representations by Error Propagation, Chapter 8 of Parallel Distributed Processing, Rumelhart, Hinton, and Williams, MIT Press, Cambridge, Mass., 1986.) The algorithm allows for the possibility that a unit will provide both an external network output and an input to other units. The factor of out (1-out) is simply the derivative of the particular nonlinear mapping function (logistic activation function) used, i.e., $d\,\mathrm{out}/d\,\mathrm{on} = \mathrm{out}\,(1 - \mathrm{out})$.
The final step of the algorithm changes the weights and the thresholds by an amount proportional to the derivative of the error with respect to these quantities. The step size is scaled by the parameter d, which is adjusted empirically. It is possible to make this proportionality constant different for each unit in the network. The factor of -2 mentioned above is absorbed in the constant; the minus sign ensures that the step is in the direction of decreasing error.
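A single back propagation step, as described in the preceding paragraphs, might look like the following Python sketch. It reuses the names from the text (ws, ot, on, out, pf, rf, d, mi, uc) but is not the Pascal code of FIG. 4. The assumptions are: ws[i][k] is the weight from unit i to unit k, every unit is connected to all higher-numbered units, rf is given as a dictionary from output-unit index to desired output, and the function name `backprop_step` is hypothetical.

```python
import math

def backprop_step(inputs, rf, ws, ot, mi, uc, d):
    """One forward pass, backward pass, and update for a single test case.

    Modifies ws and ot in place; returns the squared error before the update.
    """
    # Forward pass: dummy input units first, then real units in index order.
    on = [0.0] * uc
    out = [0.0] * uc
    out[:mi] = inputs
    for k in range(mi, uc):
        on[k] = ot[k] + sum(ws[i][k] * out[i] for i in range(k))
        out[k] = 1.0 / (1.0 + math.exp(-on[k]))

    # Backward pass: pf[k] is -1/2 times the derivative of the error
    # with respect to on[k], built up from the highest-numbered unit down.
    pf = [0.0] * uc
    for k in range(uc - 1, mi - 1, -1):
        feedback = sum(ws[k][j] * pf[j] for j in range(k + 1, uc))  # chain rule
        if k in rf:                       # unit also supplies a network output
            feedback += rf[k] - out[k]
        pf[k] = feedback * out[k] * (1.0 - out[k])  # logistic derivative

    # Update: step of size d in the direction of decreasing error.
    for k in range(mi, uc):
        ot[k] += d * pf[k]                # offset acts as a weight on input +1
        for i in range(k):
            ws[i][k] += d * pf[k] * out[i]

    return sum((rf[k] - out[k]) ** 2 for k in rf)

# Hypothetical net: 2 inputs, hidden unit 2, output unit 3, zero initial weights.
mi, uc = 2, 4
ws = [[0.0] * uc for _ in range(uc)]
ot = [0.0] * uc
errors = [backprop_step([1.0, 0.0], {3: 1.0}, ws, ot, mi, uc, d=0.5)
          for _ in range(200)]
```

Repeating the step on one test case drives that case's error down from its initial value of 0.25; training on several cases would call the step for each case in turn, as the text describes.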
The threshold can be considered as just another weight, corresponding to a dummy input that is always +1 for every unit. If this is done, the weights and thresholds can be treated in exactly the same way.
While this algorithm will in time reach a solution, many passes (iterations) are needed. An object of this invention is to provide a neural network which reaches a solution with significantly fewer iterations than required by the prior art.