Artificial neural networks ("ANNs") are well known in the prior art. The role of an ANN is to perform a non-parametric, nonlinear, multi-variate mapping from one set of variables to another. ANN 10 of FIG. 1 illustrates such a mapping by operating on input vector 12 to produce output vector 14. To perform this mapping, a training algorithm is applied to deduce the input/output relationship(s) from example data. Such ANN training algorithms are also well known in the prior art.
Prior to training, an ANN is initialized by randomly assigning values to free parameters known as weights. The training algorithm takes an unorganized ANN and a set of training input and output vectors and, through an iterative process, adjusts the values of the weights. Ideally, by the end of the training process, presentation of a vector of inputs from the training data to the ANN results in activations (outputs) at the output layer that exactly match the proper training data outputs.
The basic unit that makes up an artificial neural network is variously known as an artificial neuron, neurode, or simply a node. As depicted in FIG. 2, each ANN node 16 has a number of variable inputs 15, a constant unity input 19 (also known as a bias or a bias input), and an output 17. The variable inputs correspond to outputs of previous nodes in the ANN. Each input to a node, including the bias, is multiplied by a weight associated with that particular input of that particular node. All of the weighted inputs are summed. The summed value is provided to a nonlinear univariate function known as a transfer or squashing function. The purpose of a squashing function is two-fold: to limit (threshold) the magnitude of the activation (output) achievable by each node, and to introduce a source of non-linearity into the ANN. The most commonly applied transfer functions for continuous mappings are the hyperbolic tangent function and the sigmoid function, the latter of which, for a summed input $s$, is given below:
$$\phi(s) = \frac{1}{1 + e^{-s}}$$
As expressed in Equation (2) below, $x_j^k$, the output of node number $j$ belonging to a layer $k$, is simply the transfer function $\phi$ evaluated at the sum of the weighted inputs.
$$x_j^k = \phi\left(\sum_i w_{ij}\, x_i^{k-1}\right) \qquad \text{Equation (2)}$$
In Equation (2), $x_i^{k-1}$ is the activation of node number $i$ in the previous layer, $w_{ij}$ represents the weight between node $j$ in the $k$-th layer and node $i$ in the previous layer, and $\phi$ represents the transfer function. The sum runs over all inputs to node $j$, including the constant unity bias input.
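By way of illustration only, the computation performed by a single node may be sketched in Python as follows. The sketch assumes the logistic form of the sigmoid as the transfer function; the names `sigmoid`, `node_output`, and `bias_weight`, and the numeric values, are hypothetical and are not taken from the figures.

```python
import math

def sigmoid(s):
    """Logistic squashing function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def node_output(inputs, weights, bias_weight):
    """Activation of one node: transfer function of the weighted input sum.

    The bias is modelled as a constant unity input with its own weight.
    """
    weighted_sum = bias_weight * 1.0
    for x_i, w_ij in zip(inputs, weights):
        weighted_sum += w_ij * x_i
    return sigmoid(weighted_sum)

# Hypothetical activations from two previous-layer nodes and their weights.
activation = node_output([0.5, -1.0], [0.8, 0.2], bias_weight=0.1)
print(activation)
```

Whatever the magnitude of the weighted sum, the returned activation remains strictly between 0 and 1, which is the thresholding behavior attributed to the squashing function.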
A basic feedforward artificial neural network incorporates a number of nodes organized into layers. Most feedforward ANNs contain three or more such layers. An example ANN is illustrated in FIG. 3. ANN 10 consists of an input layer 18 communicatively connected to one or more hidden layers 20. Hidden layers 20 can be communicatively connected to one another, to input layer 18 and to output layer 22. All layers are comprised of one or more nodes 16. Information flows from left to right, from each layer to the next adjacent layer.
The nodes in the input layer are assigned activation values corresponding to the input variables. The activation of each of these nodes is supplied as a weighted input to the next layer. In networks involving three or more layers, the interior layers are known as hidden layers. After one or more hidden layers, the final layer, known as the output layer, is reached. The activations of the nodes of the output layer correspond to the output variables of the mapping.
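The layer-by-layer flow of activations described above may be sketched as follows. This is a minimal sketch under stated assumptions: a sigmoid transfer function at every non-input node, and a hypothetical weight layout in which `layers[k][j]` lists the weights of node `j` with its bias weight stored first.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(inputs, layers):
    """Propagate activations from the input layer to the output layer.

    Returns the activations of every layer, the input layer included,
    so that the hidden-layer values can also be inspected.
    """
    activations = [list(inputs)]
    for weight_matrix in layers:
        previous = [1.0] + activations[-1]  # prepend the constant unity input
        current = [
            sigmoid(sum(w * x for w, x in zip(node_weights, previous)))
            for node_weights in weight_matrix
        ]
        activations.append(current)
    return activations

# Hypothetical network: 2 inputs, 3 hidden nodes, 1 output node.
hidden_weights = [[0.1, 0.4, -0.3], [-0.2, 0.1, 0.5], [0.0, -0.6, 0.2]]
output_weights = [[0.05, 0.3, -0.4, 0.2]]
layer_activations = forward([0.2, 0.7], [hidden_weights, output_weights])
print(layer_activations[-1])  # activations of the output layer
```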
The interior layers of the network are known as hidden layers to distinguish them from the input and output layers whose activations have an easily interpretable relationship with something meaningful. The hidden layers perform an internal feature detection role, and thus harbor an internal representation of the relationships between the inputs and outputs of the mapping, but are usually not of use to and are generally "hidden" from the attention of the user.
As previously mentioned, transfer functions help the mapping by thresholding the activations of nodes. This is desirable as it forces the ANN to form distributed relationships and does not allow one or a few nodes to achieve very large activations with any particular input/output patterns. These requirements and restrictions upon the behavior of the ANN help to ensure proper generalization and stability during training and render the ANN more noise-tolerant. However, a consideration raised by transfer functions is that they generally cause ANN outputs to be limited to the ranges [0,1] or [-1,1]. This necessitates a transformation to and from the ranges of the output variables and the transfer function. In practice, a network is trained with example inputs and outputs linearly scaled to the appropriate range, that is, just within the tails of the transfer function. When the network is deployed, the inputs are again scaled, but the outputs of the network are usually "descaled" by applying the inverse of the scaling function. The descaling provides real-world units and values to the otherwise unit-less fractional values generated by the ANN.
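The scaling and descaling steps may be sketched as a pair of inverse linear functions. The target range of 0.1 to 0.9 is an assumption chosen here to sit just inside the tails of a [0,1] transfer function; the text does not specify the limits, and the name `make_scaler` is hypothetical.

```python
def make_scaler(lo, hi, target_lo=0.1, target_hi=0.9):
    """Build a linear scaling function and its inverse for one variable.

    `lo` and `hi` are the real-world limits of the variable; the target
    limits (assumed values) lie just within the transfer-function range.
    """
    span = hi - lo
    target_span = target_hi - target_lo

    def scale(value):
        return target_lo + (value - lo) * target_span / span

    def descale(scaled):
        return lo + (scaled - target_lo) * span / target_span

    return scale, descale

# Hypothetical output variable ranging from 100 to 500 real-world units.
scale, descale = make_scaler(100.0, 500.0)
network_target = scale(250.0)           # value used during training
real_world = descale(network_target)    # recovers the original units
print(network_target, real_world)
```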
When a network is generated or initialized, the weights are randomly set to values near zero. At the start of the ANN training process, as would be expected, the untrained ANN does not perform the desired mapping very well. A training algorithm incorporating some optimization technique must be applied to change the weights to provide an accurate mapping. The training is done in an iterative manner as prescribed by the training algorithm. The optimization techniques fall into one of two categories: stochastic or deterministic.
Stochastic techniques include simulated annealing and genetic algorithms; they generally avoid learning instabilities and slowly locate a near-global optimum (actually a minimum in the error surface) for the weights. Deterministic methods, such as gradient descent, very quickly find a minimum but are susceptible to local minima. Whichever category of optimization is applied, sufficient data representative of the mapping to be performed must be selected and supplied to the training algorithm.
Training data selection is generally a nontrivial task. An ANN is only as representative of the functional mapping as the data used to train it. Any features or characteristics of the mapping not included (or hinted at) within the training data will not be represented in the ANN. Selection of a good representative sample requires analysis of historical data and trial and error. A sufficient number of points must be selected from each area in the data representing or revealing new or different behavior of the mapping. This selection is generally accomplished with some form of stratified random sampling, i.e., randomly selecting a certain number of points from each region of interest.
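One possible form of the stratified random sampling mentioned above is sketched below. The function `stratified_sample`, the `region_of` callback that assigns each record to a region of interest, and the example regions are all hypothetical; how regions are identified in practice depends on the analysis of the historical data.

```python
import random

def stratified_sample(records, region_of, per_region, seed=0):
    """Randomly select up to `per_region` records from each region.

    `region_of` maps a record to a sortable label identifying the region
    of interest it belongs to (an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_region = {}
    for record in records:
        by_region.setdefault(region_of(record), []).append(record)

    sample = []
    for region in sorted(by_region):
        members = by_region[region]
        count = min(per_region, len(members))
        sample.extend(rng.sample(members, count))
    return sample

# Hypothetical data: 100 readings split into four regions of 25 values.
readings = list(range(100))
training_points = stratified_sample(readings, lambda v: v // 25, per_region=5)
print(len(training_points))
```

Sampling each region separately keeps sparsely populated regions from being crowded out by regions that dominate the historical data.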
Most training algorithms for feedforward networks incorporate one form or another of a gradient descent technique and collectively are known as back-propagation training. The term back-propagation describes the manner in which the error gradient calculation propagates through the ANN. The expression for the prediction error $\delta_j$ at some node $j$ in the output layer is simply the difference between the training data output and the ANN output.
$$\delta_j^{\text{output}} = x_j^{\text{desired}} - x_j^{\text{output}} \qquad \text{Equation (4)}$$
The expression for the error at some node i in a previous (to the output) layer may be expressed in terms of the errors at the subsequent nodes to which node i is connected. ##EQU3##
These error terms, along with neuron activations throughout the net and an additional training parameter $\alpha$ called the learning rate (which takes a positive value generally less than unity), provide the necessary information to adjust the weights throughout the ANN. The following expression for the weight update between node $i$ in one layer and node $j$ in the next is known as the general delta rule (GDR).
$$\Delta w_{ij}^{k+1} = \alpha\, \delta_j^{k+1}\, x_i^k$$
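The presentation of a single training pattern, followed by a delta-rule weight update, may be sketched as follows for a network with one hidden layer. The output-layer error is the desired output minus the actual output, as in Equation (4). The hidden-layer error is computed here in a common textbook form (the sigmoid derivative multiplied by the weighted sum of downstream errors), which is an assumption of this sketch. The function name, the weight layout (bias weight first), and the numeric values are likewise hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_pattern(w_hidden, w_output, inputs, desired, alpha=0.5):
    """Present one pattern, update all weights in place, return the outputs."""
    # Forward pass; index 0 of each input list is the constant unity input.
    x_in = [1.0] + list(inputs)
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, x_in))) for ws in w_hidden]
    x_hid = [1.0] + hidden
    output = [sigmoid(sum(w * x for w, x in zip(ws, x_hid))) for ws in w_output]

    # Output-layer error: desired minus actual.
    delta_out = [d - o for d, o in zip(desired, output)]

    # Hidden-layer error (assumed form), computed before any weight changes.
    delta_hid = []
    for i, h in enumerate(hidden):
        downstream = sum(w_output[j][i + 1] * delta_out[j]
                         for j in range(len(w_output)))
        delta_hid.append(h * (1.0 - h) * downstream)

    # Delta rule: each weight changes by alpha * delta_j * x_i.
    for j, ws in enumerate(w_output):
        for i in range(len(ws)):
            ws[i] += alpha * delta_out[j] * x_hid[i]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] += alpha * delta_hid[j] * x_in[i]
    return output

# Hypothetical small network: 2 inputs, 2 hidden nodes, 1 output node.
w_hidden = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
w_output = [[0.05, -0.1, 0.2]]
for _ in range(200):
    train_pattern(w_hidden, w_output, [0.5, 0.9], [0.8])
print(train_pattern(w_hidden, w_output, [0.5, 0.9], [0.8]))
```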
The weights throughout the network are updated as above each time a training pattern is presented to the ANN. To avoid learning instabilities when different patterns are pulling the weights back and forth, and also to generally converge faster, the cumulative delta rule (CDR) is frequently employed. The CDR uses the same expression for the weight update as the GDR, but all weight updates are accumulated and implemented at the same time each time the entire training dataset is presented to the ANN.
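The difference between per-pattern and cumulative updating may be illustrated with a single sigmoid node, a simplification chosen to keep the sketch short. The weights are held fixed while the whole dataset is presented, and the accumulated updates are applied once at the end. The function name and the data are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def epoch_cumulative(weights, patterns, alpha=0.5):
    """One presentation of the entire dataset under the cumulative rule.

    `weights[0]` is the bias weight. Returns the new weight list; the
    list passed in is left unchanged.
    """
    accumulated = [0.0] * len(weights)
    for inputs, desired in patterns:
        x = [1.0] + list(inputs)
        output = sigmoid(sum(w * x_i for w, x_i in zip(weights, x)))
        delta = desired - output
        for i, x_i in enumerate(x):
            # Same update expression as the per-pattern rule, but only
            # accumulated here rather than applied immediately.
            accumulated[i] += alpha * delta * x_i
    return [w + dw for w, dw in zip(weights, accumulated)]

# Hypothetical two-pattern dataset for a node with one variable input.
patterns = [([0.0], 0.0), ([1.0], 1.0)]
weights = [0.0, 0.0]
for _ in range(100):
    weights = epoch_cumulative(weights, patterns)
print(weights)
```

In the first pass of this example the two patterns pull the bias weight in opposite directions by equal amounts, so their contributions cancel instead of moving the weight back and forth.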
In order to help avoid learning instabilities known as local minima, which are concave areas on the weight surface where the gradient goes to zero, a term is usually added to either the general or cumulative delta rule that sometimes helps carry the weights outside the local minima. The resultant expression, Equation (7) below, is called the general or cumulative delta rule with momentum. The parameter $\beta$ associated with the momentum term is set to a value in the interval (0,1) and is referred to simply as the momentum. The momentum term is $\beta$ multiplied by the weight update applied at the previous step, $t-1$.
$$\Delta w_{ij}^{k+1}(t) = \alpha\, \delta_j^{k+1}\, x_i^k + \beta\, \Delta w_{ij}^{k+1}(t-1) \qquad \text{Equation (7)}$$
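The effect of the momentum term on a single weight may be sketched as follows, with the momentum term taken from the update applied at the previous step. The function name and the default values of `alpha` and `beta` are hypothetical.

```python
def update_with_momentum(weight, delta_j, x_i, previous_update,
                         alpha=0.1, beta=0.9):
    """Apply the delta rule with momentum to one weight.

    Returns the new weight and the update just applied, which the caller
    passes back in as `previous_update` at the next step.
    """
    update = alpha * delta_j * x_i + beta * previous_update
    return weight + update, update

# Where the error term is zero (for example, at a flat spot), the weight
# still moves because part of the previous update is carried forward.
weight, update = 1.0, 0.5
for step in range(3):
    weight, update = update_with_momentum(weight, 0.0, 1.0, update)
    print(step, weight, update)
```

Because `beta` is less than one, the carried-over motion shrinks at every step unless the error term renews it.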
The back-propagation learning rules expressed above must be applied many times in the weight optimization process. Sometimes the value of the learning rate is not held constant throughout the training but is instead allowed to vary according to a schedule. For example, for the first 10,000 presentations of the training dataset, $\alpha$ might be set to 1. This corresponds to the ANN taking bold steps through the weight space. For the next 10,000 presentations, $\alpha$ might be set to 0.7. As the ANN trains, the reduction continues and the ANN takes more timid and refined steps. This learning rate schedule assumes that large steps are appropriate at the start of training and that very small steps help find the very bottom of the local area at the end of training.
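A stepwise schedule of this kind may be sketched as follows. The first two values (1 and 0.7) follow the example above; continuing the reduction by the same factor of 0.7 in every later block of 10,000 presentations is an assumption of this sketch, as the text does not state how the reduction proceeds.

```python
def learning_rate(presentation, initial=1.0, decay=0.7, block=10_000):
    """Learning rate for a given (zero-based) dataset presentation.

    The rate is held constant within each block of presentations and is
    multiplied by `decay` at the start of each new block (assumed rule).
    """
    return initial * decay ** (presentation // block)

for presentation in (0, 9_999, 10_000, 20_000, 30_000):
    print(presentation, learning_rate(presentation))
```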