1. Field of the Invention
The present invention generally relates to methods for enhancing the performance of artificially intelligent systems (AIS) employing artificial neural networks. More specifically, the present invention relates to a method for providing a more accurate training set for an AIS, whereby the performance of the AIS is enhanced.
2. Description of the Prior Art
Artificially intelligent systems employing artificial neural networks are now widely used in both military and industrial applications. Typical applications include pattern classification, pattern completion, function approximation, optimization, prediction, and automatic control. Artificial neural networks are patterned after the intricate and sophisticated neural system of the human brain and are thereby able to mimic human intelligence. A neural network thus makes available a powerful additional approach for problem solving having capabilities which are not available using other more conventional approaches.
All artificial neural networks perform essentially the same function: they accept a set of inputs and produce a corresponding set of outputs, an operation called vector mapping. Likewise, all neural network applications are special cases of vector mapping.
FIG. 1 illustrates a general view of a neural network as a vector mapper. As shown, an artificial neural network 10 accepts a set of inputs (an input vector) and produces a set of outputs according to some mapping relationship encoded in its structure. For example, FIG. 2 shows a system that maps an input vector with three components--height, weight, and age, into an output vector with two components--life expectancy and insurance premium.
The nature of the mapping relationship between input and output vectors is defined by the values of free variables (often called weights) within the network. FIG. 3 shows a configuration where weights (the numbers on the interconnecting lines 14) scale the inputs to processing units 12 (circles). FIG. 4 shows a special case, the feed-forward network, in which the signals flow only from input to output. The mapping relationship between input and output vectors may be static, where each application of a given input vector always produces the same output vector, or it may be dynamic, where the output produced depends upon previous, as well as current, inputs and/or outputs. Since feed-forward networks have no memory, they are only capable of implementing static mappings. Adding feedback allows the network to produce dynamic mappings.
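The distinction between static and dynamic mappings may be illustrated by a minimal sketch. The single-unit structure, the weight values, and the names `feed_forward` and `FeedbackUnit` are hypothetical and chosen only for illustration; they do not correspond to any figure.

```python
import math


def feed_forward(x, w=0.5):
    """Static mapping: the output depends only on the current input."""
    return math.tanh(w * x)


class FeedbackUnit:
    """Dynamic mapping: the previous output is fed back as an extra input."""

    def __init__(self, w_in=0.5, w_fb=0.8):
        self.w_in = w_in        # weight on the external input (assumed value)
        self.w_fb = w_fb        # weight on the fed-back output (assumed value)
        self.prev_out = 0.0     # memory of the last output

    def step(self, x):
        self.prev_out = math.tanh(self.w_in * x + self.w_fb * self.prev_out)
        return self.prev_out


# Apply the same input three times to each unit.
unit = FeedbackUnit()
static_outputs = [feed_forward(1.0) for _ in range(3)]
dynamic_outputs = [unit.step(1.0) for _ in range(3)]
```

The feed-forward unit returns the same value on every application, whereas the unit with feedback returns a different value each time because its output depends on its own history.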
Artificial neural networks learn from experience. This characteristic, perhaps more than any other, has created the current interest in these methods. In addition to the anthropomorphic implications (that are usually inappropriate), learning offers a powerful alternative to programming. Learning methods may be broadly grouped as supervised and unsupervised with a great many paradigms implementing each method.
In supervised learning, the network is trained on a training set consisting of vector pairs. One vector is applied to the input of the network; the other is used as a "target" representing the desired output. Training is accomplished by applying each input vector and adjusting the network weights in a direction that reduces the difference between the actual network output and the desired output vector. This process may be an iterative procedure, or the weights may be calculated by closed-form equations.
In iterative training, application of an input vector causes the network to produce an output vector. This is compared to the target vector, thereby producing an error signal which is then used to modify the network weights. This weight correction may be general, equally applied as a reinforcement to all parts of the network, or it may be specific, with each weight receiving an appropriate adjustment. In either case the weight adjustment is intended to be in a direction that reduces the difference between the output and target vectors. Vectors from the training set are applied to the network repeatedly until the error is at an acceptably low value. If the training process is successful, the network is capable of performing the desired mapping.
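The iterative procedure described above may be sketched for the simplest possible case, a single linear unit trained with a specific (per-weight) correction. The training rate, stopping tolerance, training pairs, and function name are assumptions made for this illustration only.

```python
def train_linear_unit(pairs, rate=0.1, tolerance=1e-6, max_passes=10000):
    """Repeatedly apply the training set until the squared error is small."""
    weights = [0.0] * len(pairs[0][0])
    for _ in range(max_passes):
        total_error = 0.0
        for x, target in pairs:
            output = sum(w * xi for w, xi in zip(weights, x))
            error = target - output          # the error signal
            # Specific correction: each weight is adjusted in proportion
            # to its own input, in the direction that reduces the error.
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            total_error += error ** 2
        if total_error < tolerance:          # acceptably low error reached
            break
    return weights


# Hypothetical training pairs generated by target = 2*x1 - 1*x2.
training_set = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
learned = train_linear_unit(training_set)
```

After training, `learned` should lie close to the generating weights 2 and -1, at which point the unit performs the desired mapping.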
Unsupervised learning (sometimes called self-organization) requires only input vectors to train the network. During the training process the network weights are adjusted so that similar inputs produce similar outputs. This is accomplished by a training algorithm that extracts statistical regularities from the training set, representing them as the values of network weights.
Classification is a special case of vector mapping and has an extremely broad range of applications. Here, the network operates to assign each input vector to a category. For example, an input vector might represent the values of the leading economic indicators on a specific date; the two classes might be "Dow Jones Up" and "Dow Jones Down" on the following day.
Classification is a central concern in the study of artificial intelligence. An efficient and effective replication of the human ability to classify patterns would open the door to a host of important applications. These include interpretation of handwriting, connected speech, and visual images. Unfortunately, this goal has been elusive. In most cases humans still perform these tasks far better than any machine devised to date.
The classification decision is based both upon measurements of the object's characteristics and upon a data base containing information about the characteristics and classifications of similar objects. Therefore, implicit in this process is the collection of data that characterize the statistical properties of the objects being classified.
For example, a lumber mill might wish to automatically separate pieces of pine, spruce, oak, and redwood, putting them into separate bins. This classification could be accomplished by making a set of measurements on each piece, such as color, density, hardness, etc. This set of measurements, expressed numerically, forms a feature vector for that piece, where each measurement is a vector component. The feature vector is compared with a set of recorded feature vectors and their known classifications, which together comprise a training set. By some method, a decision is made regarding the correct classification.
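One simple decision method, offered here only as an assumed example since the text does not prescribe one, is to assign the class of the nearest recorded feature vector. The feature values below are invented for illustration and are not measurements of real lumber.

```python
import math

# Hypothetical recorded feature vectors (color, density, hardness),
# each scaled to the range 0..1, with their known classifications.
TRAINING_SET = [
    ((0.80, 0.45, 0.30), "pine"),
    ((0.75, 0.40, 0.35), "spruce"),
    ((0.50, 0.75, 0.90), "oak"),
    ((0.30, 0.42, 0.40), "redwood"),
]


def classify(feature_vector, training_set=TRAINING_SET):
    """Return the class of the training vector nearest to feature_vector."""
    nearest = min(training_set,
                  key=lambda pair: math.dist(feature_vector, pair[0]))
    return nearest[1]


bin_for_piece = classify((0.52, 0.70, 0.85))
```

With these invented values, the measured piece lies closest to the recorded oak vector and is therefore assigned to the oak bin.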
The ability to successfully train an artificial neural network is of vital importance to its successful utilization in a practical application. A key event in the history of artificial neural networks was the development in 1982 of a feed-forward network trained by backpropagation. For the first time a theoretically sound technique was available to train multilayer, feed-forward networks with nonlinear neurons. The power and generality of such networks had been known for many years, but prior to backpropagation there was no efficient, theoretically sound method for training their weights.
Today, backpropagation (or one of its many variations) is by far the most commonly applied neural network training method. Unfortunately, backpropagation is a mixed blessing, exhibiting a number of serious problems while training. For example, the user is required to select three arbitrary coefficients: training rate, momentum, and the range of the random weight initialization. There is little theory to guide their determination. An unfortunate choice can cause slow convergence or network paralysis where learning virtually stops. Paralysis usually requires a complete retraining, thereby losing all benefit from what may have been days of computation. Also, due to the minimum error seeking nature of the algorithm, the network can become trapped in a local minimum of the error function, arriving at an unacceptable solution when a much better one exists.
While these problems are serious, perhaps backpropagation's most onerous characteristic is its long training time. Training sessions of days, even weeks, are common; this is not merely an inconvenience. While in theory training need only be done once, system development inevitably requires a certain amount of iterative optimization, particularly when developing features from the data set. This means that the backpropagation network must often be trained many times. In most practical applications this imposes long waiting periods in the development process. Given the tight schedules which seem characteristic of today's projects, promising alternatives may go unexplored, parameter optimization may be prematurely terminated, and non-optimal results are often produced. Although attempts have been made to devise heuristics to increase training speed, the results are highly problem dependent, and prediction of effectiveness is difficult. Nevertheless, with all of these problems, backpropagation remains a highly effective paradigm.
FIG. 5 is an example of a feed-forward network with a single nonlinear hidden layer and a linear output layer, wherein backpropagation is used for training. While there are many elaborations, such as adding more layers, or making the output layer nonlinear, this configuration is adequate to approximate any continuous function.
In operation, a set of inputs, $x_1, x_2, \ldots, x_m$, is applied to the network. It is useful to think of these as comprising an input vector $X$. The network performs calculations on this vector, producing a set of outputs, $out_1, out_2, \ldots, out_p$, collectively referred to as the output vector "out".
In the first layer, the input row vector $X$ is multiplied by weight matrix $W^{1}$, thereby accomplishing the multiplication and summation indicated in FIG. 5. The product produces a vector called net. Each component of net is then operated on by nonlinear function $f()$, thereby producing the vector $y$. Thus,
$$y = f(XW^{1})$$
where $X$ and $y$ are row vectors, each column of $W^{1}$ contains the weights associated with a single hidden layer neuron, and $f()$ operates on the product in a component-by-component fashion. $f()$ is often the sigmoidal logistic function:
$$y = \frac{1}{1 + e^{-\beta\,\mathrm{net}}}$$
Large values of $\beta$ produce a steep function approaching a step, whereas small values produce a smoother function. $\beta$ will be assumed to be 1.0. The range of the logistic function is 0.0 to 1.0. Since it bounds the neuron's output between these values, it is often called a squashing function. Often the function's range is offset by subtracting 0.5, a change that makes training somewhat faster. The logistic function has an important computational advantage in that its derivative (which is used during training) has the simple form:
$$f'(\mathrm{net}) = f(\mathrm{net})\,[1 - f(\mathrm{net})]$$
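The simple form of the derivative may be checked numerically. The following sketch, in which the function names and the finite-difference step size are assumptions of this illustration, compares the closed-form derivative for a steepness of 1.0 against a central-difference estimate.

```python
import math


def logistic(net, beta=1.0):
    """Sigmoidal logistic squashing function with steepness beta."""
    return 1.0 / (1.0 + math.exp(-beta * net))


def logistic_derivative(net):
    """Closed-form derivative for beta = 1: f(net) * (1 - f(net))."""
    y = logistic(net)
    return y * (1.0 - y)


def numerical_derivative(f, net, h=1e-6):
    """Central-difference estimate of df/dnet, used only as a check."""
    return (f(net + h) - f(net - h)) / (2.0 * h)


# Largest disagreement between the two derivatives at a few sample points.
max_gap = max(abs(logistic_derivative(v) - numerical_derivative(logistic, v))
              for v in (-2.0, 0.0, 1.5))
```

Because the derivative is expressed in terms of the function value itself, it can be obtained during training from the neuron output already computed in the forward pass, without a further exponential evaluation.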
The hyperbolic tangent is another commonly used squashing function having a range from -1.0 to +1.0:
$$y = \tanh(\mathrm{net})$$
Many other squashing functions may be used. The choice does not seem to be critical as long as the function is nonlinear and bounds the neuron's output.
The output layer in FIG. 5 multiplies vector $y$ by weight matrix $W^{2}$, producing output vector "out". Hence,
$$out = f(XW^{1})\,W^{2}$$
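The forward calculation through the nonlinear hidden layer and the linear output layer may be sketched as follows. The layer sizes, the weight values, and the helper names are hypothetical; matrices are stored as lists of rows so that each column holds the weights of one neuron, as described above.

```python
import math


def logistic(net):
    """Logistic squashing function with steepness 1.0."""
    return 1.0 / (1.0 + math.exp(-net))


def row_times_matrix(row, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(n_cols)]


def forward(x, w1, w2):
    """Compute out = f(x W1) W2 for a single input row vector x."""
    net = row_times_matrix(x, w1)        # multiplication and summation
    y = [logistic(v) for v in net]       # component-by-component squashing
    return row_times_matrix(y, w2)       # linear output layer


# Hypothetical weights: 2 inputs, 3 hidden neurons, 1 output.
W1 = [[0.1, -0.2, 0.3],
      [0.4, 0.0, -0.1]]
W2 = [[0.5], [-0.5], [1.0]]
out = forward([1.0, 2.0], W1, W2)
```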
Since supervised training is used in backpropagation, a training set is required consisting of vector training pairs. Each training pair is composed of an input vector x and a target vector t. The target vector represents the set of values desired from the network when the input vector x is applied.
Before training, the network weights are initialized to small, random numbers. The optimal range of these numbers is problem dependent; however, it is safest to start with a range around ±0.1. While larger values can accelerate convergence on some problems, they can also lead to network paralysis.
The object of training is to adjust the weight matrices so that the network's actual output is more like the desired output. More formally, the algorithm minimizes an error measure between the output vector and target vector. This error measure is computed and the weights are adjusted for each training pair. While many error measures are possible, in this case the error measure used is: ##EQU1## where i is the number of components in the output vector.
Alternatively, the error may be averaged over all vector pairs in the training set, in which case the error measure is: ##EQU2## where n is the number of training vectors in the training set.
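The two error measures may be sketched in code. Since the expressions themselves are not reproduced in this text, the sketch assumes the common sum of squared differences between target and output components, without any scaling factor; the actual measures may differ by such a factor. The function names are hypothetical.

```python
def pair_error(out, target):
    """Assumed per-pair measure: sum of squared component differences."""
    return sum((t - o) ** 2 for o, t in zip(out, target))


def mean_error(outputs, targets):
    """Per-pair error averaged over all vector pairs in the training set."""
    errors = [pair_error(o, t) for o, t in zip(outputs, targets)]
    return sum(errors) / len(errors)


single = pair_error([1.0, 2.0], [1.0, 4.0])
average = mean_error([[0.0], [1.0]], [[1.0], [1.0]])
```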
If only two weights in a network are considered, their values can be thought of as defining the x-y coordinates of a point on a table top. As shown in FIG. 6, the error may be visualized as a rubber sheet hovering above that surface, where the height of the sheet at each point is determined by the error at that x-y position. The rubber sheet represents the error surface. Direct visualization breaks down if more weights are involved; the rubber sheet image is still useful, but its height at every point is now a function of more variables.
The backpropagation training algorithm uses gradient descent, a multi-dimensional optimization method used for hundreds of years. In essence, the method changes each weight in a direction that reduces the error. This change may be made each time an input vector is applied, or the changes may be averaged and the weights changed after all input vectors have been seen. In either case, many passes through the training set may be required to reduce the error to an acceptable value.
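A minimal sketch of gradient-descent training for a network with one logistic hidden layer and a linear output layer follows, with the weights changed as each input vector is applied. It is an illustration under stated assumptions, not a definitive implementation: the training rate of 0.1, the four hidden neurons, the absence of momentum, the squared-error measure, the exclusive-or training pairs, and the constant 1.0 appended to each input vector to act as a bias are all choices made for this example.

```python
import math
import random


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w1, w2):
    """Return hidden outputs y and network outputs out = f(x W1) W2."""
    y = [logistic(sum(x[i] * w1[i][j] for i in range(len(x))))
         for j in range(len(w1[0]))]
    out = [sum(y[j] * w2[j][k] for j in range(len(y)))
           for k in range(len(w2[0]))]
    return y, out


def total_error(pairs, w1, w2):
    """Squared error summed over all outputs and all training pairs."""
    return sum((t_k - o_k) ** 2
               for x, t in pairs
               for t_k, o_k in zip(t, forward(x, w1, w2)[1]))


def train(pairs, n_hidden=4, rate=0.1, passes=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    # Small random initial weights in the range -0.1 to +0.1.
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_hidden)]
    initial = total_error(pairs, w1, w2)
    for _ in range(passes):
        for x, t in pairs:
            y, out = forward(x, w1, w2)
            # Error signal at the linear output layer.
            d_out = [t_k - o_k for t_k, o_k in zip(t, out)]
            # Error propagated back through W2, scaled by the logistic
            # derivative y * (1 - y); computed before W2 is changed.
            d_hid = [y[j] * (1.0 - y[j])
                     * sum(d_out[k] * w2[j][k] for k in range(n_out))
                     for j in range(n_hidden)]
            for j in range(n_hidden):
                for k in range(n_out):
                    w2[j][k] += rate * d_out[k] * y[j]
            for i in range(n_in):
                for j in range(n_hidden):
                    w1[i][j] += rate * d_hid[j] * x[i]
    return w1, w2, initial, total_error(pairs, w1, w2)


# Exclusive-or pairs; the third input component is the assumed constant bias.
XOR = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [1.0]),
       ([1.0, 0.0, 1.0], [1.0]), ([1.0, 1.0, 1.0], [0.0])]
w1, w2, error_before, error_after = train(XOR)
```

Comparing `error_before` with `error_after` shows how far the error has been reduced; how low it falls depends on the arbitrary coefficients and on the random initialization, consistent with the difficulties noted above.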
Probabilistic neural networks, a direct outgrowth of earlier work with Bayesian classifiers, represent another prior art approach for artificial neural networks. This approach uses supervised training, is nonrecursive, and, unlike backpropagation, provides for rapid training. However, a practical difficulty lies in its use of the entire training set for each classification. This increases storage requirements and lengthens classification times. Various clustering techniques have been developed that reduce the size of the training set, and with memory prices dropping rapidly, storage is not the problem that it once was. Nevertheless, classification times can become much longer than for feed-forward networks if the training set is large.
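The reliance on the entire training set at classification time can be seen in a minimal sketch of a probabilistic neural network classifier. The sketch assumes the common formulation in which a Gaussian kernel is centered on every training vector and the class with the largest average kernel response is chosen; the kernel width `sigma`, the sample data, and the function name are assumptions of this illustration.

```python
import math


def pnn_classify(x, training_set, sigma=0.5):
    """Choose the class whose training vectors respond most strongly to x."""
    sums = {}
    counts = {}
    # Every stored training vector is visited for each classification.
    for u, label in training_set:
        d2 = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
        sums[label] = sums.get(label, 0.0) + math.exp(-d2 / (2.0 * sigma ** 2))
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda label: sums[label] / counts[label])


# Hypothetical two-class training set.
SAMPLES = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
           ((5.0, 5.0), "b"), ((6.0, 5.0), "b")]
label = pnn_classify((0.2, 0.4), SAMPLES)
```

No weight adjustment is required before use, but the loop over all stored vectors is what makes classification time grow with the size of the training set.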
Another prior art neural network approach which provides for rapid training uses radial basis-function neural networks. Closely related to these networks is a general regression neural network (GRNN) which has the particular advantage of not requiring any training at all. Unfortunately, these networks require all, or a substantial portion, of the training set to be involved in their operation. As a result, after training, these networks are slower to use, requiring more computation to perform a classification or function approximation.
Considering GRNN in more detail, it will be understood that GRNN is based on nonlinear regression theory, a well-established statistical technique for function estimation. By definition, the regression of a dependent variable y on an independent variable x estimates the most probable value for y, given x and a training set. The training set consists of values for x, each with a corresponding value for y (x and y are, in general, vectors). Note that y may be corrupted by additive noise. Despite this, the regression method will produce the estimated value of y which minimizes the mean-squared error.
GRNN is based upon the following formula from statistics: ##EQU3## where:
$y$ = the output of the estimator
$X$ = the estimator input vector
$E(y|X)$ = the expected value of the output, given the input vector $X$
$f(X,y)$ = the joint probability density function (pdf) of $X$ and $y$
GRNN is, in essence, a method for estimating $f(X,y)$, given only a training set. Because the pdf is derived from the data with no preconceptions about its form, the system is perfectly general. There is no problem if the functions are composed of multiple disjoint non-Gaussian regions in any number of dimensions, as well as those of simpler distributions.
The function value $y_j$ for the GRNN is estimated optimally as follows: ##EQU4## where:
$w_{ij}$ = the target (desired) output corresponding to input training vector $u_i$ and output $j$
$h_i = \exp[-D_i^2/(2\sigma^2)]$, the output of a hidden layer neuron
$D_i^2 = (x-u_i)^T (x-u_i)$ (the squared distance between the input vector $x$ and the training vector $u_i$)
$x$ = the input vector (a column vector)
$u_i$ = training vector $i$, the center of neuron $i$ (a column vector)
$\sigma$ = a constant controlling the size of the receptive region.
FIG. 7 shows the general form of a GRNN. In GRNN, instead of training the weights, one simply assigns to $w_{ij}$ the target value directly from the training set associated with input training vector $i$ and component $j$ of its corresponding output vector. Consider a single pair of vectors in the training set, one an input vector and the other a target vector. A hidden layer neuron is created to hold the input vector. For simplicity, assume that the target vector has only one component and the network has only one output neuron. The weight between the newly created hidden neuron and the output neuron is assigned the target value. If there is more than one output neuron, the weight from the hidden neuron to each is given the corresponding target value.
As a concrete example of GRNN training, suppose that the training set contains the following training pair:
Input Vector: 1 2 5 3
Target Vector: 3 2
As shown in FIG. 8, a hidden layer neuron is created to hold the input vector. Two weights go from this neuron to output neurons. These weights, leading to outputs $y_1$ and $y_2$, have values 3 and 2, respectively, corresponding to the target vector.
As another GRNN example, FIG. 9 shows a network designed by this method which solves the celebrated exclusive-or problem. The values of $\sigma$ are chosen to be small; for a given input one and only one hidden layer output is 1, and the rest are nearly 0. Clearly, if the weights of the hidden layer neuron equal the input vector $x$, the distance between them will be zero and the above expression for $h_i$ will evaluate to 1. The small value of $\sigma$ causes the rest to be arbitrarily close to 0, thereby solving the problem.
Since GRNN can approximate any continuous function, it can easily be made into a classifier. A classifier has binary target vectors, each of which has a single one indicating the target class. Since all other components are zero, a GRNN used as a classifier will have its output layer weights set to ones and zeros. Classification then consists of applying an input vector and determining which output is greatest.
Each GRNN hidden neuron represents an input training vector, and each has that vector's classification as well as its weights. The following procedure (a specialization of the general method described above) is used to set the weights of a GRNN classifier.
For each hidden neuron:
1. Assign the weight going to the output neuron of the same class the value 1.
2. Assign weights going to all other output neurons the value 0. These have no effect upon the output neurons; therefore, these weights can be removed from the network.
Performing the above operations on the GRNN produces the network of FIG. 10.
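The GRNN construction described above may be sketched as follows. The sketch assumes the standard GRNN estimate from the Specht article cited below, in which each output is the sum of the hidden layer outputs weighted by the stored target values, divided by the sum of the hidden layer outputs; this is presumed, not confirmed, to be the estimate referred to above. The value of sigma, the class numbering, and the function names are assumptions of this illustration, and the code is not a reproduction of FIG. 9 or FIG. 10.

```python
import math


def grnn_output(x, centers, weights, sigma=0.1):
    """Normalized, target-weighted sum of the hidden layer outputs."""
    # Hidden layer: one neuron per stored training vector (its center).
    h = [math.exp(-sum((xi - ui) ** 2 for xi, ui in zip(x, u))
                  / (2.0 * sigma ** 2))
         for u in centers]
    # Caution: total can underflow to 0.0 if x is far from every center
    # and sigma is small; this sketch does not guard against that case.
    total = sum(h)
    return [sum(h[i] * weights[i][j] for i in range(len(h))) / total
            for j in range(len(weights[0]))]


# "Training" is assignment: the input vector becomes a hidden neuron center
# and the target vector becomes that neuron's output weights.
centers = [[1.0, 2.0, 5.0, 3.0]]
weights = [[3.0, 2.0]]
recalled = grnn_output([1.0, 2.0, 5.0, 3.0], centers, weights)  # [3.0, 2.0]

# Exclusive-or as a classifier: each hidden neuron has weight 1 to the
# output of its own class and weight 0 to the other output.
# Output 0 is taken to mean "false" and output 1 to mean "true".
XOR_CENTERS = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
XOR_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]


def classify_xor(x):
    """Apply the input vector and report which output is greatest."""
    out = grnn_output(x, XOR_CENTERS, XOR_WEIGHTS)
    return out.index(max(out))
```

Because no weights are iterated, the network is available as soon as the training pairs are stored; the cost is that every stored vector participates in each evaluation.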
The above discussion sets forth various prior art information regarding AIS and artificial neural networks. Additional information can be obtained from the book "Advanced Methods in Neural Computing", by Philip D. Wasserman, Van Nostrand Reinhold, New York, N.Y. 1993. More detailed information as to GRNN can be obtained from the article "A General Regression Neural Network", by Donald F. Specht, IEEE TRANSACTIONS ON NEURAL NETWORKS, Vol. 2, No. 6, November 1991, pp. 568-576.