Neural networks are systems that are deliberately constructed to make use of some of the organizational principles from the human brain. The human brain contains at least ten billion neurons, its basic computing elements. Each neuron receives inputs from other cells, integrates the inputs, and generates an output, which it then sends to other neurons. Neurons receive inputs from other neurons by way of structures called synapses and send outputs to other neurons by way of output lines called axons. A single neuron can receive on the order of hundreds or thousands of input lines and may send its output to a similar number of other neurons. In one theoretical model, a single neuron receives incoming information at many synapses. At each synapse, the input is assigned a variable synaptic weight, as determined by an associative process. The sum of the weighted inputs is compared to a threshold, and if the sum exceeds the threshold, the neuron fires an action potential, which is sent by way of the neuron's axon. Other variations on this scheme exist in both the biological and computational domains.
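The weighted-sum-and-threshold neuron model described above can be sketched as follows; the weights, threshold, and inputs below are illustrative values only, not drawn from any biological measurement.

```python
# A minimal sketch of the threshold neuron model: inputs arrive at
# synapses, each is scaled by a synaptic weight, the weighted sum is
# compared to a threshold, and the neuron fires (1) or does not (0).

def neuron_output(inputs, weights, threshold):
    """Compare the weighted sum of inputs to a threshold: fire or not."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Example: three synapses with different (illustrative) synaptic weights.
print(neuron_output([1.0, 0.5, 0.2], [0.4, 0.3, 0.1], threshold=0.5))
```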
Neural networks attempt to mirror the structure of the human brain by using massively parallel interconnection of simple computational elements. This set of connections is often arranged in a connection matrix. The overall behavior of the neural network is determined by the structure and strengths of the connections. If the problem structure is well defined, it is possible to specify the connection strengths beforehand. Otherwise, it is necessary to modify the connection strengths using a learning algorithm, thereby making the system adaptable to changes in the input information. There is often a decoupling between the training phase and the retrieval phase of operation of a network.
In the training phase, the connection strengths of the network are modified. A large number of patterns that are representative of those which the neural network must ultimately classify are chosen as a training set. In supervised learning, the desired classifications of the patterns in the training set are known. The training input vectors are input into the network to produce output responses. The network then measures the actual output response against the predetermined desired output response for the input signal and modifies itself so that the output response more closely approaches the desired output response. After successive training cycles, the network is conditioned to respond uniquely to the particular input signal to provide the desired output signal. Some examples of learning algorithms for perceptron type networks include the Perceptron Convergence Procedure, which adapts the synaptic weights based on the error between the desired and actual outputs, and the Least Mean Squares (LMS) solution, a special case of the backpropagation learning algorithm for multilayer perceptrons, which minimizes the mean square error between the desired and actual outputs. After the network has been trained, some novel information, in the form of an input vector or activity pattern, is put into the system. The novel input pattern passes through the connections to other elements, giving rise to an output pattern which contains the conclusions of the system.
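The Perceptron Convergence Procedure mentioned above can be sketched as below: on each training cycle the weights and threshold are adapted by the error between the desired and actual outputs. The training data, learning rate, and epoch count are illustrative assumptions.

```python
# A sketch of the Perceptron Convergence Procedure: error-driven
# adaptation of synaptic weights and threshold for a single node.

def train_perceptron(samples, labels, lr=0.1, epochs=100):
    """samples: list of input vectors; labels: desired outputs (0 or 1)."""
    n = len(samples[0])
    weights = [0.0] * n
    theta = 0.0  # threshold
    for _ in range(epochs):
        for x, desired in zip(samples, labels):
            s = sum(w * xi for w, xi in zip(weights, x)) - theta
            actual = 1 if s > 0 else 0
            error = desired - actual          # +1, 0, or -1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            theta -= lr * error               # threshold adapts opposite to weights
    return weights, theta

# A linearly separable toy problem: logical AND of two inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, t = train_perceptron(X, y)
```

Because AND is linearly separable, the procedure converges to weights and a threshold that classify all four patterns correctly.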
There are a number of ways of organizing the computing elements in neural networks. Typically, the elements are arranged in groups or layers of neurons. Single-layer and two-layer systems, with only an input and an output layer, are used extensively. The perceptron, first introduced by Frank Rosenblatt in a multilayered context, see Principles of Neurodynamics, New York, Spartan Books (1959), and U.S. Pat. No. 3,287,649 to Rosenblatt, and later distilled down to a single-layer analysis by Marvin Minsky & Seymour Papert, see Perceptrons, MA, MIT Press (1969), is one learning machine that is potentially capable of complex adaptive behavior. Classifying single-layer perceptron 2, as shown in FIG. 1, decides whether an input belongs to one of two classes, denoted class A or class B. N input elements 6, x_0 through x_(N-1), which together form an N-dimensional input vector, are weighted by corresponding weights 8, w_0 through w_(N-1), and input to single node 4. Node 4 computes a weighted sum of input elements 6 and subtracts threshold 10, θ, from the weighted sum. The weighted sum is passed through hard limiting nonlinearity 12, f_H, such that output 14, y, is either +1 or 0. Equation 1 represents output 14:

y = f_H(Σ w_i x_i - θ), the sum taken over i = 0 to N-1 (Equation 1)
Rosenblatt's original model of perceptron 2 used a hard limiting nonlinearity 12, where:

f_H(α) = 1 if α > 0, and f_H(α) = 0 otherwise

Other nonlinearities may be used, especially in a multilayer perceptron, such as a sigmoid nonlinearity, where:

f_S = (1 + e^(-β(Σ w_i x_i - θ)))^(-1)
The gain of the sigmoid, .beta., determines the steepness of the transition region. As the gain gets very large, the sigmoidal nonlinearity approaches a hard limiting nonlinearity. A sigmoidal nonlinearity gives a continuous valued output rather than a binary output produced by the hard limiting nonlinearity. Therefore, while the hard limiting nonlinearity outputs a discrete classification, the sigmoidal nonlinearity outputs a number between 0 and 1, which can be interpreted as a probability that a pattern is in a class.
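The contrast between the two nonlinearities, and the effect of the gain β, can be sketched as follows; the gain values chosen are illustrative.

```python
# A sketch of the hard limiting and sigmoidal nonlinearities. As the
# gain beta grows, the sigmoid's transition region steepens and its
# output approaches the hard limiter's binary output.

import math

def f_hard(s):
    """Hard limiting nonlinearity: binary output, 1 or 0."""
    return 1 if s > 0 else 0

def f_sigmoid(s, beta=1.0):
    """Sigmoidal nonlinearity: continuous output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-beta * s))

# For a fixed input s, increasing the gain drives the sigmoid
# toward the hard limiter's value.
s = 0.5
for beta in (1, 10, 100):
    print(beta, f_sigmoid(s, beta))
```

The continuous output of f_sigmoid can be read as a probability of class membership, while f_hard commits to a discrete classification.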
If single node perceptron 2 is given a set of patterns to classify, such a classification can be learned exactly only if a hyperplane can separate the points. This property is called linear separability. In Equation 1, the input into the hard limiting nonlinearity f_H is a weighted sum of the input elements times the connection strengths. If the sum is greater than threshold θ, f_H takes a value of 1, and y represents class A. If the sum is less than the threshold, then f_H takes the value of 0 and y represents class B. If there are N inputs into perceptron 2, then all possible inputs to perceptron 2 are represented in N-dimensional space. When the sum of the products of the synaptic weights times the coordinates of the inputs equals the threshold, the equation is that of a hyperplane in the N-dimensional space. In FIG. 2, hyperplane 20, separating all points in class A from those in class B in a two dimensional case, is generally represented by Equation 2:

x_1 = -(w_0/w_1) x_0 + θ/w_1 (Equation 2)
Therefore, if a hyperplane, or line in FIG. 2, can separate the points from the two classes, then the set of input patterns is linearly separable.
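The two-dimensional case of Equation 2 can be sketched as follows; the weights and threshold below are illustrative assumptions.

```python
# A sketch of the two-dimensional decision boundary of Equation 2:
# x1 = -(w0/w1)*x0 + theta/w1. Points on one side of the line are
# class A, points on the other side are class B.

def classify(x0, x1, w0, w1, theta):
    """Return 'A' if the weighted sum exceeds the threshold, else 'B'."""
    return 'A' if w0 * x0 + w1 * x1 > theta else 'B'

def boundary_x1(x0, w0, w1, theta):
    """The x1 value on the separating line (Equation 2) for a given x0."""
    return -(w0 / w1) * x0 + theta / w1

# With w0 = w1 = 1 and theta = 1, the boundary is the line x0 + x1 = 1:
# points above it fall in class A, points below it in class B.
```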
Recently, learning algorithms that develop appropriate connection strengths in multilayer networks have also been used. Multilayer networks are feed-forward networks with one or more layers of nodes between the input and output nodes. Sometimes, the input node is also considered a layer. While single layer perceptrons can only partition an input space into simple decision regions, such as half planes bounded by a hyperplane, multilayer networks can partition an input space into virtually any form of decision region; with additional layers, the network can generate convex open or closed regions and even arbitrarily complex decision regions. It is therefore important to select a number of nodes large enough to form the decision regions required for a given concept, but not so large that the required synaptic weights cannot be reliably ascertained.
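A classic illustration of the point above is the exclusive-or (XOR) function, which no single hyperplane can separate but which a two-layer network of threshold units handles easily. The weights below are hand-picked for illustration, not learned.

```python
# A sketch showing that a two-layer network of threshold units forms a
# decision region a single layer cannot: XOR is not linearly separable,
# yet this network classifies all four patterns correctly.

def f_hard(s):
    return 1 if s > 0 else 0

def two_layer_xor(x0, x1):
    # Hidden layer: two threshold units partition the input space.
    h0 = f_hard(x0 + x1 - 0.5)   # fires if at least one input is 1 (OR)
    h1 = f_hard(x0 + x1 - 1.5)   # fires only if both inputs are 1 (AND)
    # Output layer combines them: OR and not AND.
    return f_hard(h0 - h1 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, two_layer_xor(a, b))
```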
When designing a neural network, many design goals must be addressed. First, for the network to be useful, its learning algorithm should be an Occam algorithm with respect to the class of patterns to be learned. Generally, an Occam algorithm is defined as an algorithm that makes a network:
1. Consistent, and
2. Sufficiently small with respect to the complexity of the problem.

Consistency requires that the trained neural network does not misclassify any input patterns. To be sufficiently small with respect to the complexity of the problem, a sufficiently simple model to explain a given concept must be used. "Sufficiently small" has a precise mathematical definition given in Learnability and the Vapnik-Chervonenkis Dimension, Anselm Blumer et al., Journal of the Association for Computing Machinery, Vol. 36, No. 4 (1989). If an algorithm is an Occam algorithm, then a class of patterns with a given complexity will be "poly-learnable," that is, the network will learn from a polynomial number of examples, rather than an exponential number of examples. This ensures that the amount of time, the number of examples, the resources necessary or the amount of memory needed to classify a class of patterns is acceptable. As described by Eric B. Baum, Polynomial Time Algorithm That Learns Two Hidden Unit Nets, (1990):

[A] class C of boolean functions is called learnable if there is an algorithm A and a polynomial p such that for every n, for every probability distribution D on R^n, for every c ∈ C, for every 0 < ε, δ < 1, A calls examples and with probability at least 1 - δ supplies in time bounded by p(n, s, ε^(-1), δ^(-1)) a function g such that

Prob_(x∈D)[c(x) ≠ g(x)] < ε

Here s is the "size" of c, that is, the number of bits necessary to encode c in some "reasonable" encoding.
This model thus allows the algorithm A to see classified examples drawn from some natural distribution and requires that A output a hypothesis function which with high confidence (1 - δ) will make no more than a fraction ε of errors on test examples drawn from the same natural distribution.
The Baum article ensures that the amount of time and number of examples necessary to learn a model within a target error, ε, will be acceptable.
Another set of design goals that must be addressed when designing a neural network and its learning algorithm is to make the learning algorithm constructive and to appropriately prune the size of the neural network. To be constructive, a learning algorithm does not decide beforehand on the size of the required network for a particular problem, but rather "grows" the network as necessary. While constructive algorithms "grow" the network as needed, most do not ensure that the resulting network will be minimal. The Occam principle suggests that using a network that is larger than necessary can adversely affect the network's ability to generalize. Therefore, pruning the network after learning has taken place to reduce the size of the network is desirable.
When building a neural network, a local or a global model can be built. In a local model, the synaptic weights, which define the particulars of the model during the learning phase, pertain to localized regions of the input space. In a global model, on the other hand, the weights are used to classify a large portion of the input space, or in some cases, the entire input space. The advantage of a local model is that the information implicit in the weights is applied to the classification only where appropriate, thereby avoiding overgeneralization of local features. The advantage of a global model is that a small amount of information can, at times, be used to classify large areas of input space. A local-global model combines the advantages of each, looking at local features when appropriate, but also classifying large areas in the input space when possible. Therefore, what is desirable in a learning algorithm is to have both global properties, wherein a small model is used for large, uniform input regions, and local properties, wherein the network can learn details without unlearning correct aspects of the model.
A nearest neighbor pattern classifier can be used to classify linearly separable data sets, among others. A nearest neighbor type classifier has both global and local properties. FIG. 3 shows a linear decision surface classifying patterns of opposite types to build a model out of input patterns. Class A pattern 32 and class B pattern 34 could be a pair of examples or clusters of data that are linearly separable. Decision surface 30 classifies patterns 32 and 34 using a nearest neighbor type classification. If patterns 32 and 34 are a pair of examples of opposite types or classes, line 36 is drawn between the data of a first type, such as class A pattern 32, and its nearest neighbor of the opposite type, such as class B pattern 34. Decision surface 30 bisects line 36 to create two regions of classification in the model. Thus, the nearest neighbor of the opposite type classification of FIG. 3 essentially reduces the problem to a two point problem. A nearest neighbor type classifier has global properties, because a small number of classification regions can fill the entire space, as well as local properties, because classification regions may be made more local by the bounding effects of neighboring regions. The drawback to nearest neighbor type classifiers is that they contain no provision for reducing the size of the model. All the examples input during the training cycle are stored and become part of the model. Since the network size is not minimized, the size of the network can adversely affect the ability of the network to generalize. Further, if many pairs of samples of opposite type exist, the nearest neighbor type classifier, for two sample or two cluster discrimination, cannot perform the multiple linear separations necessary to sufficiently partition the input space.
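The bisecting construction of FIG. 3 can be sketched as follows: the decision surface perpendicularly bisects the line between an example and its nearest neighbor of the opposite type, so a query point is classified by which example it is closer to. The points and classes below are illustrative two-dimensional examples.

```python
# A sketch of the nearest-neighbor-of-opposite-type classification:
# the perpendicular bisector of the segment between a class A example
# and a class B example splits the space into two classification
# regions. A point falls in the region of whichever example is nearer.

def bisector_classify(p, a, b):
    """Classify point p against the bisector of segment a-b:
    'A' if p is closer to example a, 'B' if closer to example b."""
    da = sum((pi - ai) ** 2 for pi, ai in zip(p, a))
    db = sum((pi - bi) ** 2 for pi, bi in zip(p, b))
    return 'A' if da < db else 'B'

# Class A example at (0, 0) and class B example at (2, 0): the
# bisecting decision surface is the vertical line x = 1, and the two
# half planes fill the entire input space (the global property).
```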