The present invention relates to artificial neural networks and more particularly to a method of training and using the same to perform classification when data is unavailable or scarce for one or more classes to be identified. A neural network consists of many simple, densely interconnected processing elements (PEs) or units. The memory of the network resides not in the individual PEs but in the weighted connections between them. The weight of a connection is analogous to the strength of a synapse between the dendrites of two neurons in the brain. In a three-layer feedforward neural network, the first layer consists of input units, where each unit simply receives a single component, i.e., data feature, of the input vector and transmits it to all units in the next layer, called the hidden layer. Each unit in the hidden layer receives input from all input units weighted by connection weights, processes this input, and transmits an output to each unit of the output layer, again via weighted connections. The same processing of inputs occurs in the output units, resulting in a final output vector. Typically each neural unit, except the input units, sums the weighted inputs, passes the sum through a sigmoidal function, and outputs the result to the next layer of units.
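By way of illustration only, the forward pass just described may be sketched as follows. The sketch assumes a logistic sigmoid as the sigmoidal function; the function names and the weight values are hypothetical and are not taken from the specification.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, assumed here as the sigmoidal function."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights):
    """Each unit sums its weighted inputs and passes the sum through a sigmoid.

    weights[j][k] is the weight of the connection from input k to unit j.
    """
    return [sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)))
            for unit_weights in weights]

def forward(x, w_hidden, w_output):
    """Three-layer pass: input units relay x unchanged to the hidden layer."""
    hidden = layer_forward(x, w_hidden)
    return layer_forward(hidden, w_output)

# Hypothetical network: 2 input units, 2 hidden units, 1 output unit.
w_hidden = [[0.5, -0.4], [0.3, 0.8]]
w_output = [[1.0, -1.0]]
output = forward([1.0, 0.0], w_hidden, w_output)
```

Because every non-input unit applies the sigmoid, each component of the output vector lies strictly between 0 and 1.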
Neural networks learn by exposure to a set of training examples. During the training phase, the connection weights in the network are adjusted in such a way as to minimize the error in the network output. A popular example of a training algorithm is the Backpropagation algorithm applied to the feed-forward network (see Rumelhart D. E., McClelland, J. L. and the PDP Research Group, "Learning Internal Representations by Error Propagation," Parallel Distributed Processing, The MIT Press, Cambridge, MA, 1986). In Backpropagation (BP), the difference between the actual network output and the correct output is used to adjust the weights of the connections to the output layer. The typical way to express the output error is in terms of the mean squared error: $$E = \frac{1}{n}\sum_{i=1}^{n}\left(t_i - o_i\right)^2$$ where $n$ is the number of output nodes, $t_i$ is the desired (target) value at output node $i$, and $o_i$ is the actual value at output node $i$ for a given input vector. In turn, the errors in the output layer are "backpropagated" to adjust the connections for the adjacent hidden layer. These adjustments are iterated layer by layer until all connection weights are updated. The training cycle is repeated until the weights stabilize.
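A single training update of the kind described above may be sketched as follows, purely as an illustration. The sketch assumes a logistic sigmoid, a squared error averaged over the output nodes, and a fixed learning rate into which constant factors of the gradient are absorbed; the names `train_step` and `rate` and all numeric values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mse(targets, outputs):
    """Squared error averaged over the n output nodes."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n

def forward(x, w_hid, w_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, outputs

def train_step(x, targets, w_hid, w_out, rate=0.5):
    """One backpropagation update; modifies w_hid and w_out in place.

    Returns the error measured before the update.
    """
    hidden, outputs = forward(x, w_hid, w_out)
    # Output-layer error terms: output error scaled by the sigmoid slope.
    d_out = [(t - o) * o * (1.0 - o) for t, o in zip(targets, outputs)]
    # Hidden-layer error terms: output errors propagated back through the
    # output weights (computed before those weights are changed).
    d_hid = [h * (1.0 - h) * sum(d * w_out[i][j] for i, d in enumerate(d_out))
             for j, h in enumerate(hidden)]
    for i, d in enumerate(d_out):
        for j, h in enumerate(hidden):
            w_out[i][j] += rate * d * h
    for j, d in enumerate(d_hid):
        for k, xk in enumerate(x):
            w_hid[j][k] += rate * d * xk
    return mse(targets, outputs)

# Hypothetical 2-2-1 network trained repeatedly on one example.
w_hid = [[0.5, -0.4], [0.3, 0.8]]
w_out = [[1.0, -1.0]]
errors = [train_step([1.0, 0.0], [1.0], w_hid, w_out) for _ in range(200)]
```

In practice the update would be cycled over the whole training set rather than over a single example, and repeated until the weights stabilize.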
In recent years there has been much research focused on the development of automated systems. Many automation problems like pattern recognition, speech recognition, system monitoring, and automated diagnostics require distinctions between different states of the world. This problem of state distinction can often be described as a classification problem. As a result of this need for automated classifiers, many classification methods have been developed, from heuristic rule systems to artificial neural networks to varied statistical methods.
In spite of the common use of classifiers in many automatic decision-making applications, there are some important applications that are not amenable to the standard classifiers, including the standard neural network. Important classes of such applications include novelty, or unanticipated event, detection, and fault detection. An example of the former class is in sonar signal classification, where there is a need to recognize that a signal belongs to a previously unknown, but significant, source. An example of the latter class is sensor-based monitoring where the task is to interpret multiple sensor outputs and determine if the monitored system is operating normally. These classes of applications are characterized by having a wealth of data about some of the permissible classes, e.g., normal operating conditions, and a dearth of data about others, e.g., different faulted states.
Actual examples of sensor-based monitoring applications include jet engine and machine tool monitoring. Jet engines contain a suite of mounted sensors that are used to periodically measure engine parameters. It is a critical but difficult problem to analyze the often voluminous data to detect faults. Since engines rarely fail, the database contains very few examples of failure data. In engine part machining, machine tools need to be monitored for breakage and the machine stopped to prevent loss of an expensive workpiece. Since standard machining practice is already such as to minimize tool breakage, the collected data represents mostly normal cutting.
In antisubmarine warfare, classification by sonar is a key technique for identifying the presence of enemy ships. Sonar signal patterns are recognized using a database of signals collected from various sources. The database is always incomplete since new signal sources continuously evolve, and the classifier, whether human or machine, is measured by how well it can recognize those signals that are different from any prior known signals.
Credit card fraud costs financial institutions millions of dollars, and significant effort is spent in trying to detect fraudulent activity. Although some fraudulent cases are available for reference, the best criterion for detecting fraud is significant deviation from normal account activity.
Each of the above cases serves as an example of the class of problems involving detection of abnormal data patterns. Abnormality detection is a subclass of pattern classification problems. The latter are concerned with the determination of which of M classes is representative of an unknown input pattern containing N elements, or features. Thus the input pattern could be the pixels of an image and the output is one of several objects; or the input pattern could contain information on a credit application and the output is the accept or reject decision. For a typical pattern classification problem, data is assumed to exist for each of the output classes, and developing a classifier is to find the optimal class boundaries in the decision space defined by the input features.
The abnormality detection problem would be a standard pattern classification problem if there existed ample examples of each of the abnormalities. However, with most practical detection problems of interest, abnormality examples are scarce or missing altogether. If standard classification techniques are applied to such biased data sets, the classifier will likely make errors in favor of the abundant example class, i.e., it will generalize erroneously. Intuitively, the desired class boundary is one that tightly defines the decision space occupied by the highly represented class. This boundary may have to take on a highly nonlinear shape, perhaps even defining disconnected regions.
There are several traditional (non-neural network) methods for boundary determination. Perhaps the simplest and easiest method is to look at all of the known n-dimensional (for an n-feature problem) data points and to take the maximum and minimum of each feature one at a time. This will determine a hypercube boundary for the data. This method is very fast to compute, but the resulting boundary encloses much empty space near its "corners," where no known data points lie.
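The per-feature minimum and maximum method may be sketched as follows, by way of example only. The function names and the data points are hypothetical.

```python
def fit_box(points):
    """Per-feature minimum and maximum over the known points."""
    lows = [min(column) for column in zip(*points)]
    highs = [max(column) for column in zip(*points)]
    return lows, highs

def inside_box(point, lows, highs):
    """A point is inside when every feature lies within its [min, max] range."""
    return all(lo <= v <= hi for v, lo, hi in zip(point, lows, highs))

# Hypothetical known points lying along a diagonal in two dimensions.
known = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
lows, highs = fit_box(known)

# (2.0, 0.0) sits in a "corner" far from every known point, yet is accepted.
corner_accepted = inside_box((2.0, 0.0), lows, highs)
```

The accepted corner point illustrates the enclosed empty space that this method leaves inside the boundary.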
Another very simple but effective method for calculating a boundary is to set a maximum threshold on the distance between a point and the nearest known point. This near neighbor threshold is like the nearest neighbor classification method, but a threshold distance is used instead of the minimum distance from one class or the other. The near neighbor method can detect rather complex boundaries and provides a consistent way to calculate whether a point is inside or outside for any number of dimensions. This method tends to be rather slow, however, because any new point must be compared with every known point. Also, every known point must be stored. The threshold must also be set to some value, and assumptions are generally made about the properties of the boundary in order to set that threshold value.
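A minimal sketch of the near neighbor threshold test follows. It assumes Euclidean distance, which the description above does not specify; the function name, the data points, and the threshold value are hypothetical.

```python
import math

def near_neighbor_inside(point, known_points, threshold):
    """Inside when the nearest known point is within the threshold distance.

    Every known point is stored and compared, which is the source of the
    storage and speed costs of the method.
    """
    nearest = min(math.dist(point, p) for p in known_points)
    return nearest <= threshold

# Hypothetical known points lying along a diagonal in two dimensions.
known = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

close_point = near_neighbor_inside((1.1, 1.0), known, threshold=0.5)
far_point = near_neighbor_inside((2.0, 0.0), known, threshold=0.5)
```

With these values the point near (1.0, 1.0) is classified as inside, while the point at (2.0, 0.0), which lies within the per-feature ranges of the known data, is classified as outside.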
A method similar to the near neighbor method is the potential function method. This method calculates the decaying potential between a test point and all of the known points in space. If the total potential is greater than some threshold potential, the point is classified as inside; otherwise it is outside. This has most of the advantages and disadvantages of the near neighbor method. It does provide a slightly different boundary surface, however, since it takes into account the density of inside points in space.
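The potential function test may be sketched as follows. The form of the decaying potential is not given above; a Gaussian decay with Euclidean distance is assumed here solely for illustration, and the `width` parameter, the threshold, and the data points are hypothetical.

```python
import math

def potential(point, known_points, width=1.0):
    """Sum of Gaussian-decaying potentials from every known point (assumed form)."""
    return sum(math.exp(-math.dist(point, p) ** 2 / (2.0 * width ** 2))
               for p in known_points)

def potential_inside(point, known_points, threshold, width=1.0):
    """Inside when the summed potential exceeds the threshold potential."""
    return potential(point, known_points, width) > threshold

# Hypothetical data: a dense cluster near the origin and one isolated point.
known = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]

in_cluster = potential_inside((0.05, 0.05), known, threshold=1.0, width=0.5)
near_isolated = potential_inside((5.5, 5.0), known, threshold=1.0, width=0.5)
```

Here the point inside the dense cluster exceeds the threshold, whereas the point 0.5 away from the isolated known point does not, although a near neighbor threshold of 0.5 would accept it. This is the density dependence noted above.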
In a data-trending approach, heuristic decision rules are established through laborious analysis of the data. Data is plotted against those features that are suspected of being detection sensitive, and thresholds are found for those features which discriminate between normal and abnormal. The thresholds are usually chosen to maximize the likelihood of detecting abnormality while still yielding an acceptable level of false alarms, i.e., falsely classifying a normal case as abnormal. Often thresholding one feature is insufficiently discriminating and multiple features must be thresholded in a decision tree scheme. Developing such decision criteria can involve weeks of data analysis. Manual generation of such detection algorithms is difficult and is likely to be suboptimal since the developer is unable to thoroughly search for the best feature set.
It would be useful, therefore, to extend the current neural network pattern classifier to the domain of highly biased data problems. Such an extension of currently available technology would limit the neural network's generalization power, thereby providing classification according to a decision boundary that tightly bounds, i.e., minimally generalizes from, the highly exemplified class.
Before the present inventive method is described, a brief description is provided of a typical feedforward neural network with which the present invention is practiced.