Artificial neural networks (ANNs) are typically designed and constructed as a multilayer structure having one or more hidden layers of artificial neurons (ANs) and an output layer of ANs, as shown in FIG. 1. Most commonly, each layer consists of a number of ANs, and each AN comprises a node and a group of weighted input receptors or synapses. The weighted receptors couple the outputs of lower level ANs to the nodes of the next higher layer through weighting elements that modify the amplitude of the input signal to each receptor. The node sums the weighted receptor inputs and applies the sum to a monotonically increasing nonlinear element that produces the output signal of the node. Thus, the number of weighted receptors in any given layer may be equal to the product of the number of ANs (or nodes) in that layer and the number of nodes in the next lower layer. The total number of weighting elements may be as large as $KN^2$, where K is the number of layers and N is the average number of nodes per layer. A modest ANN that accommodates an input vector with 100 component signals could have as many as 10,000 weighting elements per layer. This would not be an excessively large network compared to networks reported to have $6.5 \times 10^6$ weights with 2600 input parameters.
A central problem in the design of neural networks is minimizing the complexity of the network so that the resulting ANN complexity is consistent with the training data available. Complexity management translates into minimizing the number of weights. It is well known that, without such network "pruning," a mismatch between the number of weights used in the network and the "degrees of freedom" (complexity) will result in sub-optimal performance. Having too many weights (high complexity) leads to overfitting and hence poor generalization. Conversely, if too few weights are used (insufficient complexity), the network's ability to learn will be impaired.
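The weight counts quoted above can be checked with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, every layer is assumed to be fully connected to the layer below it, and bias terms are ignored.

```python
def weights_per_layer(nodes_in_layer: int, nodes_in_lower_layer: int) -> int:
    """Fully connected layer: one weighted receptor per (node, lower-level node) pair."""
    return nodes_in_layer * nodes_in_lower_layer


def total_weights(layer_sizes: list[int]) -> int:
    """Sum the weights over consecutive layer pairs.

    layer_sizes[0] is the size of the input vector; the remaining entries
    are the node counts of the successive layers.
    """
    return sum(
        weights_per_layer(upper, lower)
        for lower, upper in zip(layer_sizes, layer_sizes[1:])
    )


# 100 input components feeding a layer of 100 nodes.
print(weights_per_layer(100, 100))          # 10000
# K = 3 layers of N = 100 nodes each on a 100-component input: K * N**2 weights.
print(total_weights([100, 100, 100, 100]))  # 30000
```

With K = 3 and N = 100 the total matches the $KN^2$ bound exactly because every layer, including the input, has the same width.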
Generalization is optimized by trading off network complexity and training error. Unfortunately, no simple method for determining the required degree of complexity, prior to a trial training session, is available.
Consequently, even when an ANN is trained and found to function adequately on the available training data, it is not apparent that the complexity of the ANN is commensurate with the objective. Elimination of weights and/or ANs, followed by retraining, suggests itself. However, it is not clear how such pruning is to be done. In other words, an efficient, rational pruning method is needed as a guide to reducing the complexity of a trained ANN.
Hertz et al. have proposed a pruning method based on removing the weights having the smallest magnitude (Hertz, J., Krogh, A., and Palmer, R. G., "Introduction to the Theory of Neural Computation," Addison Wesley, 1991). Unfortunately, this idea often leads to the elimination of the wrong weights in high-order ANNs because the combined effect of low magnitude weights may be essential for low error.
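The weakness of magnitude-based pruning can be illustrated with a toy example. The Python sketch below is not taken from Hertz et al.; the function names and the numerical values are hypothetical, chosen so that five small weights jointly cancel the contribution of one large weight at a single node.

```python
def prune_smallest(weights: list[float], n_prune: int) -> list[float]:
    """Return a copy of `weights` with the n_prune smallest-magnitude entries zeroed."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def node_sum(weights: list[float], inputs: list[float]) -> float:
    """Weighted sum formed at a node, before the nonlinear element is applied."""
    return sum(w * x for w, x in zip(weights, inputs))


# One large weight and five small ones (illustrative values only).
weights = [2.5, -0.125, -0.125, -0.125, -0.125, -0.125]
inputs = [1.0, 4.0, 4.0, 4.0, 4.0, 4.0]

print(node_sum(weights, inputs))                     # 0.0
print(node_sum(prune_smallest(weights, 5), inputs))  # 2.5
```

Each small weight looks negligible on its own, yet removing all five shifts the node sum from 0.0 to 2.5, which is the kind of combined effect the text describes.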
A method for reducing ANN size by selectively deleting weights, applicable to large scale networks, has been described by LeCun et al. (LeCun, Y., Denker, J. S., and Solla, S. A. (1990), "Optimal Brain Damage", Neural Information Processing Systems, Vol. 2, Touretzky, D. S. (ed.), pp. 598-605, Morgan-Kaufmann). The method is based on a theoretical measure that uses the second derivative of the objective function with respect to the weighting parameters to compute the saliencies. The saliency of a parameter is defined as the change in the objective function caused by deleting that parameter. Because it is prohibitively laborious to evaluate the saliency directly by deleting each parameter and re-evaluating the objective function, an analytical prediction of saliency is used, based on a Taylor series expansion of the objective error function. A perturbation of the weight vector w by $\delta w$ causes a perturbation in the value of the error objective function or metric, E, corresponding to $E + \delta E$ where ##EQU1## and $\delta w_k$, an element of $\delta w$, is the perturbation introduced into the weight $w_k$ of vector w.
By assuming that only the diagonal terms (i=j) are significant, LeCun et al. arrive at the simplified saliency measure ##EQU2## Thus, by analytically determining the values of $h_{ii} = \partial^2 E / \partial w_i^2$, the effect of any change in weights, $\delta w_i$, on the objective, $\delta E$, may be estimated from the known activation function characteristics of the ANs used, and from the topology of the ANN.
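The two expressions referenced above are not reproduced in this text. As a hedged reconstruction, and not the original equations themselves, the standard second-order Taylor expansion used in the Optimal Brain Damage paper takes the following form. The gradient symbol $g_i$ is notation assumed here for illustration; $h_{ij}$ and $\delta w_i$ follow the surrounding text.

```latex
% General expansion of the change in the objective for a perturbation \delta w:
\delta E = \sum_i g_i \, \delta w_i
         + \frac{1}{2} \sum_i h_{ii} \, \delta w_i^2
         + \frac{1}{2} \sum_{i \neq j} h_{ij} \, \delta w_i \, \delta w_j
         + O\!\left(\lVert \delta w \rVert^3\right),
\qquad
g_i = \frac{\partial E}{\partial w_i},
\quad
h_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j}

% Keeping only the diagonal second-order terms:
\delta E \approx \frac{1}{2} \sum_i h_{ii} \, \delta w_i^2

% Saliency of deleting weight w_k, i.e. setting \delta w_k = -w_k:
s_k = \frac{1}{2} \, h_{kk} \, w_k^2
```

In the LeCun et al. derivation, the first-order terms are dropped because the network is assumed to be trained to a local minimum of E, and the cubic and higher terms are neglected. The text above names only the diagonal assumption.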
The simplifying assumption made by LeCun et al. may lead to sub-optimal results if the off-diagonal terms ($h_{ij}$, $j \neq i$) are significant.
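A small numerical sketch shows how far the diagonal estimate can drift when the off-diagonal terms are large. The Hessian and weight values below are hypothetical. The error surface is assumed to be exactly quadratic around a trained minimum, so the second-order expression is the true change in error.

```python
def quadratic_change(hessian: list[list[float]], dw: list[float]) -> float:
    """Second-order change in error using every Hessian term h[i][j]."""
    n = len(dw)
    return 0.5 * sum(
        hessian[i][j] * dw[i] * dw[j] for i in range(n) for j in range(n)
    )


def diagonal_change(hessian: list[list[float]], dw: list[float]) -> float:
    """Same estimate with the off-diagonal terms (i != j) discarded."""
    return 0.5 * sum(hessian[i][i] * dw[i] ** 2 for i in range(len(dw)))


# Hypothetical two-weight example with strongly coupled weights.
hessian = [[2.0, 1.5],
           [1.5, 2.0]]
weights = [1.0, -1.0]
dw = [-w for w in weights]  # deleting both weights sets each one to zero

print(quadratic_change(hessian, dw))  # 0.5
print(diagonal_change(hessian, dw))   # 2.0
```

Here the diagonal estimate overstates the cost of deleting both weights by a factor of four. For a perturbation of a single weight the two functions agree, since the cross terms vanish; the discrepancy appears only when several weights change together.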
Another approach requires that statistical tests be performed on the weights during training. If a weight is not "statistically significant from zero," then that weight is to be eliminated or reduced in value (decayed) (Hertz et al., op. cit.). Unfortunately, the statistical tests are performed during training, and it is unclear how the statistics of a weight during training relate to the final trained network.
For example, in a recent machine learning competition based on a set of benchmark problems (Thrun, S. B., and 23 co-authors (1991), "The MONK's Problems--A Performance Comparison of Different Learning Algorithms", CMU-CS-91-197, Carnegie-Mellon Univ., Dept. of Comp. Science, Tech. Rep.), the most successful method was back propagation using weight decay, which yielded a network with 58 weights for one MONK's problem. The method of the present invention requires only 14 weights for the same performance and, on two other MONK's problems, resulted in 62% and 90% reductions in the number of weights.