A feedforward, artificial neural network uses layers of non-linear “hidden” units between its inputs and its outputs. Each unit has a weight that is determined during learning, which can be referred to as a training stage. In the training stage, a training set of data (a training set of inputs each having a known output) is processed by the neural network. Thus, it is intended that the neural network learn how to provide an output for new input data by generalizing the information it learns in the training stage from the training data. Generally, once learning is complete, a validation set is processed by the neural network to validate the results of learning. Finally, test data (i.e., data for which generating an output is desired) can be processed by a validated neural network.
The purpose of learning is to adapt the weights on the incoming connections of hidden units to learn feature detectors that enable it to predict the correct output when given an input vector. If the relationship between the input and the correct output is complicated and the network has enough hidden units to model it accurately, there will typically be many different settings of the weights that can model the training set almost perfectly, especially if there is only a limited amount of labeled training data. Each of these weight vectors will make different predictions on held-out test data and almost all of them will do worse on the test data than on the training data because the feature detectors have been tuned to work well together on the training data but not on the test data.
This occurs because of the overfitting problem, which occurs when the neural network simply memorizes the training data that it is provided, rather than generalizing well to new examples. Generally, the overfitting problem is increasingly likely to occur as the complexity of the neural network increases.
Overfitting can be mitigated by providing the neural network with more training data. However, the collection of training data is a laborious and expensive task.
One proposed approach to reduce the error on the test set is to average the predictions produced by many separate trained networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing.
It is an object of the following to obviate or mitigate at least one of the foregoing issues.