A feedforward, artificial neural network uses layers of non-linear “hidden” units between its inputs and its outputs. Each unit has a weight vector that is determined during learning, which can be referred to as the training stage. In the training stage, the feature vectors are first initialized by a pre-determined random or pseudo-random algorithm. After that, a training set of data (a training set of inputs each having a known output) is used by a learning algorithm to adjust the feature vectors in the neural network. It is intended that the neural network learn how to provide an output for new input data by generalizing the information it learns in the training stage from the training data. Generally, during different stages of training, a validation set is processed by the neural network to validate the results of training and to select hyper-parameters used for training. Finally, test data (i.e., data for which generating an output is desired) can be processed by a validated neural network. The purpose of training is to adapt the weights on the incoming connections of hidden units to learn feature detectors that enable it to predict the correct output when given an input vector. If the relationship between the input and the correct output is complicated and the network has enough hidden units to model it accurately, there will typically be many different settings of the weights that can model the training set almost perfectly, especially if there is only a limited amount of labeled training data. Each of these weight vectors will make different predictions on held-out test data and almost all of them will do worse on the test data than on the training data because the feature detectors have been tuned to work well together on the training data but not on the test data.
This is caused by the overfitting problem, which occurs when the neural network is encouraged to memorize the training data that it is provided, rather than learning the relationship between the output and the input vector that can generalize to new examples. Generally, the overfitting problem is increasingly likely to occur as the complexity of the neural network increases.
Instead of relying on a single model comprising a single neural network to generate an output, an ensemble of models, comprising a plurality of neural networks can be provided to generate an output. The average prediction (i.e., the average of the outputs) of the plurality of neural networks in the ensemble tends to outperform the prediction of individual neural networks in the ensemble. Because different models in an ensemble tend to make different errors on test data, averaging models trained with different random initializations, different architectures, different hyper-parameter settings and different subsets of data usually yield improved accuracy and often yield a substantial improvement. Most machine learning challenges, such as the Netflix™ prize, have been won using ensemble approaches. These improvements may come at the expense of the additional engineering effort and additional computational resources that are required to train independent models.