Technical Field
The present invention relates to optimizing network performance using dropout training, and more particularly to optimizing network performance using annealed dropout training for neural networks.
Description of the Related Art
Neural networks are computational systems based on biological neural network architecture. Neural networks may be employed in a variety of applications including, for example, document search, time series analysis, medical image diagnosis, character, speech, and image recognition, and data mining. Neural networks may include a large number of interconnected nodes, and the nodes may be separated into different layers, with the connections between the nodes being characterized by associated vector weights. Each node may include an associated function which causes the node to generate an output dependent on the signals received on each input connection and the weights of those connections.
Recently, it has been shown that neural network performance may be improved by dropout training, in which a fixed percentage of the inputs or outputs of a given node or layer in the neural network is randomly zeroed, or “dropped out,” during training. Dropout may be applied for each of one or more training sets (each including a set of inputs and corresponding expected outputs) used to tune network parameters (number of layers, number of nodes per layer, number of training iterations, learning rate, etc.). A reason for this improvement is that dropout training prevents the detectors in the network from co-adapting, and so encourages the discovery of approximately independent detectors, which in turn limits the capacity of the network and prevents overfitting.
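By way of illustration only, the random zeroing described above may be sketched as follows. The sketch is written in Python and is not taken from any particular implementation. The function name dropout, the parameters drop_rate, rng, and training, and the rescaling of surviving values by 1 / (1 - drop_rate) (commonly called inverted dropout) are assumptions introduced for this example.

```python
import random


def dropout(values, drop_rate, rng, training=True):
    """Randomly zero entries of `values` (illustrative sketch only).

    Each entry is dropped independently with probability `drop_rate`, so the
    fraction actually dropped matches `drop_rate` only on average.
    Assumption: surviving entries are rescaled by 1 / (1 - drop_rate) so the
    expected sum is unchanged; the text above does not require this.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if not training or drop_rate == 0.0:
        # No units are dropped at test time.
        return list(values)
    keep_prob = 1.0 - drop_rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


# Hypothetical layer outputs, with half of them dropped on average.
rng = random.Random(0)
layer_outputs = [1.0] * 1000
dropped = dropout(layer_outputs, drop_rate=0.5, rng=rng)
num_zeroed = sum(1 for v in dropped if v == 0.0)
print("zeroed", num_zeroed, "of", len(dropped), "outputs")
```

A fresh random mask is drawn on each call, so a different subset of outputs is zeroed for each training example or batch, which is what discourages any one detector from relying on another.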
In machine learning/training, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
The possibility of overfitting may exist because the criterion used for training the model may not be the same as the criterion used to judge the efficacy of the model. In particular, a machine learned/trained model is conventionally trained by maximizing its performance on some set of training data. However, the efficacy of a model is determined by its ability to perform well on unseen data rather than by its performance on the training data. Overfitting may occur when a model begins to “memorize” training data rather than “learning” to generalize from a trend. As an extreme example, if the number of parameters is the same as or greater than the number of observations, a simple model or learning process may be able to perfectly predict the training data simply by memorizing the training data in its entirety, but such a model will typically fail drastically when making predictions about new or unseen data, since the simple model has not learned to generalize at all.
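The extreme case described above may be illustrated with a small sketch, again in Python and offered only as an example under stated assumptions. Six hypothetical observations are generated from the trend y = 2x plus a fixed alternating error of 0.5. A polynomial with six coefficients (as many parameters as observations) is compared with a two-parameter straight line. The data, the function names lagrange_predict and fit_line, and the choice of a polynomial model are all assumptions made for illustration.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial of degree len(xs) - 1 through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: underlying trend y = 2x plus alternating error.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]

# Six-parameter model: reproduces ("memorizes") every training observation.
poly_train_error = max(
    abs(lagrange_predict(train_x, train_y, x) - y) for x, y in zip(train_x, train_y)
)

# Two-parameter model: cannot memorize the error, so it follows the trend.
slope, intercept = fit_line(train_x, train_y)

# Unseen observation just outside the training range.
test_x, test_y = 6.0, 12.0
poly_test_error = abs(lagrange_predict(train_x, train_y, test_x) - test_y)
line_test_error = abs(slope * test_x + intercept - test_y)

print("polynomial: train error", poly_train_error, "test error", poly_test_error)
print("line:       test error", line_test_error)
```

The polynomial matches the training data essentially exactly because it has fitted the alternating error along with the trend, and that fitted error dominates its prediction at the unseen point, whereas the straight line stays close to the underlying relationship.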
Conventional dropout training has been shown to improve test-time performance when there is limited data relative to the size of the model being trained. However, in data-plenty situations (which are more usual in practice), in which the size of the model and training time are the dominant constraints, conventional dropout training does not provide a practical solution to improve network performance. One reason for this is that conventional dropout training can over-constrain a network in data-plenty situations, which may result in underfitting and/or sub-optimal performance.