Machine learning automates the creation, from historical data, of models that can then be used to make predictions. A class of models called deep neural networks (DNNs) has become popular over the last few years, and there is now a menagerie of DNN types. Examples of DNNs include feedforward, convolutional, recurrent, long short-term memory (LSTM), and Neural Turing Machine (NTM) networks. As is also the case for most other types of models, DNNs are sufficiently expressive that they can easily overfit data, i.e., model some of the uninformative noise in the input data in addition to the informative signal.
One recent technique for mitigating overfitting in neural networks is a method known in industry as “Dropout.” Two papers that describe Dropout are Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research 15 (2014) 1929-1958; and Geoffrey E. Hinton et al., “System and Method for Addressing Overfitting in a Neural Network,” Patent Cooperation Treaty Publication WO2014105866 A1, Jul. 3, 2014.
To understand Dropout, one should first review the structure of a neuron within a typical neural network. A neural network comprises a graph, or hypergraph, of neurons Ni. This graph includes a set of input-stage neurons (input neurons), a set of output-stage neurons (output neurons), and a set of intermediate neurons between the input-stage and output-stage neurons. The intermediate neurons are typically referred to as hidden neurons, as they are interior neurons shielded from the input and output periphery of the neural network. A collection of inputs and a function, fi, are associated with each neuron. Typically, each fi is a non-linear function of the dot product of a set of weights, Wi,j, with the values, Vi,j, of the inputs. For example, a sigmoid function (such as tanh) can be used for each of the non-linear functions fi, leading to fi=tanh(Σj(Wi,j*Vi,j)).
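The per-neuron computation described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from any of the cited works; the function name and the example weight and value lists are illustrative assumptions.

```python
import math

def neuron_output(weights, values):
    """Compute fi = tanh(sum_j Wi,j * Vi,j) for a single neuron.

    `weights` and `values` are parallel lists standing in for the
    weights Wi,j and input values Vi,j of one neuron; a real network
    would store these per neuron for every neuron in the graph.
    """
    dot = sum(w * v for w, v in zip(weights, values))  # dot product
    return math.tanh(dot)                              # non-linearity fi

# A neuron with three inputs; dot product is 0.5*1.0 - 1.0*0.5 + 0.25*2.0 = 0.5
out = neuron_output([0.5, -1.0, 0.25], [1.0, 0.5, 2.0])
```

Because tanh saturates, the output always lies strictly between -1 and 1 regardless of the magnitude of the dot product.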
Supervised training of a neural network determines each weight coefficient Wi,j, usually by presenting a series of training pairs, (Xk, Yk), to the neural network. Each input x∈Xk is supplied to the primary inputs, and the corresponding target y∈Yk is compared against the primary outputs. Initially, a disparity between the target y value and the value generated by the network will likely exist. This disparity between y and the value produced by the network being trained is used to drive techniques, such as backpropagation, stochastic gradient descent, and the like, to update the weight coefficients Wi,j.
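The disparity-driven update can be sketched for the smallest possible case, a single tanh neuron trained by stochastic gradient descent on one (x, y) pair. The squared-error loss, the learning rate, and all names here are illustrative assumptions, not details taken from the cited references.

```python
import math

def train_step(weights, x, y, lr=0.1):
    """One SGD update for a single tanh neuron under squared-error loss.

    The disparity (pred - y) between the produced value and the target
    drives the update of the weight coefficients, as described above.
    """
    pred = math.tanh(sum(w * v for w, v in zip(weights, x)))
    disparity = pred - y
    # Chain rule: d(0.5*disparity^2)/d(pre-activation) = disparity * (1 - tanh^2)
    grad_pre = disparity * (1.0 - pred * pred)
    return [w - lr * grad_pre * v for w, v in zip(weights, x)]

weights = [0.0, 0.0]
for _ in range(200):  # repeatedly present the same training datum
    weights = train_step(weights, x=[1.0, -1.0], y=0.5)
```

After enough presentations the produced value converges toward the target y, at which point the disparity, and hence the weight update, vanishes.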
Dropout is a modification of the training procedure in which a newly selected, random fraction α of the hidden neurons is eliminated from the neural network (i.e., a fraction of the interior neuron output values Vi,j are temporarily set to 0) each time a training datum is presented to the network to update the weight coefficients Wi,j. Typically, α is 0.5 in practice. Since any value multiplied by zero is zero, and the magnitude of any weight Wi,j that is to be multiplied by a zeroed value is therefore irrelevant, the dot product is scaled up during Dropout training to compensate. For example, suppose that the sum of the weights for neuron q is Wq=Σinputs j of q(Wq,j), and that, for the current training datum, the sum of the weights for the zeroed values among neuron q's inputs is Zq. Then, temporarily multiplying the dot product Σj(Wq,j*Vq,j) by (Wq/(Wq−Zq)) in the computation of fq compensates for the zeroing of some of the values, effectively treating the zeroed values as preserving the average behavior of the values that were not zeroed.
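The zeroing and compensation steps for a single neuron q can be sketched as follows. This is a minimal sketch of the weight-sum rescaling described above, not the implementation from the cited papers (which are commonly summarized with a 1/(1−α) rescaling instead); the `dropped` mask parameter is a hypothetical addition for determinism, and the sketch assumes Wq − Zq is nonzero.

```python
import math
import random

def dropout_neuron(weights, values, alpha=0.5, dropped=None):
    """Forward pass of neuron q with Dropout's weight-sum compensation.

    Each input value is zeroed with probability alpha (or per the explicit
    `dropped` mask). The dot product is then multiplied by Wq / (Wq - Zq),
    where Wq is the sum of all of q's weights and Zq is the sum of the
    weights attached to the zeroed inputs.
    """
    if dropped is None:
        dropped = [random.random() < alpha for _ in values]
    w_q = sum(weights)                                    # Wq
    z_q = sum(w for w, d in zip(weights, dropped) if d)   # Zq
    dot = sum(w * (0.0 if d else v)                       # zeroed values
              for w, v, d in zip(weights, values, dropped))
    return math.tanh(dot * w_q / (w_q - z_q))             # compensated fq
```

When nothing is dropped, Zq is 0 and the scale factor Wq/(Wq−Zq) reduces to 1, recovering the ordinary forward pass.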
Existing Dropout techniques aim to avoid overfitting, e.g., by performing a type of bagging. See Leo Breiman, “Bagging Predictors,” Machine Learning 24 (2): 123-140, 1996. That is, Dropout can be seen as averaging 2^n different neural networks, or sub-networks, where the exponent “n” is the number of weights in the network being trained, with every sub-network including a subset of the weights. Essentially, the Dropout scheme incorporates as many models as there are elements in the powerset of the set of weights in the original network; hence, 2^n models are incorporated. Averaging a collection of weak models produces a stronger model than any single member of the collection, and the quality of prediction increases with the number of models contributing to the collective decision.