1. Field of the Invention
The present invention relates to improving resilience in a computational environment that is based on programs that learn, such as a neural network model. In particular, the present invention relates to introducing noise into a learning program to improve resilience to overfitting data, and to avoid getting stuck in local optima.
2. Discussion of the Related Art
Learning programs, such as neural networks, have been used to uncover hidden information inherent in data. The uncovered hidden information allows the data to be subsequently analyzed for a variety of purposes, such as classification or decision making. A neural network model is usually based on a graph consisting of nodes that are referred to as “neurons” and directed, weighted edges connecting the neurons. When implemented in a computational environment, the directed graph of the neural network model typically represents a function that is computed in the computational environment. In a typical implementation, each neuron is assigned a simple computational task (e.g., a linear transformation followed by a squashing function, such as a logistic function) and a loss function is computed over the entire neural network model. The parameters of the neural network model are typically determined (“learned”) using a method that involves minimizing the loss function. A large number of techniques have been developed to minimize the loss function. One such method is “gradient descent,” which is carried out by finding analytical gradients of the loss function with respect to the parameters and moving the current parameter values by a small amount in the direction opposite to the gradient (i.e., the direction in which the loss function decreases).
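By way of illustration only, the neuron computation and the gradient descent update described above may be sketched as follows. The sketch assumes a single logistic neuron, a mean squared error loss, and a toy data set (the logical OR function); the names `neuron`, `loss`, `gradient_step`, `train` and `DATA`, as well as the step count and learning rate, are hypothetical choices made for this example and do not come from any particular implementation.

```python
import math

def logistic(z):
    """Squashing function applied after the linear transformation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs):
    """One neuron: a linear transformation followed by a logistic function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(z)

def loss(weights, bias, data):
    """Mean squared error over all (inputs, target) pairs (assumed loss)."""
    return sum((neuron(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def gradient_step(weights, bias, data, learning_rate):
    """Move the parameters a small amount opposite to the analytical gradient."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in data:
        out = neuron(weights, bias, x)
        # Chain rule: d(loss)/dz = 2 * (out - y) * out * (1 - out) / N
        delta = 2.0 * (out - y) * out * (1.0 - out) / len(data)
        for i, xi in enumerate(x):
            grad_w[i] += delta * xi
        grad_b += delta
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    new_bias = bias - learning_rate * grad_b
    return new_weights, new_bias

# Toy data set (an assumption for this sketch): the logical OR function.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

def train(steps=2000, learning_rate=1.0):
    """Repeat the gradient descent update starting from all-zero parameters."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        weights, bias = gradient_step(weights, bias, DATA, learning_rate)
    return weights, bias
```

Calling `train()` and then comparing `loss(weights, bias, DATA)` before and after training shows the loss decreasing as the parameters are repeatedly moved against the gradient.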
One specialized neural network model, called an autoencoder, has been gaining adherents recently. In the autoencoder, the function that is to be learned is the identity function, and the loss function is a reconstruction error computed on the input values themselves. One technique achieves effective learning of a hidden structure in the data by requiring the function to be learned with fewer intermediate neurons than there are values in the input vector. The resulting neural network model may then be used in further data analysis. As an example, consider the data of a 100×100 pixel black-and-white image, which may be represented by 10000 input neurons. If the intermediate layer of the computation in a 3-layer network is constrained to having only 1000 neurons, the identity function is not trivially learnable. However, the resulting connections between the 10000 input neurons and the 1000 neurons in the hidden layer of the neural network model would represent, to some extent, the interesting structure in the data. Once the number of neurons in such an intermediate layer begins to approach 10000, the trivial identity mapping becomes a more likely local optimum to be found by the training process. The trivial identity mapping, of course, would fail to discover any hidden structure of the data.
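By way of illustration only, the effect of constraining the intermediate layer may be sketched on a much smaller scale than the 10000-neuron example above. The sketch assumes a linear autoencoder without bias terms, trained by per-example gradient descent on a squared reconstruction error; the 4-value toy inputs, in which the last two values repeat the first two, and all names (`train_autoencoder`, `reconstruction_error`, `DATA`) are hypothetical and chosen only for this example.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def reconstruct(enc, dec, x):
    """Encode x into the hidden layer, then decode back to input size."""
    return matvec(dec, matvec(enc, x))

def reconstruction_error(enc, dec, data):
    """Mean squared reconstruction error on the input values themselves."""
    total = 0.0
    for x in data:
        total += sum((r - xi) ** 2 for r, xi in zip(reconstruct(enc, dec, x), x))
    return total / len(data)

def train_autoencoder(data, n_hidden, steps=1000, learning_rate=0.02, seed=0):
    """Gradient descent on the reconstruction error of a linear autoencoder."""
    rng = random.Random(seed)
    n_in = len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    for _ in range(steps):
        for x in data:
            h = matvec(enc, x)
            resid = [r - xi for r, xi in zip(matvec(dec, h), x)]
            # Gradient with respect to the hidden values, computed before
            # the decoder weights are changed.
            grad_h = [2.0 * sum(resid[i] * dec[i][j] for i in range(n_in))
                      for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    dec[i][j] -= learning_rate * 2.0 * resid[i] * h[j]
            for j in range(n_hidden):
                for k in range(n_in):
                    enc[j][k] -= learning_rate * grad_h[j] * x[k]
    return enc, dec

# Toy inputs (an assumption): 4 values each, but only 2 independent ones,
# because the last two values repeat the first two.
DATA = [[0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0],
        [1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
```

With `n_hidden=2`, fewer hidden neurons than input values, the network cannot simply copy its inputs, yet it can still reconstruct this data because the hidden layer only needs to capture the two independent values. With `n_hidden=1`, the bottleneck is too narrow for this data and the reconstruction error remains higher.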
An interesting technique that allows a large number of intermediate neurons to be used is the “denoising autoencoder.” In a denoising autoencoder, the input values are distorted, but the network is still evaluated based on its ability to reconstruct the original, undistorted data. As a result, the identity function is usually no longer a good local optimum, which allows a larger hidden layer (i.e., one with more neurons) to be available to learn more relationships inherent in the data.
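By way of illustration only, the evaluation used in a denoising autoencoder may be sketched as follows. The form of the distortion is not specified above; this sketch assumes masking noise, in which each input value is set to zero with a given probability. The names `corrupt`, `denoising_loss` and `identity`, and the parameter `drop_prob`, are hypothetical.

```python
import random

def corrupt(x, drop_prob, rng):
    """Masking noise (assumed): zero each input value with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def denoising_loss(model, data, drop_prob, rng):
    """Mean squared error between model(distorted x) and the ORIGINAL x."""
    total = 0.0
    for x in data:
        out = model(corrupt(x, drop_prob, rng))
        total += sum((o - xi) ** 2 for o, xi in zip(out, x))
    return total / len(data)

def identity(x):
    """The trivial mapping that simply copies its input."""
    return list(x)
```

Because `denoising_loss` compares the output against the undistorted values, the `identity` mapping incurs a loss whenever any non-zero value has been masked, whereas with `drop_prob` set to zero its loss is exactly zero. This is one way to see why distorting the inputs tends to steer training away from the trivial identity mapping.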