1. Field of the Invention
The present invention relates to improving computational and statistical efficiency in a predictive model. In particular, the present invention relates to improving computational efficiency in a predictive model that is optimized using gradient-based or higher-order derivative-based methods, such as stochastic gradient descent or optimization techniques based on Newton steps.
2. Discussion of the Related Art
In machine learning, a predictive model is a computational model that learns a function (“target function”) from example input and output values. One type of predictive model applies a gradient descent optimization technique over an objective function. Typically, the optimization procedure involves iteratively executing the model, and then differentiating the model (i.e., calculating the first derivative of the objective function with respect to each model parameter) to adapt the values of the model parameters to minimize or maximize the objective function. The complexity of such a computational task is typically at least proportional to the size of the model. Therefore, it is desirable to have a model that is smaller, and which requires fewer computational operations.
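By way of illustration only, the iterative procedure described above may be sketched as follows. The sketch assumes a two-parameter linear model and a mean squared error objective; the function name `gradient_descent`, the learning rate, the iteration count, and the example data are hypothetical choices made for this illustration and are not taken from any particular prior art system.

```python
# Illustrative sketch: fit y = w*x + b to example input/output pairs by
# minimizing the mean squared error (the objective function assumed here).

def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    w, b = 0.0, 0.0  # model parameters, arbitrary starting values
    n = len(xs)
    for _ in range(iterations):
        # Execute the model on every example input.
        preds = [w * x + b for x in xs]
        # Differentiate the objective with respect to each model parameter.
        grad_w = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / n
        # Adapt each parameter against its gradient to reduce the objective.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical example data generated from the target function y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
```

In this sketch, each iteration performs work proportional to the number of model parameters multiplied by the number of examples, which is consistent with the observation that the cost of optimization grows at least in proportion to the size of the model.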
A predictive model may be implemented, for example, in a neural network. A neural network model is usually based on a graph consisting of nodes (referred to as “neurons”), and directed, weighted edges that connect the neurons. The directed graph typically represents the function that is to be computed in the computational model. In a typical implementation, each neuron is assigned a simple computational task (e.g., a linear transformation followed by a squashing function, such as a logistic function) and the loss function (e.g., an additive inverse of the objective function) is computed over the entire neural network model. The parameters of the neural network model are typically determined (“learned”) using a method that minimizes the loss function. Stochastic gradient descent is a method that is often used to achieve the minimization. In stochastic gradient descent, optimization is achieved iteratively by (a) finding analytical gradients of the loss function with respect to the model parameters, typically evaluated on one or a small number of randomly selected examples, and (b) perturbing or moving the parameter values by a small amount in the opposite direction of the gradient, until the loss function converges to a minimum.
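By way of illustration only, steps (a) and (b) may be sketched for a single neuron that applies a linear transformation followed by a logistic function. The sketch assumes a cross-entropy loss, for which the derivative of the loss with respect to the neuron's pre-activation value is the output minus the target. The names `train_neuron` and `predict`, the learning rate, the step count, and the logical-OR example data are hypothetical choices made for this illustration.

```python
import math
import random


def logistic(z):
    # Squashing function mapping any real value into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    # Linear transformation followed by the logistic squashing function.
    return logistic(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(examples, learning_rate=0.5, steps=5000, seed=0):
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(steps):
        # Draw one example at random (the "stochastic" part).
        x, target = rng.choice(examples)
        out = predict(weights, bias, x)
        # (a) Analytical gradient of the assumed cross-entropy loss with
        # respect to the pre-activation value: out - target.
        delta = out - target
        # (b) Move each parameter a small amount against its gradient.
        weights = [w - learning_rate * delta * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * delta
    return weights, bias


# Hypothetical example data: the logical OR of two binary inputs.
examples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
weights, bias = train_neuron(examples)
```

After training, `predict(weights, bias, x)` may be compared against a threshold of 0.5 to classify an input. A full neural network repeats the same update for every weighted edge in the graph, so the work per step again scales with the size of the model.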