Machine-learned models such as artificial neural networks typically include a number of parameters. In various machine learning techniques, the final values of the parameters are learned through an iterative training process which updates the parameters at each of a plurality of training iterations. For example, at each iteration, the performance of the model relative to a set (e.g., a “minibatch”) of training data is evaluated using a loss function. The parameters can then be updated based on the performance of the model as evaluated by the loss function.
The degree or amount by which the parameters of the model are updated at each iteration can be controlled by or otherwise performed in accordance with an effective learning rate. For example, at a given iteration, a relatively smaller effective learning rate will typically result in relatively smaller changes to the values of the parameters, while a relatively larger effective learning rate will typically result in relatively larger changes to the values of the parameters.
Stochastic gradient descent (SGD) is one of the dominant methods used today to train deep neural networks. This method iteratively updates the parameters of a model by moving them in the direction of the negative gradient of the loss evaluated on a minibatch of training data.
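The update described above can be illustrated with a minimal sketch. It assumes a linear model with a mean squared error loss on synthetic, noiseless data; the function names `sgd_step` and `mse_gradient`, the minibatch size, and the learning rate are hypothetical choices made for illustration only and are not taken from the source.

```python
import numpy as np

def sgd_step(params, grad, learning_rate):
    """Move the parameters one step along the negative minibatch gradient."""
    return params - learning_rate * grad

def mse_gradient(params, x_batch, y_batch):
    """Gradient of the mean squared error of a linear model on one minibatch."""
    residual = x_batch @ params - y_batch
    return 2.0 * x_batch.T @ residual / len(y_batch)

# Synthetic, noiseless regression data (an assumption for this sketch).
rng = np.random.default_rng(0)
true_params = np.array([2.0, -3.0])
x = rng.normal(size=(256, 2))
y = x @ true_params

params = np.zeros(2)
for step in range(500):
    # Sample a minibatch, evaluate the gradient on it, and update.
    idx = rng.choice(len(x), size=32, replace=False)
    grad = mse_gradient(params, x[idx], y[idx])
    params = sgd_step(params, grad, learning_rate=0.05)
```

In this sketch every coordinate shares the same fixed learning rate, which is the behavior that the per-coordinate adaptive variants modify.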
Variants of SGD that scale coordinates of the gradient by square roots of some form of averaging of the squared coordinates in the past gradients have been particularly successful, because they automatically adjust the effective learning rate on a per-feature basis. The first popular algorithm in this line of research is Adagrad, which can achieve significantly better performance compared to vanilla SGD when the gradients are sparse or, more generally, small.
In particular, Adagrad uses a sum of the squares of all the past gradients in the update, thereby forcing the effective learning rate at each iteration to be less than or equal to the effective learning rate used at the previous iteration. Although Adagrad works well for sparse settings, its performance has been observed to deteriorate in settings where the loss functions are non-convex and the gradients are dense, because the effective learning rate decays rapidly in these settings. Thus, Adagrad struggles in non-convex settings because its effective learning rate is never permitted to increase and, therefore, the gradient descent may become “stuck” at a local, but not global, optimum. These problems are exacerbated in the high-dimensional problems arising in deep learning.
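The monotone decay of the effective learning rate can be seen in a short sketch. It assumes the commonly used per-coordinate form in which a base learning rate is divided by the square root of the accumulated squared gradients; the helper name `adagrad_effective_rates`, the values of `base_lr` and `eps`, and the random dense gradients are illustrative assumptions rather than details from the source.

```python
import numpy as np

def adagrad_effective_rates(grads, base_lr=0.1, eps=1e-8):
    """Per-coordinate effective learning rate after each gradient is observed."""
    accum = np.zeros_like(grads[0])
    rates = []
    for g in grads:
        accum += g ** 2  # running sum of squares only ever grows
        rates.append(base_lr / (np.sqrt(accum) + eps))
    return np.array(rates)

# Dense gradients: every coordinate is nonzero at every iteration.
rng = np.random.default_rng(0)
dense_grads = rng.normal(size=(1000, 3))
rates = adagrad_effective_rates(dense_grads)

# True when no coordinate's rate ever increases between iterations.
never_increases = bool(np.all(np.diff(rates, axis=0) <= 0))
```

Because the accumulator is a sum of non-negative terms, the rate for each coordinate can only shrink or stay the same, and with dense gradients it shrinks at every iteration.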
To tackle this issue, several other adaptive optimization techniques, such as RMSprop, Adam, Adadelta, and Nadam, have been proposed which mitigate the rapid decay of the effective learning rate through use of exponential moving averages of squared past gradients, essentially limiting the reliance of the update to only the past few gradients. While these algorithms have been successfully employed in several practical applications, they have also been observed to fail to converge in certain settings, such as sparse settings. In particular, it has been observed that in these settings some minibatches provide large gradients, but only quite rarely, and while these large gradients are quite informative, their influence dies out rather quickly due to the exponential averaging, leading to poor convergence. Thus, Adam and other adaptive techniques that employ multiplicative updates to control the learning rate can struggle in sparse settings in which small gradients undesirably dominate the moving average.
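How quickly a rare large gradient is forgotten can be sketched as follows. The example assumes a plain exponential moving average of squared gradients with decay factor `beta` and no bias correction, applied to a single scalar coordinate; the helper name `ema_of_squares`, the value `beta=0.9`, and the gradient sequence are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def ema_of_squares(grads, beta=0.9):
    """Exponential moving average of squared gradients (no bias correction)."""
    v = 0.0
    history = []
    for g in grads:
        v = beta * v + (1.0 - beta) * g ** 2
        history.append(v)
    return np.array(history)

# Mostly small gradients with one rare, large, informative gradient.
grads = np.full(60, 0.01)
grads[10] = 10.0

ema = ema_of_squares(grads)          # forgets the spike geometrically
running_sum = np.cumsum(grads ** 2)  # Adagrad-style sum never forgets it

# Fraction of the spike's influence left in the average 49 steps later.
remaining = ema[59] / ema[10]
```

Each later step multiplies the spike's contribution by `beta`, so after k steps only a factor of `beta ** k` remains and the small gradients again dominate the average, whereas the running sum retains the spike indefinitely.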