Second-order non-linear optimization techniques have been extensively explored for problems involving pathological curvature, such as deep neural network (DNN) training. A second-order technique known as Hessian-free (HF) optimization has been demonstrated in connection with DNNs on various image recognition tasks, and an HF optimization technique has also been applied with DNNs for speech recognition tasks. Alternatively, superlinear methods, including quasi-Newton methods (e.g., Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), Davidon-Fletcher-Powell (DFP), and Symmetric Rank 1 (SR1)), have been used in connection with DNN training.
Second-order methods for sequence training of DNNs can provide, for example, a 10-20% relative improvement in word error rate (WER) over a cross-entropy (CE) trained DNN. Because sequence training uses information from time-sequential lattices corresponding to utterances, sequence training is performed using utterance randomization rather than frame randomization. For mini-batch stochastic gradient descent (SGD), which is often used for CE training, frame randomization has in some cases been shown to perform better than utterance randomization. However, because sequence training is accomplished at the utterance level, second-order and superlinear methods typically perform better than SGD, as these methods compute a gradient over a large batch of utterances, whereas utterance mini-batch SGD computes it over only a small batch.
HF optimization techniques for sequence training can be slow, requiring, for example, about 3 weeks to train a 300-hour Switchboard task using 64 parallel machines. There are at least two reasons why training is slow. First, within each HF iteration, a large number of Krylov subspace iterations may be required to reach a solution that approximates the Hessian (i.e., a solution of the normal system of equations). Second, the same fixed amount of data is used in every HF iteration for both the gradient and the Krylov subspace iteration computations.
Accordingly, there is a need for algorithmic strategies that reduce the amount of time spent in both the gradient and the Krylov subspace computations.
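
By way of illustration only, the inner Krylov subspace solve referred to above can be sketched with the conjugate gradient method, which is one common Krylov subspace method; the text above does not state which method is used, so this choice is an assumption. The sketch solves the system H d = -g using only Hessian-vector products. The 2x2 matrix `H_TOY`, the gradient `g`, and the helper names `hess_vec` and `conjugate_gradient` are hypothetical and chosen only to keep the example small. In actual DNN training, each Hessian-vector product would require a pass over training data, which is why the number of Krylov iterations and the amount of data used per iteration both affect training time.

```python
# Minimal sketch (assumption: conjugate gradient as the Krylov subspace method).
# H_TOY, g, hess_vec and conjugate_gradient are hypothetical illustration names.

H_TOY = [[4.0, 1.0],
         [1.0, 3.0]]  # small symmetric positive-definite stand-in for the Hessian

calls = {"hess_vec": 0}  # counts the Hessian-vector products that are computed


def hess_vec(v):
    """Return H_TOY @ v. In DNN training this product would be computed
    from data without ever forming the Hessian explicitly."""
    calls["hess_vec"] += 1
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H_TOY]


def conjugate_gradient(hv, b, max_iters=50, tol=1e-10):
    """Approximately solve H d = b using only the product function hv."""
    d = [0.0] * len(b)
    r = list(b)  # residual b - H d, with d initialised to zero
    p = list(r)  # first search direction
    rs_old = sum(x * x for x in r)
    iters = 0
    for _ in range(max_iters):
        if rs_old <= tol:  # squared residual norm is small enough
            break
        hp = hv(p)
        alpha = rs_old / sum(x * y for x, y in zip(p, hp))
        d = [x + alpha * y for x, y in zip(d, p)]
        r = [x - alpha * y for x, y in zip(r, hp)]
        rs_new = sum(x * x for x in r)
        p = [x + (rs_new / rs_old) * y for x, y in zip(r, p)]
        rs_old = rs_new
        iters += 1
    return d, iters


g = [-1.0, -2.0]                 # hypothetical gradient
rhs = [-x for x in g]            # right-hand side of H d = -g
d, iters = conjugate_gradient(hess_vec, rhs)
print("search direction:", d)
print("Krylov iterations:", iters, "Hessian-vector products:", calls["hess_vec"])
```

In this sketch each Krylov iteration costs exactly one Hessian-vector product, so reducing either the number of iterations or the cost of each product reduces the time spent in the inner solve.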