Deep neural networks (DNNs) are gaining acceptance in automatic speech recognition (ASR) by allowing performance improvements previously unseen in state-of-the-art systems. However, new challenges arise from using DNNs in ASR. Finding the best procedure to train DNNs is an active area of research that is rendered more challenging by the availability of ever more training data.
One component of the DNN training procedure is sequence training (ST), in which the network parameters are optimized under a sequence classification criterion such as Minimum Phone Error (MPE). MPE training of DNNs is an effective technique for reducing word error rate (WER) on ASR tasks. This training is often carried out using a Hessian-free (HF) quasi-Newton approach, although other methods such as stochastic gradient descent (SGD) have also been applied successfully. HF sequence training (HFST) uses a cross-entropy (CE) trained DNN as its starting point and is run until convergence, which is usually computationally costly.
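To make the idea behind HF optimization concrete, the following is a minimal sketch of one generic Hessian-free update, not the training recipe used in this work. It assumes a toy least-squares objective in place of the MPE criterion, approximates Hessian-vector products by finite differences of the gradient (DNN implementations typically use exact Gauss-Newton products instead), and uses a fixed damping term. All function names (`hessian_vector_product`, `conjugate_gradient`, `hf_step`) and parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-6):
    """Approximate H @ v with a central finite difference of the gradient.

    Assumption for this sketch: the Hessian is never formed explicitly.
    """
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def conjugate_gradient(matvec, b, max_iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - A x, with x = 0 initially
    p = r.copy()              # first search direction
    rs_old = r @ r
    for _ in range(max_iters):
        if rs_old < tol:      # squared residual small enough: stop
            break
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def hf_step(grad_fn, w, damping=1e-3):
    """One Hessian-free update: solve (H + damping * I) d = -g by CG."""
    g = grad_fn(w)
    matvec = lambda v: hessian_vector_product(grad_fn, w, v) + damping * v
    d = conjugate_gradient(matvec, -g)
    return w + d

# Toy objective (an assumption, standing in for a real training criterion):
# f(w) = 0.5 * ||X w - y||^2, whose gradient is X^T (X w - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
y = X @ w_true
grad_fn = lambda w: X.T @ (X @ w - y)

w = np.zeros(5)
for _ in range(5):
    w = hf_step(grad_fn, w)
```

The point of the sketch is that each update needs only gradient evaluations and matrix-vector products, which is what makes the approach feasible for models whose Hessian is too large to store. It also hints at the cost noted above: every outer update runs an inner conjugate-gradient loop.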