Acoustic modeling techniques that use context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) for speech recognition or speech-to-text transcription can outperform acoustic modeling techniques that use conventional Gaussian-mixture based HMMs. Unlike Gaussian-mixture based HMMs, CD-DNN-HMMs use artificial neural networks with multiple hidden layers to directly model tied context-dependent states. However, the training of CD-DNN-HMMs for use in speech recognition is generally more time consuming than the training of Gaussian-mixture based HMMs. This longer training time is a major obstacle to the widespread adoption and use of CD-DNN-HMMs for speech recognition.
The training of conventional Gaussian-mixture based HMMs for speech recognition may be optimized via parallelization. For example, the Baum-Welch training of Gaussian-mixture based HMMs may include statistics collection that is parallelized over hundreds or even thousands of servers. In such training, speech utterances may be processed independently across multiple servers. At the end of a batch of hundreds of millions of frames, partial statistics from the servers may be merged, and an updated model may be distributed to the servers. However, techniques for training Gaussian-mixture based HMMs are inapplicable to the training of CD-DNN-HMMs due to differences in model type, training procedures, and computation resource usage.
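The parallel statistics collection described above for Gaussian-mixture based HMMs can be illustrated with a minimal sketch. It is a deliberate simplification, not the full Baum-Welch procedure: it assumes a single one-dimensional Gaussian, takes the per-frame occupation posteriors as given rather than computing them with the forward-backward algorithm, and simulates the servers with an ordinary loop. All names (`Stats`, `collect_stats`, `update_model`, `train_step`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Stats:
    """Partial sufficient statistics for one Gaussian (hypothetical container)."""
    count: float = 0.0     # sum of occupation posteriors
    total: float = 0.0     # posterior-weighted sum of frames
    total_sq: float = 0.0  # posterior-weighted sum of squared frames

    def merge(self, other):
        # Sufficient statistics are additive, so merging is element-wise addition.
        return Stats(self.count + other.count,
                     self.total + other.total,
                     self.total_sq + other.total_sq)


def collect_stats(frames, posteriors):
    """Accumulate statistics for one shard of frames (would run on one server)."""
    stats = Stats()
    for x, gamma in zip(frames, posteriors):
        stats.count += gamma
        stats.total += gamma * x
        stats.total_sq += gamma * x * x
    return stats


def update_model(stats):
    """Re-estimate the mean and variance from the merged statistics."""
    mean = stats.total / stats.count
    variance = stats.total_sq / stats.count - mean * mean
    return mean, variance


def train_step(shards):
    """One batch: collect per-shard statistics, merge them, update the model."""
    # Assumption: each shard is processed independently; here a loop stands in
    # for the separate servers.
    partial = [collect_stats(frames, post) for frames, post in shards]
    merged = reduce(Stats.merge, partial, Stats())
    return update_model(merged)


if __name__ == "__main__":
    shards = [([1.0, 2.0], [1.0, 1.0]), ([3.0, 4.0], [1.0, 1.0])]
    print(train_step(shards))
```

Because the statistics are simple sums, the result of merging the per-shard statistics is the same as if a single machine had processed every frame, which is why this form of training can be spread over many servers with only one merge per batch.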