Robust and accurate speech recognition systems today can only be realized with adequately trained acoustic models. State-of-the-art systems are now trained using thousands of hours of speech data, and the entire training process can take many weeks. Although existing training techniques that utilize Hidden Markov Models work well for training speech recognition models, there are ongoing efforts to further improve the efficiency of such training.
There have been a number of efforts over the past decades to reduce the time required to train Hidden Markov Models for speech recognition. In 1990, Pepper et al. experimented with performing training on a set of computers organized in a ring. In 1992, Foote et al. introduced an approach to distribute Hidden Markov Model training across a set of five loosely coupled Armstrong II multi-processor network computers. In 1997, Yun et al. mapped the training algorithm to a field-programmable gate array infrastructure. In 2006, Poprescu et al. implemented acoustic model training on a three-node cluster based on the message passing interface. These prior works all achieved less than a 3× speedup over sequential runs and thus have not been widely used.
In a separate field, Liu implemented training of discrete Hidden Markov Models on graphics processing units. In particular, Liu developed the implementation to be effective for applications such as biological sequence analysis. This generic training engine, however, is not appropriate for acoustic model training because it (1) is unable to handle continuous observation models, and (2) cannot take advantage of the special left-right model structure used in speech recognition. Further, Dixon et al. introduced techniques for fast acoustic likelihood computation in the context of a speech recognition decoder, but did not extend the work to the training process. Additionally, Pangborn constructed an efficient graphics processing unit implementation for flow cytometry, as used in biology and immunology, but this approach trains only a single Gaussian mixture model and is thus unsuitable for acoustic model training.