1. Field
The following description relates to speech recognition technologies.
2. Description of Related Art
A speech recognition engine of an electronic device or server is generally composed of an acoustic model, a language model, and a decoder. The acoustic model may be a static model that outputs probabilities of phonemes and pronunciations for an input audio signal, based on those pronunciations and their connectivity. The language model is a static model that may independently output information associated with phonemes, pronunciations, words, sentences, and the like, based on independently trained or directed connectivity among them. The decoder decodes the outputs of the acoustic model and the language model to return a final recognition result for the input audio signal. A Gaussian Mixture Model (GMM) has generally been used for the acoustic model in the past, but speech recognition performance has recently been improved by using a Deep Neural Network (DNN) acoustic model. As noted, such speech recognition techniques use acoustic and language models that have been trained independently of each other. Still further, a Viterbi decoding scheme has typically been used in the acoustic model.
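As an illustration only, the following minimal sketch shows how a Viterbi-style decoder might combine per-frame acoustic scores with separately supplied language-model scores. It is not taken from any particular engine: the function name `viterbi`, the two-label inventory, and all probability values are hypothetical, and a bigram transition table stands in for the independently trained language model.

```python
import math

def viterbi(acoustic_logp, lm_logp, init_logp):
    """Return (best label path, its log score).

    acoustic_logp: list of dicts, one per frame, label -> log P(frame | label)
    lm_logp:       dict (prev, cur) -> log P(cur | prev), a stand-in language model
    init_logp:     dict label -> log P(label) for the first frame
    """
    labels = list(init_logp)
    # best[l] holds the score of the best path that ends in label l.
    best = {l: init_logp[l] + acoustic_logp[0][l] for l in labels}
    backptrs = []
    for frame in acoustic_logp[1:]:
        new_best, ptr = {}, {}
        for cur in labels:
            # Pick the predecessor that maximizes path score plus transition.
            prev = max(labels, key=lambda p: best[p] + lm_logp[(p, cur)])
            new_best[cur] = best[prev] + lm_logp[(prev, cur)] + frame[cur]
            ptr[cur] = prev
        best = new_best
        backptrs.append(ptr)
    # Trace back from the best final label.
    last = max(labels, key=best.get)
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]

def to_log(table):
    """Convert a table of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical toy inputs: two labels, three audio frames.
EXAMPLE_INIT = to_log({"a": 0.5, "b": 0.5})
EXAMPLE_LM = to_log({("a", "a"): 0.9, ("a", "b"): 0.1,
                     ("b", "a"): 0.1, ("b", "b"): 0.9})
EXAMPLE_FRAMES = [to_log({"a": 0.9, "b": 0.1}),
                  to_log({"a": 0.4, "b": 0.6}),
                  to_log({"a": 0.9, "b": 0.1})]

if __name__ == "__main__":
    print(viterbi(EXAMPLE_FRAMES, EXAMPLE_LM, EXAMPLE_INIT))
```

In this toy input the middle frame weakly favors label b on acoustic evidence alone, yet the decoded path stays on label a throughout because the stand-in language model penalizes switching labels. The two score tables are built separately, mirroring the independently trained models described above.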