Speech recognition has been the subject of a significant amount of research and commercial development. For example, speech recognition systems have been incorporated into mobile telephones, desktop computers, automobiles, and the like in order to provide a particular response to speech input provided by a user. For instance, in a mobile telephone equipped with speech recognition technology, a user can speak a name of a contact listed in the mobile telephone and the mobile telephone can initiate a call to the contact. Furthermore, many companies are currently using speech recognition technology to aid customers in connection with identifying employees of a company, identifying problems with a product or service, etc.
Even after decades of research, however, the performance of automatic speech recognition (ASR) systems in real-world usage scenarios remains far from satisfactory. Conventionally, Hidden Markov Models (HMMs) have been the dominant technique for large vocabulary continuous speech recognition (LVCSR). An HMM is a generative model in which the observable acoustic features are assumed to be generated from a hidden Markov process that transitions between states S = {s_1, ..., s_K}. The key parameters in the HMM are the initial state probability distribution π = {p(q_0 = s_i)}, where q_t is the state at time t; the transition probabilities a_ij = p(q_t = s_j | q_{t-1} = s_i); and a model to estimate the observation probabilities p(x_t | s_i).
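By way of illustration only, the following minimal sketch shows how the three HMM parameter groups named above can combine to score an observation sequence using the standard forward recursion. The function name, the toy numbers, and the convention that the state at the first time step emits the first observation are assumptions made for this example and are not taken from any particular ASR system.

```python
def forward_likelihood(pi, A, B):
    """Return the total likelihood of an observation sequence under an HMM.

    pi[i]   : initial probability of state s_i
    A[i][j] : transition probability from s_i to s_j
    B[t][i] : observation probability p(x_t | s_i), precomputed per frame
    (All names are hypothetical; the first state is assumed to emit x_0.)
    """
    K = len(pi)
    # Initialization: start in each state and emit the first observation.
    alpha = [pi[i] * B[0][i] for i in range(K)]
    # Recursion: sum over all predecessor states, then emit the next frame.
    for t in range(1, len(B)):
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(K)) * B[t][j]
            for j in range(K)
        ]
    # Termination: marginalize over the final state.
    return sum(alpha)


# Toy two-state model with two observed frames (illustrative values only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.2, 0.9]]
likelihood = forward_likelihood(pi, A, B)
```

In a practical system the recursion would typically be carried out in the log domain to avoid numerical underflow on long utterances.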
In conventional HMMs used for ASR, the observation probabilities are modeled using Gaussian Mixture Models (GMMs). These GMM-HMMs are typically trained to maximize the likelihood of generating the observed features. Recently, various discriminative strategies and large-margin techniques have been explored. The potential of such techniques, however, is restricted by limitations of the GMM emission distribution model.
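As a non-limiting sketch, a GMM observation model of the kind referred to above can be evaluated for a single feature vector as shown below. Diagonal covariance matrices are assumed here for simplicity, and the function and parameter names are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Return log p(x | s) for one state modeled by a diagonal-covariance GMM.

    x         : feature vector (list of floats)
    weights   : mixture weights, assumed positive and summing to 1
    means     : one mean vector per mixture component
    variances : one per-dimension variance vector per component (assumed diagonal)
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of a diagonal Gaussian is a sum over dimensions.
        log_gauss = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over components, shifted by the maximum for stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

Maximum-likelihood training adjusts the weights, means, and variances to increase this quantity on the training features, whereas the discriminative strategies mentioned above optimize a criterion that also depends on competing hypotheses.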
Attempts have been made to extend the conventional GMM-HMM architecture so that discriminative training becomes an inherent part of the model. For example, the use of artificial neural networks (ANNs) has been proposed to estimate observation probabilities. Such models have been referred to as ANN-HMM hybrid models and were, in the recent past, viewed as a promising technique for LVCSR. Such hybrids, however, have been associated with various limitations. For instance, using only backpropagation to train a feed-forward ANN does not exploit more than two hidden layers well. Accordingly, given the deficiencies in conventional ASR systems, improved ASR systems are desirable.
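For illustration, one commonly described way for an ANN to supply observation probabilities to an HMM is to divide the state posteriors p(s_i | x_t) produced by the network by the state priors p(s_i). By Bayes' rule this yields a quantity proportional to p(x_t | s_i), since the factor p(x_t) is the same for every state. The sketch below assumes this convention; the function name and inputs are hypothetical, and the hybrid models referred to above are not necessarily limited to it.

```python
def scaled_likelihoods(posteriors, priors):
    """Convert ANN state posteriors into scaled observation likelihoods.

    posteriors[i] : p(s_i | x_t), e.g. a softmax output for one frame
    priors[i]     : p(s_i), e.g. relative state frequency in training data
    Returns values proportional to p(x_t | s_i); the shared factor p(x_t)
    is omitted because it does not depend on the state.
    """
    if any(q <= 0.0 for q in priors):
        raise ValueError("state priors must be strictly positive")
    return [p / q for p, q in zip(posteriors, priors)]
```

The resulting values can stand in for p(x_t | s_i) during decoding, because a per-frame constant shared by all states does not change which state sequence scores highest.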