Current speech recognition systems generally have three main stages. First, the sound waveform is passed through feature extraction to generate relatively compact feature vectors at a frame rate of around 100 Hz. Second, these feature vectors are fed to an acoustic model that has been trained to associate particular vectors with particular speech units. Commonly, this is realized as a set of Gaussian mixture models (GMMs) of the distributions of feature vectors corresponding to context-dependent phones. (A phone is a speech sound considered without reference to its status as a phoneme.) Finally, the output of these models provides the relative likelihoods of the different speech sounds needed by a hidden Markov model (HMM) decoder, which searches for the most likely allowable word sequence, possibly subject to linguistic constraints.
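As an illustration of the second stage, the following minimal sketch evaluates the log-likelihood of one feature frame under a GMM. It assumes diagonal covariances, a common choice but not one stated above, and the two-component model and all names (`log_gaussian_diag`, `gmm_log_likelihood`, `weights`, `means`, `variances`) are hypothetical values chosen for the example.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | unit) for a mixture, computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component model over 2-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

frame = [0.1, -0.2]  # one feature vector (assumed values)
score = gmm_log_likelihood(frame, weights, means, variances)
```

In a full recognizer one such score would be computed per frame and per context-dependent phone model, and the decoder would combine them along candidate state sequences.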
A hybrid connectionist-HMM framework replaces the GMM acoustic model with a neural network (NN), discriminatively trained to estimate the posterior probabilities of each subword class given the data. Hybrid systems give comparable performance to GMM-based systems for many corpora, and may be implemented with simpler systems and training procedures.
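Since the network produces posteriors while an HMM decoder is usually formulated in terms of likelihoods, hybrid systems commonly divide each posterior by the prior of its class to obtain a scaled likelihood, using P(x|q) ∝ P(q|x) / P(q). The sketch below shows that conversion in the log domain. The softmax output layer, the `floor` parameter, and the function names are assumptions made for illustration, not details taken from the text.

```python
import math


def softmax(logits):
    """Turn network output activations into posteriors that sum to one."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """log P(q|x) - log P(q) per class; equals log p(x|q) up to a constant.

    `floor` (an assumed safeguard) avoids taking the log of zero.
    """
    return [
        math.log(max(p, floor)) - math.log(max(q, floor))
        for p, q in zip(posteriors, priors)
    ]


# Hypothetical three-class example: network activations and class priors.
posteriors = softmax([2.0, 0.5, -1.0])
priors = [0.5, 0.3, 0.2]  # e.g. relative class frequencies in training data
scores = scaled_log_likelihoods(posteriors, priors)
```

The constant dropped by the proportionality is the same for every class within a frame, so it does not change which state sequence the decoder prefers.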
Because of the different probabilistic bases (likelihoods versus posteriors) and different representations for the acoustic models (means and variances of mixture components versus network weights), techniques developed for one domain may be difficult to transfer to the other. The relative dominance of likelihood-based systems has resulted in the availability of very sophisticated tools offering advanced, mature and integrated system parameter estimation procedures. On the other hand, discriminative acoustic model training and certain combination strategies facilitated by the posterior representation are much more easily implemented within the connectionist framework.
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively trained neural networks to estimate the posterior probability distribution over subword units given the acoustic observations.