Speech recognition systems convert input audio including speech to recognized text. During recognition, acoustic waveforms are typically divided into a sequence of discrete time vectors (e.g., 10 ms segments) called “frames,” and one or more of the frames are converted into sub-word (e.g., phoneme or syllable) representations using various approaches. In a first approach, input audio is compared to a set of templates and the sub-word representation for the template in the set that most closely matches the input audio is selected as the sub-word representation for that input. In a second approach, statistical modeling is used to convert input audio to a sub-word representation (e.g., to perform acoustic-phonetic conversion). When statistical modeling is used, acoustic waveforms are processed to determine feature vectors for one or more of the frames of the input audio, and statistical models are used to assign a particular sub-word representation to each frame based on its feature vector.
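The framing step and the first (template-matching) approach described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the description: the 8 kHz sample rate, the two toy features (RMS energy and zero-crossing rate), the template values, and the labels `vowel_like` and `fricative_like`. A practical system would use richer spectral features and a much larger template set.

```python
import math

def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a waveform into consecutive, non-overlapping frames of frame_ms each."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def frame_features(frame):
    """Toy feature vector (assumption): [RMS energy, zero-crossing rate]."""
    energy = math.sqrt(sum(s * s for s in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / max(len(frame) - 1, 1)]

def closest_template(features, templates):
    """Return the label of the template nearest to features (Euclidean distance)."""
    return min(templates, key=lambda label: math.dist(features, templates[label]))

SAMPLE_RATE_HZ = 8000  # assumed sample rate; 10 ms then corresponds to 80 samples

def tone(freq_hz, n_samples):
    """Synthetic stand-in for recorded audio: a unit-amplitude sine tone."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ) for n in range(n_samples)]

# Two frames of synthetic audio: a low-frequency tone, then a high-frequency tone.
signal = tone(200, 80) + tone(3000, 80)

# Hypothetical templates in the same feature space as frame_features().
templates = {
    "vowel_like": [0.7, 0.05],
    "fricative_like": [0.7, 0.75],
}

frames = split_into_frames(signal, SAMPLE_RATE_HZ, frame_ms=10)
labels = [closest_template(frame_features(f), templates) for f in frames]
print(labels)
```

Nearest-template selection keeps the sketch short. The second, statistical approach would replace `closest_template` with a model that scores each frame's feature vector probabilistically.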
Hidden Markov Models (HMMs) are statistical models that are often used in speech recognition to characterize the spectral properties of a sequence of acoustic patterns. For example, acoustic features of each frame of input audio may be modeled by one or more states of an HMM to classify the set of features into phonetic-based categories. Gaussian Mixture Models (GMMs) are often used within each state of an HMM to model the probability density of the acoustic patterns associated with that state. Artificial neural networks (ANNs) may alternatively be used for acoustic modeling in a speech recognition system. Such ANNs may be trained to estimate the posterior probability of each state of an HMM given an acoustic pattern. Some statistical speech recognition systems favor ANNs over GMMs because ANNs can provide more accurate recognition results and faster computation of the posterior probabilities of the HMM states.
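As a rough illustration of the quantities involved, the sketch below scores one feature vector against a diagonal-covariance GMM per HMM state, then converts those likelihoods into state posteriors with Bayes' rule. The state names, mixture parameters, and priors are invented for the example. An ANN-based acoustic model would instead be trained to output these posteriors directly; the sketch derives them from the GMMs only to show what is being estimated.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """log p(x | state) for a GMM given as a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def state_posteriors(x, state_gmms, state_priors):
    """P(state | x) from per-state GMM likelihoods and state priors (Bayes' rule)."""
    joint = {
        s: math.log(state_priors[s]) + gmm_log_likelihood(x, gmm)
        for s, gmm in state_gmms.items()
    }
    peak = max(joint.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return {s: math.exp(v - log_norm) for s, v in joint.items()}

# Hypothetical two-state model over 2-dimensional feature vectors.
state_gmms = {
    "s1": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [1.0, 1.0], [1.0, 1.0])],
    "s2": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 4.0], [1.0, 1.0])],
}
state_priors = {"s1": 0.5, "s2": 0.5}  # assumed equal priors

feature_vector = [0.5, 0.2]  # made-up feature vector for one frame
posteriors = state_posteriors(feature_vector, state_gmms, state_priors)
best_state = max(posteriors, key=posteriors.get)
print(best_state, posteriors)
```

Working in the log domain avoids underflow when likelihoods are very small, which is common with high-dimensional feature vectors.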