The goal of human speech production is to convey discrete linguistic symbols corresponding to the intended message, while the actual speech signal is produced by the continuous, smooth movement of the articulators and carries rich temporal structure. Human listeners exploit this seemingly contradictory dual nature (discrete vs. continuous) of speech to their benefit when decoding the underlying message from acoustic signals. However, capturing this dual nature has so far been a serious challenge for acoustic modeling in both scientific research and practical applications.
The conventional hidden Markov models (HMMs) used in state-of-the-art speech technology, although they place sufficient emphasis on the symbolic nature of speech, have long been recognized to model temporal dynamics very poorly, which results in inherent weaknesses in the speech technology built upon them. Efforts have since been made to improve the modeling of temporal dynamics, with the ultimate goal of turning the coarticulation behavior of natural speech from a curse (as in current speech technology) into a blessing. Currently there are two general trends in the speech research community toward this goal: one is to extend the HMM to better account for the temporal dynamics in acoustic signals directly; the other is to use some kind of hidden dynamics, abstract or physically meaningful, to account for the temporal dynamics and subsequently map them to the acoustic domain. The HMM extensions typically enjoy the benefit of being able to use the standard HMM training and test algorithms with some generalization, but they have more model parameters and need more computation. In addition, the temporal dynamics at the surface acoustic level are very noisy and difficult to extract. The hidden dynamic models (HDMs) are able to model the underlying dynamics directly with a parsimonious set of parameters and are closer to the models developed in speech science, but they typically require the derivation of new training and test algorithms of varying difficulty.
By way of additional background, in speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector represents a section of the speech signal.
The feature vectors can represent any number of available features extracted through known feature extraction methods such as Linear Predictive Coding (LPC), LPC-derived cepstrum, Perceptual Linear Prediction (PLP), auditory model feature extraction, and Mel-Frequency Cepstral Coefficients (MFCC).
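By way of illustration only, the front-end steps described above (digital values grouped into sections, one feature vector per section) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the 400-sample frame length, the 160-sample hop, and the 16 kHz test tone are all hypothetical, and the two features computed here (log energy and zero-crossing rate) are simple stand-ins chosen for brevity, not LPC, PLP, or MFCC features.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of digital samples into overlapping frames (sections)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def toy_features(frame):
    """Return a 2-dimensional feature vector: [log energy, zero-crossing rate].

    Assumption: these two quantities are illustrative placeholders for the
    feature types named in the text.
    """
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

def extract_features(samples, frame_len=400, hop=160):
    """Compute one feature vector per frame of the digitized signal."""
    return [toy_features(f) for f in frame_signal(samples, frame_len, hop)]

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
features = extract_features(signal)
```

With these assumed settings each frame spans 25 ms and successive frames start 10 ms apart, so each feature vector describes a short, overlapping section of the speech signal.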
The feature vectors are applied to an acoustic model that describes the probability that a feature vector was produced by a particular word, phoneme, or senone. Based on a sequence of these probabilities, a decoder identifies a most likely word sequence for the input speech signal.
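The scoring and decoding steps described above can likewise be sketched in a few lines. The following is a toy example under stated assumptions, not a description of any particular recognizer: each unit (for example, a phoneme) is modeled by a single diagonal-covariance Gaussian, the unit names "a" and "b", their parameters, and the transition probabilities are hypothetical, and the decoder is a plain Viterbi search that returns one unit label per frame. The lexicon and language model needed to turn unit sequences into word sequences are omitted.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def viterbi(features, units, log_init, log_trans):
    """Return the most likely unit label for each frame.

    units maps a unit name to (mean, var); log_init and log_trans hold
    log initial and log transition probabilities (all hypothetical here).
    """
    names = list(units)
    # score[u] is the best log probability of any path ending in unit u.
    score = {u: log_init[u] + log_gaussian(features[0], *units[u]) for u in names}
    backptrs = []
    for x in features[1:]:
        new_score, ptr = {}, {}
        for u in names:
            best_prev = max(names, key=lambda p: score[p] + log_trans[p][u])
            ptr[u] = best_prev
            new_score[u] = (score[best_prev] + log_trans[best_prev][u]
                            + log_gaussian(x, *units[u]))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final unit to recover the full path.
    path = [max(names, key=lambda u: score[u])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical two-unit acoustic model over 1-dimensional feature vectors.
units = {"a": ([0.0], [1.0]), "b": ([5.0], [1.0])}
log_init = {"a": math.log(0.5), "b": math.log(0.5)}
log_trans = {
    "a": {"a": math.log(0.9), "b": math.log(0.1)},
    "b": {"a": math.log(0.1), "b": math.log(0.9)},
}
path = viterbi([[0.1], [-0.2], [4.8], [5.3]], units, log_init, log_trans)
print(path)  # ['a', 'a', 'b', 'b']
```

Working in log probabilities keeps the products of many small per-frame probabilities numerically stable, which is why the scores are summed rather than multiplied.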