Modern speech recognition systems are based on principles of statistical pattern recognition. They typically employ an acoustic model and a language model to decode an input sequence of observations (also referred to as acoustic events or acoustic signals) representing input speech (e.g., a sentence or string of words) and to determine the most probable sentence or word sequence given that sequence of observations. In other words, the function of a modern speech recognizer is to search through a vast space of candidate sentences and to choose the sentence or word sequence that has the highest probability of generating the input sequence of observations. In general, most modern speech recognition systems employ acoustic models based on continuous density hidden Markov models (CDHMMs). In particular, CDHMMs have been widely used in speaker-independent large vocabulary continuous speech recognition (LVCSR) because they outperform discrete HMMs and semi-continuous HMMs. In CDHMMs, the probability function of observations, or state observation distribution, is modeled by multivariate mixture Gaussians (also referred to herein as Gaussian mixtures), which can approximate the speech feature distribution more accurately.
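As an illustration of the state observation distribution described above, the following minimal sketch evaluates the log-likelihood of one feature vector under a Gaussian mixture. It assumes diagonal covariance matrices, which the text does not specify, and the function and parameter names (`log_gaussian_diag`, `gmm_log_likelihood`, `weights`, `means`, `variances`) are hypothetical names chosen for this example.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of sum_k weights[k] * N(x; means[k], variances[k]).

    Assumption: diagonal covariances, one variance vector per component.
    The sum is computed with the log-sum-exp trick for numerical stability.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
        if w > 0.0  # skip components with zero weight
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Usage: a two-component mixture over 2-dimensional features.
score = gmm_log_likelihood(
    x=[0.3, -1.2],
    weights=[0.6, 0.4],
    means=[[0.0, -1.0], [2.0, 1.0]],
    variances=[[1.0, 0.5], [1.5, 1.0]],
)
```

In a CDHMM-based recognizer, a score of this kind would be computed for each state and each observation frame, which is why the number of mixture components per state directly affects both accuracy and the number of parameters to be trained.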
In practice, contextual effects can cause significant variations in the way different sounds are produced. Contextual variations of sounds can be modeled more accurately using context dependent models. In other words, to achieve good phonetic discrimination, a different CDHMM has to be trained for each different context. In general, triphone models have been used as context dependent models, in which every phone has a distinct HMM for every unique pair of left and right neighbors. The use of Gaussian mixture output distributions allows each state distribution to be modeled accurately. However, when context dependent models (e.g., triphones) are used, there is a very large number of parameters to train and little or insufficient training data with which to train them. One of the early approaches to dealing with this problem is to tie all Gaussian components together to form a pool that is shared among HMM states. This approach is called the tied-mixture approach. In a tied-mixture system, only the mixture component weights are state-specific, and they can be smoothed by interpolating with context dependent models.
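The tied-mixture idea described above can be sketched as follows: the Gaussians in the shared pool are evaluated once per observation frame, and each HMM state contributes only its own weight vector. This is a minimal sketch under stated assumptions, not a definitive implementation. Diagonal covariances are assumed, smoothing is shown as a simple linear interpolation with a fixed coefficient `lam`, and all function and parameter names are hypothetical.

```python
import math


def pool_densities(x, pool_means, pool_vars):
    """Evaluate every Gaussian in the shared pool once for frame x.

    Assumption: each pool component has a diagonal covariance.
    """
    densities = []
    for mean, var in zip(pool_means, pool_vars):
        log_p = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        densities.append(math.exp(log_p))
    return densities


def state_likelihood(state_weights, densities):
    """State output likelihood: only the weights are state-specific."""
    return sum(w * d for w, d in zip(state_weights, densities))


def interpolate_weights(weights_a, weights_b, lam):
    """Smooth one weight vector toward another (assumed linear interpolation)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(weights_a, weights_b)]


# Usage: a pool of two 1-dimensional Gaussians shared by two states.
dens = pool_densities([0.0], pool_means=[[0.0], [2.0]], pool_vars=[[1.0], [1.0]])
sparse_state = [1.0, 0.0]   # weights estimated from very little data
robust_state = [0.5, 0.5]   # weights estimated from more data
smoothed = interpolate_weights(sparse_state, robust_state, lam=0.5)
likelihood = state_likelihood(smoothed, dens)
```

Because the pool is shared, adding a new context dependent state adds only one weight per pool component rather than a new set of means and covariances, which is what makes the scheme easier to train on limited data.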
Recently, another approach called decision tree state tying has been used to improve the trainability of speech recognition systems and to strike a better balance between the level of detail of the phonetic models (e.g., the number of parameters in the system) and the ability to accurately estimate those parameters from the available training data. Context modeling based on the decision tree state tying approach has become increasingly popular for modeling speech variations in LVCSR systems. In the conventional framework, the stochastic classifier for each tied state is trained using the Baum-Welch algorithm with the training data corresponding to that specific tied state. However, the context dependent classifiers trained using this conventional method are not very reliable, because the training data corresponding to each tied state is still limited and the model parameters can easily be affected by undesired sources of information, such as speaker and channel information, contained in the training data.
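The text does not specify how the decision tree is grown. As a hedged illustration of one common formulation, the sketch below picks the phonetic question that gives the largest log-likelihood gain when a pool of context dependent states is split in two. It assumes a single one-dimensional Gaussian per cluster and per-context sufficient statistics of the form (count, sum, sum of squares); the question names, phone sets, and function names are hypothetical.

```python
import math


def cluster_log_likelihood(stats):
    """Log-likelihood of pooled data under one ML-fitted 1-D Gaussian.

    stats: list of (count, sum, sum_of_squares) tuples, one per context.
    """
    n = sum(s[0] for s in stats)
    if n == 0:
        return 0.0
    mean = sum(s[1] for s in stats) / n
    var = max(sum(s[2] for s in stats) / n - mean * mean, 1e-6)  # variance floor
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def best_question(states, questions):
    """Return (question_name, gain) for the split with the largest gain.

    states: dict mapping a left-context phone to its sufficient statistics.
    questions: dict mapping a question name to the set of phones answering yes.
    """
    parent = cluster_log_likelihood(list(states.values()))
    best_name, best_gain = None, 0.0
    for name, phone_set in questions.items():
        yes = [s for p, s in states.items() if p in phone_set]
        no = [s for p, s in states.items() if p not in phone_set]
        if not yes or not no:
            continue  # the question does not actually split the pool
        gain = cluster_log_likelihood(yes) + cluster_log_likelihood(no) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Usage: four left contexts of one phone state; nasal contexts behave alike.
states = {
    "m": (10, 10.0, 11.0),
    "n": (10, 10.0, 11.0),
    "p": (10, -10.0, 11.0),
    "t": (10, -10.0, 11.0),
}
questions = {"left_is_nasal": {"m", "n"}, "left_is_labial": {"m", "p"}}
name, gain = best_question(states, questions)
```

States that end up in the same leaf of such a tree would share one output distribution, so each tied state is trained on the pooled data of all its member contexts. As the text notes, that pooled data can still be limited.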