An inherent problem with speech recognition or speech synthesis in many languages is that a given phoneme may be pronounced differently depending on its context. For example, the plosive phoneme “g” is pronounced differently in each of its two occurrences in the word “gauge”. To address this problem, context-dependent acoustic models have been widely used.
As the number of contexts increases, the number of combinations of contexts increases exponentially. It is almost impossible to observe all possible combinations of contexts in a limited amount of training or adaptation data. To address this problem, decision-tree-based context clustering has been used. Here, similar states of hidden Markov models (HMMs) are clustered into a small number of clusters using decision trees. The decision trees are usually built using a maximum likelihood (ML) criterion. By traversing the constructed decision trees, combinations of contexts that were not seen in the training data can be assigned to a leaf node of a decision tree. Model parameters are also estimated in the decision tree clustering process based on the ML criterion.
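The traversal step described above can be illustrated with a minimal sketch. It covers only the lookup of a leaf for a context combination that was never observed in training; the tree here is written by hand rather than built with an ML criterion, and the `Node` class, the `find_leaf` function, the context keys (`left`, `phone`, `right`) and the questions are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    # Internal nodes carry a yes/no question about the context;
    # leaves carry only a cluster identifier (question is None).
    question: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    leaf_id: Optional[int] = None


def find_leaf(node: Node, context: dict) -> int:
    """Walk from the root to a leaf by answering each node's question."""
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.leaf_id


VOWELS = {"a", "e", "i", "o", "u"}

# Hand-written example tree (assumption: a real tree would be grown
# from training data, which is not shown here).
tree = Node(
    question=lambda c: c["right"] in VOWELS,
    yes=Node(
        question=lambda c: c["left"] == "sil",
        yes=Node(leaf_id=0),
        no=Node(leaf_id=1),
    ),
    no=Node(leaf_id=2),
)

# A context combination assumed to be absent from the training data
# still reaches a leaf, so it can share that leaf's model parameters.
unseen = {"left": "z", "phone": "g", "right": "o"}
leaf = find_leaf(tree, unseen)
```

Because every question has both a yes branch and a no branch, any context reaches exactly one leaf, which is what allows unseen combinations to be assigned model parameters.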
When the model is adapted to a speaker, model parameters are transformed or updated based on a criterion. Maximum likelihood linear regression or a maximum a posteriori criterion is often used. To adapt a general acoustic model of a hidden-Markov-model-based statistical parametric speech synthesis system to target voice characteristics, speaking styles, and/or emotions, linear transformations of model parameters (e.g. variants of maximum likelihood linear regression) are used. These techniques linearly transform the mean vectors and covariance matrices associated with the states of the hidden Markov models based on some criterion, such as the maximum likelihood criterion.
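As a rough sketch of what such a linear transformation does to one state's Gaussian, the mean can be mapped as A·μ + b and the covariance as H·Σ·Hᵀ. The matrices and vectors below are arbitrary illustrative values, not estimated ones; estimating A, b and H from adaptation data under the maximum likelihood criterion is not shown, and the function names are hypothetical.

```python
def mat_vec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def mat_mul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def transform_mean(A, b, mean):
    """Adapted mean: A * mean + b."""
    return [m + off for m, off in zip(mat_vec(A, mean), b)]


def transform_covariance(H, cov):
    """Adapted covariance: H * cov * H^T."""
    return mat_mul(mat_mul(H, cov), transpose(H))


# Illustrative values only (assumption: a 2-dimensional feature space).
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
mean = [1.0, 2.0]
cov = [[1.0, 0.0], [0.0, 4.0]]

adapted_mean = transform_mean(A, b, mean)
adapted_cov = transform_covariance(A, cov)  # same matrix reused for the covariance
```

In this sketch the same matrix is applied to the mean and the covariance for brevity; variants that use separate transforms for means and covariances would pass a different `H`.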
In the adaptation stage, the constructed decision trees are fixed, and they were built from the original training data, which differs from the adaptation data. If the training data and the adaptation data have very different context dependencies, it is not possible to model the context dependency of the adaptation data. For example, suppose the general model is trained on neutral voices and the adaptation data is an expressive voice. To control the expressiveness, expressiveness may be modelled as contexts. However, if the general acoustic model has no expressiveness contexts, the model cannot be properly adapted to the expressive voice.
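The limitation can be made concrete with a small sketch, assuming the fixed tree is represented simply by the list of questions it asks. The question set, the context keys and the `emotion` label are hypothetical; the point is only that a context feature no question refers to cannot influence which cluster is selected.

```python
PLOSIVES = {"p", "b", "t", "d", "k", "g"}
VOWELS = {"a", "e", "i", "o", "u"}

# Questions fixed at training time on neutral data (assumption):
# none of them asks about expressiveness.
FIXED_QUESTIONS = [
    lambda c: c["phone"] in PLOSIVES,
    lambda c: c["right"] in VOWELS,
]


def cluster_for(context):
    """Identify a cluster by the answers to the fixed questions."""
    return tuple(q(context) for q in FIXED_QUESTIONS)


neutral = {"phone": "g", "right": "a", "emotion": "neutral"}
happy = {"phone": "g", "right": "a", "emotion": "happy"}

# Both contexts fall into the same cluster, so any transform applied
# per cluster treats them identically.
same_cluster = cluster_for(neutral) == cluster_for(happy)
```

Since the two contexts differ only in a feature the fixed questions never inspect, they share one set of parameters before and after adaptation.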