In communication, data processing and similar systems, a user interface using audio facilities is often advantageous, especially when the user is expected to be physically engaged in another activity (e.g., driving a car) while operating such a system. Techniques for recognizing human speech in such systems to perform certain tasks have been developed.
In accordance with one such technique, input speech is analyzed in signal frames, represented by feature vectors corresponding to phonemes making up individual words. The phonemes are characterized by hidden Markov models (HMMs), and a Viterbi algorithm is used to identify a sequence of HMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown, spoken utterance.
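The Viterbi search mentioned above can be illustrated with a minimal sketch. It assumes a toy discrete-state model whose per-frame emission log-likelihoods have already been computed; the function name `viterbi` and its argument layout are illustrative choices, not taken from any particular recognizer.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log-probability.

    log_init[s]     : log P(state s at frame 0)
    log_trans[r][s] : log P(next state s | current state r)
    log_emit[t][s]  : log-likelihood of frame t's feature vector in state s
    (All names are hypothetical; this is a toy model, not a full recognizer.)
    """
    n_states = len(log_init)
    # Best score of any path ending in each state at frame 0.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Pick the predecessor state that maximizes the path score into s.
            best_r = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best_r)
            score.append(prev[best_r] + log_trans[best_r][s] + log_emit[t][s])
        backptr.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

# Toy two-state example with three frames (probabilities are made up).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]  # emit[t][s]
log = math.log
path, log_prob = viterbi([log(p) for p in init],
                         [[log(p) for p in row] for row in trans],
                         [[log(p) for p in row] for row in emit])
```

Working in the log domain avoids numerical underflow on long utterances. In a real system the states would belong to concatenated phoneme HMMs and the emission scores would come from their Gaussian mixtures.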
It is well known that each HMM comprises model parameters, e.g., mixture components which are characterized by Gaussian distributions. In a learning phase in a speech recognition system, the HMMs are adapted to speech input by a user to adjust to the particular speech characteristics of the user, thereby increasing the accuracy of the speech recognition. In the prior art, two well known approaches for adaptation of HMMs have been employed, namely, the Bayesian adaptation approach and the transformation based approach.
According to the Bayesian adaptation approach, prior distributions are assumed for the model parameters in HMMs, and the maximum a posteriori (MAP) estimates for the model parameters are calculated. For details on this approach, one may refer to: C. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Transactions on Signal Processing, Vol. 39, No. 4, April 1991, pp. 806-814; and J. Gauvain et al., "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 2, 1994, pp. 291-298. Because the Bayesian adaptation approach uses MAP estimates based on knowledge of prior distributions, it requires less input speech data for the adaptation than, e.g., an approach using maximum likelihood (ML) estimates, which do not rely on any such knowledge.
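As a simplified illustration of a MAP estimate, the sketch below adapts only a single Gaussian mean, assuming a conjugate Gaussian prior centered on the existing (e.g., speaker-independent) mean. The function name `map_mean` and the prior weight `tau` are hypothetical; a full treatment also covers variances and mixture weights.

```python
def map_mean(prior_mean, tau, frames, weights=None):
    """MAP estimate of a Gaussian mean under a conjugate prior.

    prior_mean : mean of the prior (list of floats)
    tau        : prior weight; behaves like a count of "virtual" prior frames
    frames     : adaptation feature vectors assigned to this Gaussian
    weights    : optional per-frame occupancy probabilities (default 1.0 each)
    Computes (tau * prior_mean + sum_t w_t * x_t) / (tau + sum_t w_t).
    """
    if weights is None:
        weights = [1.0] * len(frames)
    total_weight = tau + sum(weights)
    estimate = []
    for d in range(len(prior_mean)):
        weighted_sum = sum(w * x[d] for w, x in zip(weights, frames))
        estimate.append((tau * prior_mean[d] + weighted_sum) / total_weight)
    return estimate

# With no data the estimate equals the prior mean; with more data it
# moves toward the sample mean (the ML estimate), here 1.0.
no_data = map_mean([0.0], tau=10.0, frames=[])
some_data = map_mean([0.0], tau=10.0, frames=[[1.0]] * 10)
much_data = map_mean([0.0], tau=10.0, frames=[[1.0]] * 990)
```

The estimate interpolates between the prior mean and the sample mean, which is why it stays usable with little data yet converges toward the ML estimate as data accumulates. A Gaussian that receives no adaptation frames is left unchanged.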
However, if the adaptation data is scarce, the transformation based approach may be more effective than the Bayesian adaptation approach to adapt the HMMs. According to the transformation based approach, a transformation, e.g., a shift or an affine transformation, is defined in an acoustic feature space, also known as an "HMM parameter space," to explore correlations between different HMMs, and such correlations help adapt the HMMs despite the scarcity of the adaptation data. Parameters characterizing the transformation are estimated using the adaptation data. In implementing the transformation based approach, it is desirable to divide the acoustic feature space into a number of subspaces and estimate the transformation parameters in each subspace. However, the performance of speech recognition using the transformation based approach does not significantly improve with an increasing amount of adaptation data, because any improvement is restricted by the limited number of variable transformation parameters used in the approach.
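The simplest transformation named above, a shift, can be sketched as follows. The sketch assumes a single subspace, a hard assignment of each adaptation frame to one Gaussian, and equal covariances across Gaussians, so that the ML shift reduces to the average residual. The names `estimate_shift` and `apply_shift` are hypothetical.

```python
def estimate_shift(means, frames, labels):
    """Estimate one shift vector shared by all Gaussians in a subspace.

    means  : list of Gaussian mean vectors
    frames : adaptation feature vectors
    labels : labels[t] is the index of the Gaussian that frame t is assigned to
    Assumes equal covariances, so the shift is the mean of (x_t - mean).
    """
    dim = len(frames[0])
    total = [0.0] * dim
    for x, m in zip(frames, labels):
        for d in range(dim):
            total[d] += x[d] - means[m][d]
    return [v / len(frames) for v in total]

def apply_shift(means, shift):
    """Move every mean in the subspace by the shared shift."""
    return [[mu_d + b_d for mu_d, b_d in zip(mu, shift)] for mu in means]

# Three Gaussians, but adaptation frames were observed for only the first two.
means = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
frames = [[1.0, 0.5], [6.0, 5.5]]
labels = [0, 1]
shift = estimate_shift(means, frames, labels)
adapted = apply_shift(means, shift)
```

The third Gaussian is adapted even though no frame was assigned to it, which is how a shared transformation copes with scarce data. The same property limits the approach: however much data arrives, only the few shift parameters can change.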
An attempt has been made to combine the Bayesian adaptation approach with the transformation based approach to improve the speech recognition performance. This attempt is described in: Chien et al., "Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation," ICASSP-97, 1997, pp. 1027-1030. However, the success of such an attempt depends on optimizing the number of subspaces in the acoustic feature space for each amount of adaptation data, which is usually impractical.
Accordingly, there exists a need for combining the Bayesian adaptation approach with the transformation based approach in a feasible manner to improve the speech recognition performance, regardless of the amount of available adaptation data.