1. Field of the Invention
The present invention relates to the field of speech recognition and, more particularly, to a method of model adaptation for noisy speech recognition.
2. Description of Related Art
In a conventional automatic speech recognition system, as shown in FIG. 2, time-domain speech signals, denoted by $\{x_t\}$, are input to an end point detection and feature extraction process that distinguishes speech from background noise, so as to extract the desired speech signals. The extracted speech signals are then pattern-matched against speech reference models 21 to produce possible results, and, finally, a decision rule is applied to the possible results to obtain the recognition results, denoted by $\{W_n\}$.
Generally, the speech reference models 21 are preferably the well-known Hidden Markov Models (HMMs). Such statistical models represent the relevant feature distribution and time-variable transformation characteristics of the speech spectrum. To obtain reliable statistical models, speech data must be recorded from a great number of people before the model parameters are trained. The speech data is generally recorded in an ideal quiet environment, so the resulting statistical models are indicative of a noiseless environment. In practical applications, however, a completely noiseless environment is impossible. On the contrary, noise exists everywhere and at all times in the environment. Furthermore, the types of noise and their intensity are not predictable. As such, noise is likely to add extra spectral components to the original clean speech signals, which significantly degrades the speech recognition rate.
As is well known to those skilled in the art, a better speech recognition rate can be achieved if the environmental factors of the training speech data and of the speech to be recognized are matched; a description can be found in Juang, B. H., "Speech recognition in adverse environments", Computer Speech and Language 5, pp. 275-294, 1991, which is hereby incorporated herein by reference. Therefore, it is possible to improve the recognition rate of noisy speech by training the statistical models with speech data containing the same noise as the noisy speech. Although it is theoretically possible to retrain the model parameters whenever the environmental noise changes, this can hardly be achieved in practical applications. One major reason is that the required speech database is relatively large, and thus the cost of a speech recognizer with such a database is too high. Furthermore, the computation amount is large and the time required to train the parameters is long, so that dynamic adaptation to changes in the environment is difficult to achieve. Therefore, efforts have been devoted to obtaining noisy speech statistical models without a repetitive training process. As is known, in HMMs the speech probability density is the parameter most susceptible to external noise. Therefore, the speech recognition rate can be significantly improved if the speech probability density function is adjusted to match the noise condition of the test utterance. However, the speech density is generally expressed in the cepstral domain, while the effect of noise is additive in the linear spectral domain. As a result, it is theoretically impossible to adjust the speech probability density function directly in the cepstral domain.
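The domain mismatch can be illustrated with a short derivation. The notation here is an assumption introduced only for illustration: $S$, $N$ and $Y$ denote the linear spectra of clean speech, noise and noisy speech, $c_S$, $c_N$ and $c_Y$ denote the corresponding cepstra, $C$ denotes a DCT matrix that maps a log spectrum to a cepstrum, and $\exp$ and $\log$ act element-wise.

$$Y = S + N$$

$$c_Y = C\log(Y) = C\log\left(\exp(C^{-1}c_S) + \exp(C^{-1}c_N)\right) \neq c_S + c_N$$

Because the logarithm does not distribute over the sum, the noisy cepstrum is a nonlinear function of the speech and noise cepstra rather than a simple offset of the clean cepstrum.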
To eliminate the aforementioned problem, a Parallel Model Combination (PMC) method has been proposed to combine the statistical data of speech and noise in the linear spectral domain by means of transformation between the cepstral domain and the linear spectral domain, thereby obtaining the cepstral means and variances of the noisy speech. A description of the PMC method can be found in Gales, M. J. F. and Young, S. J., "Cepstral parameter compensation for HMM recognition in noise", Speech Communication 12, pp. 231-239, 1993, which is hereby incorporated by reference into this patent application. Accordingly, speech models can be adjusted as the environmental noise changes by detecting the background noise in the speech inactive period and determining the statistical data of the noise.
FIG. 3 shows an automatic speech recognition system utilizing such a PMC method. As shown, speech signals, denoted by $\{x_t\}$, are input to an end point detection and feature extraction process that determines the background noise and obtains the extracted speech signals. The background noise is provided for noise model estimation. The estimation results and the reference speech models 21 are applied together for PMC adaptation to obtain adapted speech models 31 that vary with the environmental noise. Then, the extracted speech signals are pattern-matched against the adapted speech models 31 to produce possible results, from which the recognition results $\{W_n\}$ are finally determined.
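The noise model estimation step can be sketched as follows. The source does not specify the estimator, so this sketch assumes a single Gaussian noise model with maximum-likelihood mean and covariance. The function name `estimate_noise_stats` and the per-frame `is_speech` flags (standing in for the output of the end point detector) are hypothetical.

```python
def estimate_noise_stats(frames, is_speech):
    """Estimate the mean vector and covariance matrix of the background noise.

    frames    : list of feature vectors (lists of floats), one per frame
    is_speech : list of booleans, True where the end point detector
                (assumed to exist elsewhere) marked the frame as speech
    """
    # Keep only the frames from the speech inactive period.
    noise = [f for f, speech in zip(frames, is_speech) if not speech]
    if not noise:
        raise ValueError("no speech-inactive frames available")

    count = len(noise)
    dim = len(noise[0])

    # Maximum-likelihood mean of the noise frames.
    mean = [sum(f[d] for f in noise) / count for d in range(dim)]

    # Maximum-likelihood (biased) covariance of the noise frames.
    cov = [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in noise) / count
            for j in range(dim)]
           for i in range(dim)]
    return mean, cov


# Two noise-only frames followed by one speech frame.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
is_speech = [False, False, True]
noise_mean, noise_cov = estimate_noise_stats(frames, is_speech)
# noise_mean is [2.0, 3.0]; noise_cov is [[1.0, 1.0], [1.0, 1.0]]
```

Only the frames flagged as non-speech contribute, so the estimate can be refreshed whenever a new speech inactive period is observed.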
In executing the above PMC method, for simplicity of expression, it is assumed that the speech probability density function is represented by a Gaussian function $f(x \mid \mu^c, \Sigma^c)$, where $x$ represents a cepstral observation vector, $\mu^c$ represents a cepstral mean vector, and $\Sigma^c$ represents a cepstral covariance matrix. The method first transforms $\mu^c$ and $\Sigma^c$ of the speech model from the cepstral domain to the log-spectral domain by performing inverse discrete cosine transform (IDCT) operations as follows:
$$\mu^l = C^{-1}\mu^c \quad \text{and}$$
$$\Sigma^l = C^{-1}\Sigma^c (C^{-1})^T,$$
where the superscript $l$ indicates a parameter in the log-spectral domain, $C^{-1}$ is the IDCT matrix, and the superscript $T$ indicates the transposed matrix. Each component of the mean vector and covariance matrix in the linear spectral domain can then be obtained as follows:
$$\mu_i = \exp(\mu_i^l + \sigma_{ii}^l/2) \quad \text{and}$$
$$\sigma_{ij} = \mu_i \mu_j [\exp(\sigma_{ij}^l) - 1].$$
After the mean vectors and covariance matrices of speech and noise are respectively obtained, the corresponding statistic of noisy speech can be obtained by performing parameter combination operations as follows:
$$\hat{\mu}_i = g\mu_i + \tilde{\mu}_i \quad \text{and}$$
$$\hat{\sigma}_{ij} = g^2\sigma_{ij} + \tilde{\sigma}_{ij},$$
where $g$ is a scaling factor that provides the power matching between the training data and the test utterance, $\tilde{\mu}_i$ is the $i$th noise component, and $\tilde{\sigma}_{ij}$ is the $ij$th variance component. Thereafter, the log-spectral mean vector and variance of the noisy speech can be obtained by taking the inverse transformation as follows:
$$\hat{\mu}_i^l = \log(\hat{\mu}_i) - 0.5\,\hat{\sigma}_{ii}^l \quad \text{and}$$
$$\hat{\sigma}_{ij}^l = \log\left(\frac{\hat{\sigma}_{ij}}{\hat{\mu}_i \hat{\mu}_j} + 1\right).$$
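The mapping into the linear spectral domain, the parameter combination, and the inverse mapping can be sketched in a few lines of standard-library Python. The function names are hypothetical and the numbers are illustrative only. The inputs are assumed to be already in the log-spectral domain (speech) and the linear spectral domain (noise).

```python
import math


def log_to_linear(mu_l, sigma_l):
    """Map log-spectral Gaussian parameters to the linear spectral domain."""
    n = len(mu_l)
    # mu_i = exp(mu_i^l + sigma_ii^l / 2)
    mu = [math.exp(mu_l[i] + sigma_l[i][i] / 2.0) for i in range(n)]
    # sigma_ij = mu_i * mu_j * (exp(sigma_ij^l) - 1)
    sigma = [[mu[i] * mu[j] * (math.exp(sigma_l[i][j]) - 1.0)
              for j in range(n)] for i in range(n)]
    return mu, sigma


def combine(mu_s, sigma_s, mu_n, sigma_n, g=1.0):
    """Combine speech and noise statistics in the linear spectral domain."""
    n = len(mu_s)
    mu_hat = [g * mu_s[i] + mu_n[i] for i in range(n)]
    sigma_hat = [[g * g * sigma_s[i][j] + sigma_n[i][j]
                  for j in range(n)] for i in range(n)]
    return mu_hat, sigma_hat


def linear_to_log(mu_hat, sigma_hat):
    """Map linear-spectral parameters back to the log-spectral domain."""
    n = len(mu_hat)
    # The covariance is computed first because the mean uses its diagonal.
    sigma_l = [[math.log(sigma_hat[i][j] / (mu_hat[i] * mu_hat[j]) + 1.0)
                for j in range(n)] for i in range(n)]
    mu_l = [math.log(mu_hat[i]) - 0.5 * sigma_l[i][i] for i in range(n)]
    return mu_l, sigma_l


# Illustrative two-channel example (values are made up).
speech_mu_l = [0.0, 1.0]
speech_sigma_l = [[0.2, 0.0], [0.0, 0.4]]
noise_mu = [0.5, 0.5]
noise_sigma = [[0.01, 0.0], [0.0, 0.01]]

speech_mu, speech_sigma = log_to_linear(speech_mu_l, speech_sigma_l)
noisy_mu, noisy_sigma = combine(speech_mu, speech_sigma, noise_mu, noise_sigma)
noisy_mu_l, noisy_sigma_l = linear_to_log(noisy_mu, noisy_sigma)
```

With zero noise and `g = 1.0`, `linear_to_log` returns the original log-spectral parameters, which is a convenient sanity check that the two mappings are inverses of each other.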
Finally, the cepstral mean vector and covariance matrix of noisy speech can be obtained by taking the discrete cosine transform (DCT) as follows:
$$\hat{\mu}^c = C\hat{\mu}^l \quad \text{and}$$
$$\hat{\Sigma}^c = C\hat{\Sigma}^l C^T.$$
From the aforementioned process, it is known that the PMC method obtains the noisy speech models by estimating the statistics of the background noise in the speech inactive period, which decreases the computation amount relative to retraining. In practice, however, the computation required to adjust all the probability density functions with the PMC method is still relatively large, especially when the number of models is large. In order to effectively reduce the time for model adaptation, an improved PMC method has been proposed that reduces the number of PMC operations by introducing distribution composition based on the spatial relation of distributions. A description of this improved PMC method can be found in Komori, Y., Kosaka, T., Yamamoto, H., and Yamada, M., "Fast parallel model combination noise adaptation processing", Proceedings of Eurospeech 97, pp. 1523-1526, 1997, which is hereby incorporated herein by reference. Furthermore, Vaseghi, S. V. and Milner, B. P., "Noise-adaptive hidden Markov models based on Wiener filters", Proceedings of Eurospeech 93, pp. 1023-1026, 1993, incorporated herein by reference, reduces the computation amount of the PMC method by adapting only the mean vectors without adjusting the variances. These methods use fewer adaptation parameters than the original PMC method, and thus the recognition rate for noisy speech is not satisfactory. Therefore, it is desirable to provide an improved speech recognition method to mitigate and/or obviate the aforementioned problems.
It is an object of the present invention to provide a method of model adaptation for noisy speech recognition, which is able to perform an adaptation process with a relatively low computation amount, while maintaining a sufficient speech recognition rate.
To achieve the object, the present invention provides a method of model adaptation for noisy speech recognition that determines the cepstral mean vector and covariance matrix of adapted noisy speech from the cepstral mean vectors and covariance matrices of speech and noise. The method first transforms the cepstral mean vectors of noise and speech into the linear spectral domain, respectively. Then, the method combines the linear spectral mean vectors of noise and speech to obtain a linear spectral mean vector of noisy speech. Next, the method transforms the linear spectral mean vector of noisy speech from the linear spectral domain into the cepstral domain, so as to determine the cepstral mean vector of adapted noisy speech. Finally, the method multiplies the cepstral covariance matrices of speech and noise by a first and a second scaling factor, respectively, and combines the multiplied cepstral covariance matrices, so as to determine the cepstral covariance matrix of adapted noisy speech.
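The four steps above can be sketched as follows, under stated assumptions. This passage does not give the exact domain mapping or the values of the two scaling factors. The sketch therefore assumes a simple element-wise exponential and logarithm around caller-supplied DCT and IDCT matrices, and it treats `alpha` and `beta` as hypothetical inputs. It is an illustration only, not the claimed implementation.

```python
import math


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def adapt_mean(c, c_inv, mu_speech_c, mu_noise_c, g=1.0):
    """Steps 1-3: adapt the cepstral mean through the linear spectral domain.

    c, c_inv : DCT and IDCT matrices supplied by the caller (assumed)
    g        : optional gain on the speech term (assumed, default 1.0)
    """
    # Step 1: cepstral -> linear spectral (assumed mapping: exp of the IDCT).
    lin_speech = [math.exp(x) for x in mat_vec(c_inv, mu_speech_c)]
    lin_noise = [math.exp(x) for x in mat_vec(c_inv, mu_noise_c)]
    # Step 2: combine speech and noise means in the linear spectral domain.
    lin_noisy = [g * s + n for s, n in zip(lin_speech, lin_noise)]
    # Step 3: linear spectral -> cepstral (log, then DCT).
    return mat_vec(c, [math.log(y) for y in lin_noisy])


def adapt_covariance(sigma_speech_c, sigma_noise_c, alpha, beta):
    """Step 4: weighted sum of the cepstral covariance matrices.

    alpha, beta : first and second scaling factors (hypothetical values;
                  how they are chosen is not stated in this passage)
    """
    return [[alpha * s + beta * n for s, n in zip(row_s, row_n)]
            for row_s, row_n in zip(sigma_speech_c, sigma_noise_c)]


# Illustrative example with an identity transform standing in for the DCT.
identity = [[1.0, 0.0], [0.0, 1.0]]
adapted_mean = adapt_mean(identity, identity, [0.0, 0.0], [0.0, 0.0])
adapted_cov = adapt_covariance([[2.0, 0.0], [0.0, 2.0]],
                               [[4.0, 0.0], [0.0, 4.0]],
                               alpha=0.5, beta=0.5)
```

The covariance step involves only a scaled matrix sum in the cepstral domain. It therefore avoids the per-component exponential and logarithm operations that the full PMC covariance update requires, which is consistent with the stated aim of a relatively low computation amount.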
The above and other objects, features and advantages of the present invention will become apparent from the following detailed description taken with the accompanying drawings.