1. Field of the Invention
The present invention relates to the field of speech recognition and, more particularly, to an adaptive speech recognition method with noise compensation.
2. Description of Related Art
Robustness is a crucial issue in pattern recognition because, in real-world applications, a mismatch between training and testing data can severely degrade recognition performance. In speech recognition, the mismatch arises from inter- and intra-speaker variability, transducers/channels, and surrounding noise. For instance, in a hands-free voice interface in a car, the non-stationary surrounding noise from the engine, music, babble, wind, and echo varies with driving speed and hence deteriorates the performance of the recognizer.
A direct solution is to collect enough training data under various noise conditions to generate speech models, so that the proper speech models can be selected according to the environment of a specific application. However, such a method is impractical in a car environment because of the complexity of the noise and the tremendous amount of training data that must be collected. In addition, the method requires an additional mechanism to detect changes in the environment, and such an environment detector is difficult to design.
Alternatively, a feasible approach is to build an adaptive speech recognizer where the speech models can be adapted to new environments using environment-specific adaptation data.
In the context of statistical speech recognition, the optimal word sequence $\hat{W}$ of an input utterance $X=\{x_t\}$ is determined according to the Bayes rule:
$$\hat{W}=\arg\max_{W}\,p(W|X)=\arg\max_{W}\,p(X|W)\,p(W), \tag{1}$$
where $p(X|W)$ is the occurrence probability of $X$ when the word sequence of $X$ is $W$, and $p(W)$ is the prior probability of the word sequence $W$. The description of such a technique can be found in RABINER, L. R.: ‘A tutorial on hidden Markov models and selected applications in speech recognition’, Proceedings of the IEEE, 1989, vol. 77, pp. 257-286, which is incorporated herein for reference. Using a Markov chain to describe the change of the speech feature parameters, $p(X|W)$ can be further expressed, based on HMM (Hidden Markov Model) theory, as follows:
$$p(X|W)=\sum_{\text{all } S}p(X,S|W)=\sum_{\text{all } S}p(X|S,W)\,p(S|W), \tag{2}$$
where S is the state sequence of the speech signal X.
In general, the computations of (1) and (2) are very expensive and practically intractable because all possible state sequences $S$ must be considered. One efficient approach is to apply the Viterbi algorithm and decode the optimal state sequence $\hat{S}=\{\hat{s}_t\}$, as described in VITERBI, A. J.: ‘Error bounds for convolutional codes and an asymptotically optimum decoding algorithm’, IEEE Trans. Information Theory, 1967, vol. IT-13, pp. 260-269, which is incorporated herein for reference. As such, the summation over all possible state sequences in (2) is approximated by the single most likely state sequence, i.e.
$$p(X|W)\cong p(X|\hat{S},W)\,p(\hat{S}|W)=\pi_{\hat{s}_0}\prod_{t=1}^{T}a_{\hat{s}_{t-1}\hat{s}_t}\,b_{\hat{s}_t}(x_t), \tag{3}$$
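The single-best-path approximation in (3) can be illustrated with a minimal log-domain sketch. This is not the implementation of any particular recognizer: the function name `viterbi_log` and the list-of-lists layout are assumptions made for illustration, and the word-level constraint W is assumed to be already folded into the transition table. Following (3), state $\hat{s}_0$ carries only the initial probability, and frames $x_1,\dots,x_T$ are emitted by $\hat{s}_1,\dots,\hat{s}_T$.

```python
import math

def viterbi_log(log_pi, log_a, log_b):
    """Return (best log score, best state path s_0..s_T) for approximation (3).

    log_pi[i]   : log initial probability of state i (state s_0)
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log observation density of frame x_{t+1} in state j
    (hypothetical data layout, chosen only for this sketch)
    """
    n_states = len(log_pi)
    delta = list(log_pi)   # best log score of any partial path ending in each state
    backptr = []           # backptr[t][j] = best predecessor of state j for frame t+1
    for frame_scores in log_b:
        new_delta, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] + log_a[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + log_a[best_i][j] + frame_scores[j])
        delta = new_delta
        backptr.append(pointers)
    # Trace back from the best final state to recover the whole path.
    state = max(range(n_states), key=lambda j: delta[j])
    best_score = delta[state]
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return best_score, path

# Toy two-state model with two frames (all numbers are made up).
log_pi = [math.log(0.6), math.log(0.4)]
log_a = [[math.log(0.7), math.log(0.3)],
         [math.log(0.4), math.log(0.6)]]
log_b = [[math.log(0.5), math.log(0.1)],
         [math.log(0.1), math.log(0.6)]]
score, path = viterbi_log(log_pi, log_a, log_b)
print(path, math.exp(score))
```

Working in the log domain turns the product in (3) into a sum, which avoids numerical underflow for long utterances; the cost grows linearly with the number of frames rather than exponentially as in the full sum (2).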
where $\pi_{\hat{s}_0}$ is the initial state probability, $a_{\hat{s}_{t-1}\hat{s}_t}$ is the state transition probability and $b_{\hat{s}_t}(x_t)$ is the observation probability density function of $x_t$ in state $\hat{s}_t$, which is modeled by a mixture of multivariate Gaussian densities; that is:
$$b_{\hat{s}_t}(x_t)=p(x_t|\hat{s}_t=i,W)=\sum_{k=1}^{K}\omega_{ik}\,f(x_t|\theta_{ik})=\sum_{k=1}^{K}\omega_{ik}\,N(x_t|\mu_{ik},\Sigma_{ik}). \tag{4}$$
Herein, $\omega_{ik}$ is the mixture weight, and $\mu_{ik}$ and $\Sigma_{ik}$ are respectively the mean vector and covariance matrix of the k-th mixture density function for the state $\hat{s}_t=i$. The occurrence probability $f(x_t|\theta_{ik})$ of frame $x_t$ associated with the density function $\theta_{ik}=(\mu_{ik},\Sigma_{ik})$ is expressed by:
$$f(x_t|\theta_{ik})=(2\pi)^{-D/2}\,|\Sigma_{ik}|^{-1/2}\exp\left[-\tfrac{1}{2}(x_t-\mu_{ik})'\,\Sigma_{ik}^{-1}(x_t-\mu_{ik})\right]. \tag{5}$$
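As a rough illustration of (4) and (5), the sketch below evaluates the log of the mixture observation density for one frame. It assumes diagonal covariance matrices (each $\Sigma_{ik}$ stored as a list of variances), which the text does not require; the function names and argument layout are likewise hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of density (5), assuming a diagonal covariance with variances `var`."""
    dim = len(x)
    log_det = sum(math.log(v) for v in var)               # log |Sigma|
    mahalanobis = sum((xi - mi) ** 2 / vi
                      for xi, mi, vi in zip(x, mean, var))  # (x-mu)' Sigma^-1 (x-mu)
    return -0.5 * (dim * math.log(2.0 * math.pi) + log_det + mahalanobis)

def log_observation_density(x, weights, means, variances):
    """Log of mixture (4) for one state, computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Toy state with K = 2 mixture components in D = 2 dimensions (made-up numbers).
frame = [1.0, 2.0]
print(log_observation_density(frame,
                              weights=[0.3, 0.7],
                              means=[[0.0, 0.0], [1.0, 2.5]],
                              variances=[[1.0, 4.0], [0.5, 0.5]]))
```

Subtracting the largest term before exponentiating keeps the sum in (4) numerically stable when individual component likelihoods are very small.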
The construction of a speech recognition system is achieved by determining the HMM parameters, such as $\{\mu_{ik},\Sigma_{ik}\}$, $\{\omega_{ik}\}$ and $\{a_{ij}\}$. The speech recognition system then operates by using the Viterbi algorithm to determine the optimal word sequence for the input speech. However, surrounding noise causes a mismatch between the speech features of the application environment and those of the training environment. As a result, the established HMMs cannot correctly describe the input speech, and the recognition rate decreases. Particularly in the car environment, the noise is so adverse that the performance of the speech recognition system is much lower than in a clean environment. Therefore, in order to implement, for example, a human-machine voice interface in car environments, an adaptive speech recognition method with noise compensation is desired, so as to improve the recognition rate.
Moreover, Mansour and Juang observed that additive white noise causes the norm of the speech cepstral vector to shrink, and a description of such can be found in MANSOUR, D. and JUANG, B.-H.: ‘A family of distortion measures based upon projection operation for robust speech recognition’, IEEE Trans. Acoustics, Speech, and Signal Processing, 1989, vol. 37, pp. 1659-1671, which is incorporated herein for reference. They consequently designed a distance measure in which a scaling factor was introduced to compensate for the cepstral shrinkage in cepstrum-based speech recognition. This approach was further extended to the adaptation of HMM parameters by determining an equalization scalar $\lambda$ between the probability density function unit $\theta_{ik}$ and the noisy speech frame $x_t$, as described in CARLSON, B. A. and CLEMENTS, M. A.: ‘A projection-based likelihood measure for speech recognition in noise’, IEEE Transactions on Speech and Audio Processing, 1994, vol. 2, no. 6, pp. 97-102, which is incorporated herein for reference. The probability measurement in (5) is modified to:
$$f(x_t|\lambda,\theta_{ik})=(2\pi)^{-D/2}\,|\Sigma_{ik}|^{-1/2}\exp\left[-\tfrac{1}{2}(x_t-\lambda\mu_{ik})'\,\Sigma_{ik}^{-1}(x_t-\lambda\mu_{ik})\right]. \tag{6}$$
The optimal equalization factor $\lambda_e$ is determined by directly maximizing the logarithm of (6) as follows:
$$\lambda_e=\arg\max_{\lambda}\,\log f(x_t|\lambda,\theta_{ik})=\frac{x_t'\,\Sigma_{ik}^{-1}\mu_{ik}}{\mu_{ik}'\,\Sigma_{ik}^{-1}\mu_{ik}}. \tag{7}$$
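The closed form in (7) follows from setting the derivative of the quadratic exponent in (6) with respect to $\lambda$ to zero. The sketch below computes it and checks numerically that it maximizes the log of (6). It again assumes a diagonal $\Sigma_{ik}$ stored as a list of variances, and both function names are illustrative rather than taken from the cited work.

```python
import math

def equalization_factor(x, mean, var):
    """Closed-form lambda_e of (7), assuming a diagonal covariance.

    The mean vector must be non-zero, otherwise the denominator vanishes.
    """
    numerator = sum(xi * mi / vi for xi, mi, vi in zip(x, mean, var))    # x' Sigma^-1 mu
    denominator = sum(mi * mi / vi for mi, vi in zip(mean, var))         # mu' Sigma^-1 mu
    return numerator / denominator

def log_density_scaled(x, lam, mean, var):
    """Log of (6): Gaussian log density with the mean scaled by `lam`."""
    dim = len(x)
    log_det = sum(math.log(v) for v in var)
    mahalanobis = sum((xi - lam * mi) ** 2 / vi
                      for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (dim * math.log(2.0 * math.pi) + log_det + mahalanobis)

# Toy noisy frame whose norm has shrunk relative to the model mean (made-up numbers).
frame, mean, var = [0.9, 2.2], [2.0, 4.0], [1.0, 3.0]
lam_e = equalization_factor(frame, mean, var)
print(lam_e, log_density_scaled(frame, lam_e, mean, var))
```

Because the log of (6) is a concave quadratic in $\lambda$, the stationary point given by (7) is its unique maximum, so no iterative search is needed per frame and density.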
Geometrically, this factor is equivalent to the projection of $x_t$ upon $\mu_{ik}$ weighted by $\Sigma_{ik}^{-1}$. The use of $\lambda_e$ to compensate for the influence of white noise has proved helpful in increasing the speech recognition rate. However, for speech recognition in car environments, the surrounding noise is non-white and difficult to characterize. It is thus insufficient to adapt the HMM mean vector $\mu_{ik}$ by applying only the optimal equalization scalar $\lambda_e$. Therefore, there is a need for the above speech recognition method to be improved.
The object of the present invention is to provide an adaptive speech recognition method with noise compensation for effectively improving the speech recognition rate in a noisy environment.
To achieve the object, the adaptive speech recognition method with noise compensation in accordance with the present invention is capable of compensating for noise in an input speech by adjusting parameters of an HMM speech model. The method includes the following steps: (A) determining, based on the plurality of speech frames of the input speech and the speech model, optimal equalization factors for feature vectors of the plurality of speech frames corresponding to each probability density function in the speech model; and (B) adapting the parameters of the speech model by the optimal equalization factor and a bias compensation vector corresponding to and retrieved by the optimal equalization factor, wherein the optimal equalization factor is provided to adjust a distance of the mean vector in the speech model, and the bias compensation vector is provided to adjust a direction change of the mean vector in the speech model.