This invention relates to speech recognition and more particularly to the determination of an utterance recognition parameter.
Referring to FIG. 1, there is illustrated a block diagram of a speech recognition system comprising a source 13 of Hidden Markov Models (HMMs) and input speech applied to a recognizer 11. The result is recognized speech, such as text. One source of degradation in recognizing the input speech is distortion due to transducer differences, channel, and speaker variability. Because this distortion is assumed to be additive in the log domain, utterance-based mean normalization in the log domain (or in any linear transformation of the log domain, for example, the cepstral domain) has been proposed to improve recognizers' robustness. See, for example, S. Furui, "Cepstral Analysis Technique for Automatic Speaker Verification," IEEE Trans. Acoust., Speech and Signal Processing, ASSP-29(2):264-272, 1981. Due to its computational simplicity and the substantial improvement in results, such mean normalization has become a standard processing technique for most recognizers.
To do such normalization, the utterance log-spectral mean must be computed over all N frames:
$$\bar{c}_N \triangleq \frac{1}{N}\sum_{i=1}^{N} c_i \qquad (1)$$
where $c_n$ is the nth log-spectral vector. The log-spectral vectors are produced by sampling the incoming speech, taking a block or window of samples, performing a discrete Fourier transform on these samples, and taking the logarithm of the transform output.
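The steps just described (windowing, discrete Fourier transform, logarithm) can be sketched as follows. This is only an illustration under stated assumptions: the function name `log_spectral_vectors`, the frame length, the hop size, the use of the magnitude spectrum, and the small floor added before the logarithm are choices made here, not details given in the text.

```python
import cmath
import math


def log_spectral_vectors(samples, frame_len=8, hop=4, floor=1e-10):
    """Split samples into overlapping frames and return log-magnitude spectra.

    frame_len, hop and floor are illustrative assumptions, not values
    taken from the text.
    """
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        vec = []
        # A real-valued frame has frame_len // 2 + 1 non-redundant DFT bins.
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * t / frame_len)
                      for t, x in enumerate(frame))
            # The floor keeps the logarithm finite for near-zero bins.
            vec.append(math.log(abs(acc) + floor))
        vectors.append(vec)
    return vectors


# Hypothetical input: a short sinusoid standing in for sampled speech.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(32)]
frames = log_spectral_vectors(signal)
```

A direct DFT is used here only to keep the sketch dependency-free; a practical front end would normally use an FFT and a tapered window.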
This technique is not suitable for on-line, real-time operation because it requires the utterance mean, so the normalized vectors cannot be produced until the whole utterance has been observed. In Equation 1, $\bar{c}_N$ is the log-spectral vector averaged over N windows. Because N spans all frames of the utterance, application to real-time systems is limited.
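A minimal sketch of utterance-based mean normalization per Equation 1 makes the limitation concrete: the mean depends on every frame, so no normalized vector can be emitted before the last frame arrives. The function names and the toy two-dimensional vectors are hypothetical.

```python
def utterance_mean(vectors):
    """Equation 1: component-wise mean over all N frames."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def batch_mean_normalize(vectors):
    """Subtract the utterance mean from every frame.

    The whole utterance must be available before any output is produced.
    """
    mean = utterance_mean(vectors)
    return [[v[d] - mean[d] for d in range(len(mean))] for v in vectors]


# Hypothetical three-frame utterance of 2-dimensional log-spectral vectors.
utterance = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
normalized = batch_mean_normalize(utterance)
# normalized == [[-1.0, -2.0], [0.0, 0.0], [1.0, 2.0]]
```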
To solve this problem, sequential estimation of the mean vector with exponential smoothing techniques has been disclosed. See M. G. Rahim and B. H. Juang, "Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition," IEEE Trans. on Speech and Audio Processing, 4(1):19-30, Jan. 1996. In sequential determination, the estimate is refined as more vectors are observed, as follows:
$$\bar{c}_n = \alpha \cdot \underbrace{\bar{c}_{n-1}}_{\text{past estimate}} + (1-\alpha) \cdot \underbrace{c_n}_{\text{current input vector}} \qquad (2)$$
and the mean-subtracted vector:
$$\hat{c}_n = c_n - \bar{c}_n \qquad (3)$$
where $\bar{c}_n$ is an estimate of the mean up to frame n and $\alpha$ is a weighting value between zero and one.
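Equations 2 and 3 can be sketched as a frame-by-frame loop that emits each mean-subtracted vector as soon as its frame arrives. The generator name `sequential_mean_normalize` and the sample values are hypothetical; a fixed `alpha` is assumed here for simplicity.

```python
def sequential_mean_normalize(vectors, initial_mean, alpha):
    """Yield mean-subtracted vectors one frame at a time.

    Equation 2 updates the running mean estimate; Equation 3 subtracts it
    from the current input vector.
    """
    mean = list(initial_mean)
    for c in vectors:
        # Equation 2: blend the past estimate with the current input vector.
        mean = [alpha * m + (1.0 - alpha) * x for m, x in zip(mean, c)]
        # Equation 3: subtract the updated estimate from the current vector.
        yield [x - m for x, m in zip(c, mean)]


# Hypothetical one-dimensional frames, zero initial mean, alpha = 0.5.
outputs = list(sequential_mean_normalize([[2.0], [4.0]],
                                         initial_mean=[0.0], alpha=0.5))
# outputs == [[1.0], [1.5]]
```

Unlike batch normalization, each output depends only on frames already seen, which is what makes the scheme usable on-line.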
Among the choices for the initial mean $\bar{c}_0$ and the weighting factor $\alpha$, the prior art discusses two cases.
The first is the cumulative mean removal case, where
$$\bar{c}_0 = 0 \quad \text{and} \quad \alpha = \frac{n-1}{n} \qquad (4)$$
Equation 2 reduces to
$$\bar{c}_n = \bar{m}_n \triangleq \frac{1}{n}\sum_{i=1}^{n} c_i \qquad (5)$$
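As a brief check of this reduction (a derivation sketch using only Equations 2 and 4), substituting $\alpha = (n-1)/n$ into Equation 2 and multiplying through by n gives a recursion that, applied repeatedly down to n = 1 where the term in $\bar{c}_0$ vanishes, yields the running average:

```latex
\begin{aligned}
\bar{c}_n &= \frac{n-1}{n}\,\bar{c}_{n-1} + \frac{1}{n}\,c_n \\
n\,\bar{c}_n &= (n-1)\,\bar{c}_{n-1} + c_n \\
n\,\bar{c}_n &= \sum_{i=1}^{n} c_i
\quad\Longrightarrow\quad
\bar{c}_n = \frac{1}{n}\sum_{i=1}^{n} c_i
\end{aligned}
```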
In this case, at time n, the mean vector is approximated by the mean of all vectors observed up to time n. For large n, Equation 5 gives a mean that is very close to the true utterance mean; that is, it converges to the utterance mean of Equation 1. On the other hand, when $\bar{c}_0 = 0$, no prior knowledge of the mean is used, which makes the mean estimate unreliable for short utterances. The second case, called exponential smoothing, sets
$$\bar{c}_0 = \text{mean vector over training data, and } \alpha \text{ is between 0 and 1.} \qquad (6)$$
Rearranging Equation 2, we get
$$\bar{c}_n = \alpha^{n} \cdot \bar{c}_0 + (1-\alpha)\sum_{i=1}^{n} \alpha^{n-i} \cdot c_i \qquad (7)$$
The second term of Equation 7 is a weighted sum of all vectors observed up to time n. Due to the exponential decay of the weights $\alpha^{n-i}$, only the most recently observed vectors are dominant contributors to the sum, while the more distant past vectors contribute very little. Consequently, for large n the mean given by Equation 7 will not usually be close to the true utterance mean; that is, asymptotically, exponential smoothing does not give the utterance mean.
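The contrast between the two cases can be illustrated numerically. The sketch below uses scalars in place of vectors (the updates are component-wise), and the test signal, the initial mean of 0.5, and alpha = 0.9 are all hypothetical values chosen for illustration. It also checks that the recursion of Equation 2 agrees with the closed form of Equation 7.

```python
def cumulative_mean(values):
    """Equation 2 with the settings of Equation 4 (cumulative mean removal)."""
    mean = 0.0  # initial mean of zero
    for n, c in enumerate(values, start=1):
        alpha = (n - 1) / n  # frame-dependent weight
        mean = alpha * mean + (1.0 - alpha) * c
    return mean


def exponential_mean(values, initial_mean, alpha):
    """Equation 2 with a fixed alpha (exponential smoothing)."""
    mean = initial_mean
    for c in values:
        mean = alpha * mean + (1.0 - alpha) * c
    return mean


def exponential_mean_closed_form(values, initial_mean, alpha):
    """Equation 7: the same estimate written as an explicit weighted sum."""
    n = len(values)
    weighted = sum(alpha ** (n - i) * c
                   for i, c in enumerate(values, start=1))
    return alpha ** n * initial_mean + (1.0 - alpha) * weighted


# Hypothetical scalar "utterance": 200 frames at 0.0, then 200 frames at 1.0.
values = [0.0] * 200 + [1.0] * 200
true_mean = sum(values) / len(values)  # 0.5

cum = cumulative_mean(values)          # tracks the utterance mean
exp = exponential_mean(values, initial_mean=0.5, alpha=0.9)
closed = exponential_mean_closed_form(values, initial_mean=0.5, alpha=0.9)
```

In this example the cumulative estimate ends at the utterance mean, while the exponentially smoothed estimate ends near 1.0, the value of the most recent frames, even though its initial mean happened to equal the true mean.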
In accordance with one embodiment of the present invention, an estimate of the utterance mean is determined by maximum a posteriori probability (MAP) estimation. This MAP estimate is subtracted from the log-spectral vector of the incoming signal, and the result is applied to a speech recognizer in a speech recognition system.