1. Field of the Invention
The present invention generally relates to a telephone speech recognition system. More particularly, the present invention relates to a device and method of channel effect compensation for telephone speech recognition.
2. Description of the Related Art
In a speech recognition application operating over a telephone network, speech signals are input from the handset of a telephone and transmitted through a telephone line to a remote speech recognition system for recognition. The path traversed by the speech signals includes the telephone handset and the telephone line, which together are referred to as the "telephone channel" or "channel". In terms of signal transmission, the characteristics of the telephone channel affect the speech signals during transmission; this is referred to as the "telephone channel effect" or "channel effect". Mathematically, the speech signal is convolved with the impulse response of the telephone channel.
FIG. 1 is a diagram illustrating a typical telephone speech recognition system. As shown in FIG. 1, a speech signal x(t) sent by the calling party becomes a telephone speech signal y(t) after passing through the telephone channel 1, which comprises the telephone handset and the telephone line, and is input to the recognition system 10 for further processing. The recognition result R is generated by the recognition system 10. Assuming the impulse response of the telephone channel 1 to be h(t), the relationship between the speech signal x(t) and the telephone speech signal y(t) can be represented by:
$$y(t) = x(t) \otimes h(t) \tag{1}$$
where the symbol "⊗" represents the convolution operator. Most importantly, the impulse response h(t) of the telephone channel 1 varies with the caller's handset and with the transmission path of the speech signals in the telephone network (the transmission path being determined by the switching equipment). In other words, the same phone call (the same speech signal x(t)) will generate different telephone speech signals y(t) when passed through different telephone channels (different impulse responses h(t)). This environmental variation affects the recognition rate of the recognition system 10. Therefore, compensation of the telephone channel effect should be performed before telephone speech recognition in order to reduce this environmental variation.
The principle of typical telephone channel effect compensation is briefly described in the following. Equation (1) represents the relationship between the speech signal x(t) and the telephone speech signal y(t) in the time domain. If equation (1) is transformed to the spectral domain, it can be represented by:
$$Y(f) = X(f) \cdot |H(f)|^{2} \tag{2}$$
where X(f) and Y(f) represent the power spectra of the speech signal x(t) and the telephone speech signal y(t), respectively, and H(f) represents the transfer function of the telephone channel 1.
The following logarithmic spectral relation is obtained by taking the logarithm of both sides of equation (2):
$$\log[Y(f)] = \log[X(f)] + \log\left[|H(f)|^{2}\right] \tag{3}$$
The following is obtained when an inverse Fourier transform is used to project equation (3) onto the cepstral domain:
$$c_y(\tau) = c_x(\tau) + c_h(\tau) \tag{4}$$
where c_x(τ), c_y(τ), and c_h(τ) are the respective cepstral vectors of x(t), y(t), and h(t).
From equations (3) and (4), in the logarithmic spectral and cepstral domains, the influence of the telephone channel upon the speech signals in transmission can be described as an additive bias. Therefore, most current telephone channel effect compensation techniques are based upon this principle; they differ in how the bias is estimated and how it is eliminated.
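The additive-bias relation can be checked numerically. The following is a minimal sketch, not part of the described system: it assumes a single frame, a hypothetical four-tap impulse response, and circular convolution so that the discrete Fourier transform turns the convolution of equation (1) into an exact product. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)           # stand-in for one frame of the speech signal x(t)
h = np.zeros(n)
h[:4] = [1.0, 0.5, 0.25, 0.125]      # hypothetical channel impulse response h(t)

# Equation (1), using circular convolution so the DFT product rule holds exactly.
y = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

def log_power_spectrum(signal):
    """Logarithm of the power spectrum, as in equation (3)."""
    return np.log(np.abs(np.fft.fft(signal)) ** 2)

def cepstrum(signal):
    """Inverse Fourier transform of the log power spectrum, as in equation (4)."""
    return np.real(np.fft.ifft(log_power_spectrum(signal)))

# The channel shows up as a fixed offset in both domains.
log_bias = log_power_spectrum(y) - log_power_spectrum(x)
cep_bias = cepstrum(y) - cepstrum(x)
```

Under these assumptions, `log_bias` equals the log power spectrum of h and `cep_bias` equals the cepstrum of h, independently of the content of x.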
FIG. 2 is a block diagram illustrating a conventional telephone speech recognition system. As shown in the figure, the telephone speech recognition system comprises a feature analysis section 100, a channel effect compensation section 102, and a recognizer 104 (comprising a speech recognition section 104a for speech recognition and acoustic models 104b). The feature analysis section 100 first blocks the received telephone speech signal y(t) into frames, performs feature analysis on each telephone speech frame, and generates a corresponding feature vector o(t). In accordance with equations (3) and (4) above, the feature vector o(t) may be a logarithmic spectral vector or a cepstral vector. The channel effect compensation section 102 subsequently compensates the feature vector o(t), and the resulting feature vector ô(t) is input to the recognizer 104. The speech recognition section 104a performs the actual speech recognition according to the acoustic models 104b and generates the desired recognition result R.
The three most popular telephone channel effect compensation techniques are the relative spectral technique (RASTA), cepstral mean normalization (CMN), and signal bias removal (SBR). The first technique adopts a fixed filter, whereas the last two techniques calculate the bias from the feature vectors of a telephone speech signal. These conventional techniques are briefly described below with reference to the following publications, the contents of which are expressly incorporated herein by reference.
(A) RASTA: Refer to H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 578-589, 1994 for details. RASTA uses filters to eliminate the low-frequency components contained in the logarithmic spectral vectors or cepstral vectors, that is, the bias introduced by the telephone channel, for the purpose of channel effect compensation. According to the aforementioned analysis, a bandpass infinite impulse response (IIR) filter expressed by the following equation (5) can perform quite well:
$$H(z) = 0.1 \times \frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{z^{-4}\,(1 - 0.98\,z^{-1})} \tag{5}$$
The purposes of using a bandpass filter are twofold: firstly, for filtering out the bias by highpass filtering; and secondly, for smoothing the rapidly changing spectra by lowpass filtering. If only the telephone channel effect compensation is considered, only highpass filtering need be used. In this case, the transfer function of the highpass filter can be represented as follows:
$$H(z) = \frac{1 - z^{-1}}{1 - (1 - \lambda)\,z^{-1}} \tag{6}$$
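As an illustration only, the highpass filter of equation (6) corresponds to the difference equation out[t] = in[t] - in[t-1] + (1 - λ)·out[t-1], applied along the time axis of each feature component. The sketch below assumes λ = 0.02 and initializes the filter state with the first frame; both choices, and the function name, are assumptions made for the example rather than values given in this description.

```python
import numpy as np

def highpass_compensate(features, lam=0.02):
    """Filter each feature trajectory with H(z) = (1 - z^-1) / (1 - (1 - lam) z^-1).

    features: array of shape (frames, dims), one feature vector per frame.
    lam: assumed value of the filter parameter (hypothetical default).
    """
    features = np.asarray(features, dtype=float)
    out = np.zeros_like(features)
    prev_in = features[0]                  # assumed initial state: in[-1] = in[0]
    prev_out = np.zeros_like(features[0])
    for t in range(len(features)):
        out[t] = features[t] - prev_in + (1.0 - lam) * prev_out
        prev_in = features[t]
        prev_out = out[t]
    return out

rng = np.random.default_rng(1)
clean = rng.standard_normal((200, 12))     # 200 frames of 12-dimensional vectors
bias = np.linspace(-1.0, 1.0, 12)          # hypothetical constant channel bias
filtered_clean = highpass_compensate(clean)
filtered_biased = highpass_compensate(clean + bias)
```

Because the numerator 1 - z^-1 has a zero at DC, a constant bias added to every frame leaves the filter output unchanged, so `filtered_biased` matches `filtered_clean` up to rounding.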
RASTA has the advantage that it can be easily realized without causing response time delay problems. Its disadvantage, however, is that the frequency band of the filter is predetermined and cannot be adjusted according to the input telephone speech signal. Therefore, some useful speech information may also be removed when the bias introduced by the telephone channel effect is filtered out, which degrades the recognition result. As a result, a telephone speech recognition system using the RASTA compensation method achieves poorer recognition results than one using the CMN or SBR compensation methods.
(B) CMN: Refer to F. Liu, R. M. Stern, X. Huang and A. Acero, "Efficient cepstral normalization for robust speech recognition," Proc. of Human Language Technology, pp. 69-74, 1993 for details. CMN estimates the bias representing the characteristic of the telephone channel and eliminates the bias from the logarithmic spectral vectors or cepstral vectors of the telephone speech signal. In CMN, the bias is represented by the cepstral mean vector of the telephone speech signal. Since the bias is estimated from the telephone speech signal itself, the telephone channel characteristic can be acquired and a better compensation can be obtained. However, CMN assumes that the cepstral mean vector of the speech signal before passing through the telephone channel is a zero vector. Experimental results have demonstrated that this assumption is valid when the input speech signals are long enough. When the speech signals are not long enough, however, the phonetic information of the speech signals affects the estimation of the bias, and the compensation is therefore less effective.
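A minimal sketch of CMN as described above follows. It uses synthetic zero-mean frames in place of real cepstra, and a hypothetical channel bias, purely to illustrate how the length of the utterance affects the bias estimate; the function and variable names are illustrative.

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Estimate the bias as the mean cepstral vector and subtract it from every frame."""
    cepstra = np.asarray(cepstra, dtype=float)
    bias_estimate = cepstra.mean(axis=0)
    return cepstra - bias_estimate, bias_estimate

rng = np.random.default_rng(2)
true_bias = np.linspace(-1.0, 1.0, 12)                    # hypothetical channel bias

long_utt = rng.standard_normal((2000, 12)) + true_bias    # many frames
short_utt = rng.standard_normal((10, 12)) + true_bias     # few frames

compensated_long, est_long = cepstral_mean_normalization(long_utt)
_, est_short = cepstral_mean_normalization(short_utt)

error_long = np.linalg.norm(est_long - true_bias)
error_short = np.linalg.norm(est_short - true_bias)
```

With few frames, the mean of the underlying speech vectors is far from zero, so `error_short` is much larger than `error_long`; this is the short-utterance weakness noted above.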
(C) SBR: Refer to M. G. Rahim and B.-H. Juang, "Signal bias removal by maximum likelihood estimation for robust telephone speech recognition," IEEE Trans. Speech and Audio Processing, vol. 4, pp. 19-30, 1996 for details. The SBR algorithm estimates the bias in an iterative manner based upon the maximum likelihood criterion. The compensated logarithmic spectral vectors or cepstral vectors are likewise obtained iteratively, by subtracting the estimated bias from the logarithmic spectral vectors or cepstral vectors of the telephone speech signal. Compared with CMN, the SBR algorithm can estimate the bias more accurately; however, the response time delay for recognition is correspondingly longer. Further, since the SBR algorithm is based upon the maximum likelihood criterion, the estimation error of maximum likelihood may also affect the accuracy of the bias estimate when the telephone speech signals are not long enough.
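The iterative idea behind SBR can be sketched in simplified form. The example below is not the published algorithm in full: it assumes a small, hypothetical codebook of clean-speech centroids and uses the nearest centroid in Euclidean distance as a stand-in for the maximum likelihood assignment (the two coincide for equal, spherical Gaussian components).

```python
import numpy as np

def signal_bias_removal(features, codebook, n_iter=10):
    """Alternate between codeword assignment and bias re-estimation."""
    compensated = np.asarray(features, dtype=float).copy()
    codebook = np.asarray(codebook, dtype=float)
    total_bias = np.zeros(compensated.shape[1])
    for _ in range(n_iter):
        # Squared distance from every frame to every codeword.
        dist = ((compensated[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = codebook[dist.argmin(axis=1)]
        # Bias update: average residual between frames and their codewords.
        step = (compensated - nearest).mean(axis=0)
        compensated -= step
        total_bias += step
    return compensated, total_bias

rng = np.random.default_rng(3)
codebook = np.array([[-4.0, -4.0], [-4.0, 4.0], [4.0, -4.0], [4.0, 4.0]])  # hypothetical
labels = rng.integers(0, len(codebook), size=400)
clean = codebook[labels] + 0.3 * rng.standard_normal((400, 2))
true_bias = np.array([0.8, -0.5])                                           # hypothetical
observed = clean + true_bias

compensated, estimated_bias = signal_bias_removal(observed, codebook)
```

In this well-separated toy case the estimate settles after the first iteration; with real speech the assignments change between iterations, which is where the additional response delay comes from.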
The above-mentioned three telephone channel effect compensation techniques share a common drawback in that accurate results will not be achieved when the telephone speech signals are not long enough. Moreover, these techniques merely deal with the feature vectors without considering the connection with speech recognizers.
The object of the present invention is to provide a device and method of channel effect compensation that accurately estimate the bias representing the characteristic of the telephone channel.
According to the above object, the present invention provides a device and method of channel effect compensation for telephone speech recognition. The channel effect compensation device comprises:
a compensatory neural network for receiving an input signal and compensating the input signal with a bias to generate an output signal, wherein the compensatory neural network provides a plurality of first parameters to determine the bias.
During a training process, a feedback section could be coupled to the compensatory neural network for adjusting the first parameters according to the error between the bias and a target function.
During another training process, a recognizer could be coupled to the compensatory neural network, the recognizer having a speech recognition section for classifying the output signal according to a plurality of second parameters in acoustic models to generate a recognition result and thereby determine a recognition loss; and an adjustment section could be coupled to the compensatory neural network and the recognizer for adjusting the first parameters and the second parameters according to the recognition loss determined by the recognition result and an adjustment means.
Also, a method of compensating an input signal in a telephone speech recognition system comprises the following steps:
receiving the input signal by a compensatory neural network; determining a bias in response to a plurality of first parameters provided by the compensatory neural network; compensating the input signal with the bias; and sending out the compensated input signal as an output signal.
The first parameters may be determined during a training process, the training process comprising the steps of: receiving a plurality of feature vectors of a training telephone speech signal by the compensatory neural network; generating a bias in response to the first parameters provided by the compensatory neural network, the bias representing the characteristic of the telephone channel; comparing the bias with a target function to generate an error; and adjusting the first parameters by an error back-propagation algorithm according to a minimum mean square error criterion.
Alternatively, the first parameters are determined during another training process, the training process comprising the steps of: receiving a plurality of feature vectors of a training telephone speech signal by the compensatory neural network; generating a bias in response to the first parameters provided by the compensatory neural network, the bias representing the characteristic of the telephone channel; compensating the feature vectors with the bias to generate a plurality of compensated feature vectors; classifying the compensated feature vectors in response to a plurality of second parameters in acoustic models by a speech recognition section to generate a recognition result; determining a recognition loss in response to the recognition result; and adjusting the first and second parameters according to the recognition loss and an adjustment means.
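The training process based on the error between the bias and a target function could be sketched as follows. This is an illustrative toy under several assumptions that are not specified in this summary: a one-hidden-layer network with tanh units, a bias obtained by averaging the per-frame network outputs, a target equal to a known bias for each synthetic training utterance, and compensation by subtraction. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN, FRAMES = 4, 8, 50

# "First parameters": weights of a hypothetical one-hidden-layer network.
params = {
    "W1": 0.5 * rng.standard_normal((DIM, HIDDEN)),
    "c1": np.zeros(HIDDEN),
    "W2": 0.1 * rng.standard_normal((HIDDEN, DIM)),
    "c2": np.zeros(DIM),
}

def forward(params, feats):
    """Return the utterance-level bias estimate and the hidden activations."""
    hidden = np.tanh(feats @ params["W1"] + params["c1"])
    per_frame = hidden @ params["W2"] + params["c2"]
    return per_frame.mean(axis=0), hidden

def train_step(params, feats, target, lr=0.05):
    """One back-propagation update that lowers 0.5 * ||bias - target||^2."""
    bias, hidden = forward(params, feats)
    err = bias - target                                    # dLoss/dbias
    d_frame = np.tile(err / len(feats), (len(feats), 1))   # dLoss/d(per-frame output)
    d_hidden = (d_frame @ params["W2"].T) * (1.0 - hidden ** 2)
    grads = {
        "W2": hidden.T @ d_frame,
        "c2": d_frame.sum(axis=0),
        "W1": feats.T @ d_hidden,
        "c1": d_hidden.sum(axis=0),
    }
    for name in params:
        params[name] -= lr * grads[name]

def mean_loss(params, data):
    """Average squared-error loss between estimated and target bias."""
    return np.mean([0.5 * np.sum((forward(params, f)[0] - t) ** 2) for f, t in data])

# Synthetic training set: zero-mean "clean" frames plus a known per-utterance bias.
data = []
for _ in range(200):
    target = rng.uniform(-1.0, 1.0, DIM)
    data.append((rng.standard_normal((FRAMES, DIM)) + target, target))

loss_before = mean_loss(params, data)
for _ in range(20):
    for feats, target in data:
        train_step(params, feats, target)
loss_after = mean_loss(params, data)

# Compensation (assumed to be subtraction): remove the estimated bias from every frame.
feats, target = data[0]
compensated = feats - forward(params, feats)[0]
```

The recognition-loss training process would replace the squared error with a loss computed from the recognizer output and would update the acoustic-model parameters as well; that variant is not sketched here because the loss and the adjustment means are not specified in this summary.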