The invention relates to channel normalization for automatic speech recognition.
The recognition performance (e.g., accuracy) of automatic speech recognition systems can be adversely affected by variability of the communication channel. Sources of variability include the speaker (e.g., vocal tract geometry, glottal excitation), the transmission channel (e.g., the variable position and direction to the microphone, room acoustics, ambient noise), and the use of microphones with different characteristics. In order to reduce the influence of the communication channel on the recognition performance, numerous schemes have been proposed. One such technique normalizes the recognition feature vector of cepstral coefficients such that each feature dimension feature[i] has zero mean and unit variance with respect to time t. This technique is typically applied using K cepstral coefficients (or mel-frequency cepstral coefficients) cepstrum[i] and their first and second order derivatives (Δcepstrum[i] and ΔΔcepstrum[i]) to calculate normalized recognition features:

feature[i] = (cep[i] − μ[i]) / σ[i]   for 0 ≤ i < 3K

with, for 0 ≤ i < K:

cep[i] = cepstrum[i]
cep[i+K] = Δcepstrum[i]
cep[i+2K] = ΔΔcepstrum[i]

where μ[i] is the mean of cep[i] with respect to time t, and σ²[i] is the variance of cep[i] with respect to time t.
The cepstral mean normalization (i.e., subtraction of μ[i]) allows the removal of a stationary and linear, though unknown, channel transfer function. The cepstral variance normalization (i.e., division by σ[i]) helps to compensate for the reduction of the variance of the cepstral coefficients due to additive noise.
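The mean and variance normalization described above can be illustrated with a minimal sketch. It is not taken from any particular recognizer; the function name `normalize_utterance`, the variance floor `eps`, and the toy data are assumptions made for illustration. The sketch relies on the standard property that a stationary linear channel appears as an approximately constant additive offset in the cepstral domain, so a fixed offset per dimension stands in for the channel here.

```python
import math

def normalize_utterance(cep, eps=1e-8):
    """Return features with zero mean and unit variance over time.

    cep: list of T frames, each a list of values cep[i], 0 <= i < 3K.
    eps: floor on sigma[i] to avoid division by zero (an assumption).
    """
    num_frames = len(cep)
    dim = len(cep[0])
    # mu[i]: mean of cep[i] with respect to time
    mu = [sum(frame[i] for frame in cep) / num_frames for i in range(dim)]
    # sigma[i]: square root of the variance of cep[i] with respect to time
    sigma = [
        math.sqrt(sum((frame[i] - mu[i]) ** 2 for frame in cep) / num_frames)
        for i in range(dim)
    ]
    return [
        [(frame[i] - mu[i]) / max(sigma[i], eps) for i in range(dim)]
        for frame in cep
    ]

# Toy data: 4 frames, 2 feature dimensions (hypothetical values).
clean = [[1.0, -2.0], [3.0, 0.0], [5.0, 2.0], [3.0, 4.0]]
# A constant cepstral offset models a stationary linear channel.
channel = [0.7, -1.3]
shifted = [[x + c for x, c in zip(frame, channel)] for frame in clean]

a = normalize_utterance(clean)
b = normalize_utterance(shifted)
# a and b agree up to rounding: the mean subtraction removed the offset.
```

In this sketch the statistics are computed over the whole input, so the result is only available once all frames have been seen.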
The amount of time over which to base the estimation of the channel characteristics can affect the performance of the speech recognizer. If the time window is chosen too long, the channel may not be considered stationary anymore. If the time window is chosen too short, the particular phonetic content of the speech segment can bias the estimation of the channel characteristics. As a compromise, many recognition systems estimate the channel based on a complete utterance of speech. Depending on the processing speed of the recognition system, this utterance-based normalization can lead to undesirable system delays, since processing of the utterance does not start until the utterance has ended. Time-synchronous (or online processing) schemes typically utilize some type of recursive realization of the channel normalization, in which the long-term estimates for the mean and variance of the cepstral features are incrementally updated in time t, every τ = 10 to 20 msec:

μ[i,t] = α μ[i,t−τ] + (1−α) cep[i,t]
σ²[i,t] = α σ²[i,t−τ] + (1−α) (cep[i,t] − μ[i,t])²
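A recursive realization of this kind can be sketched as follows. The class name `OnlineNormalizer`, the value of the smoothing factor `alpha`, the initial estimates, and the variance floor `eps` are all assumptions chosen for illustration; the text above does not specify them. Each call to `update` corresponds to one frame step τ, and the frame is normalized with the estimates available at that time, which is likewise an assumption about how the estimates are applied.

```python
import math

class OnlineNormalizer:
    """Time-synchronous mean/variance normalization (illustrative sketch)."""

    def __init__(self, dim, alpha=0.995, eps=1e-8):
        self.alpha = alpha        # smoothing factor (hypothetical value)
        self.mu = [0.0] * dim     # initial mean estimate (assumption)
        self.var = [1.0] * dim    # initial variance estimate (assumption)
        self.eps = eps            # floor on the variance (assumption)

    def update(self, frame):
        """Update the estimates with one frame and return it normalized."""
        a = self.alpha
        out = []
        for i, x in enumerate(frame):
            # mu[i,t] = alpha * mu[i,t-tau] + (1 - alpha) * cep[i,t]
            self.mu[i] = a * self.mu[i] + (1.0 - a) * x
            # var[i,t] = alpha * var[i,t-tau] + (1 - alpha) * (cep[i,t] - mu[i,t])^2
            self.var[i] = a * self.var[i] + (1.0 - a) * (x - self.mu[i]) ** 2
            out.append((x - self.mu[i]) / math.sqrt(max(self.var[i], self.eps)))
        return out

# One feature dimension alternating between 4.0 and 6.0 (toy input).
norm = OnlineNormalizer(dim=1, alpha=0.99)
for t in range(2000):
    x = 4.0 if t % 2 == 0 else 6.0
    last = norm.update([x])
# After many frames the estimates settle near mean 5 and variance 1.
```

A larger `alpha` corresponds to a longer effective time window, which reflects the trade-off between channel stationarity and phonetic bias discussed above.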
Non-speech segments represent another complicating factor during channel estimation. Since the transmission channel separates the speaker from the microphone, the effect of the transmission channel only becomes auditorily apparent during speech segments. Consequently, a variable ratio of non-speech segments to speech segments will have a profound effect upon the estimated channel characteristics. Enforcing a fixed ratio, however, is hampered by the uncertainties involved in differentiating between speech and non-speech segments.
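The sensitivity to the non-speech ratio can be seen in a small numerical sketch. The values below are hypothetical and reduce each segment type to a single constant cepstral value, which is a strong simplification; the helper name `mean_estimate` is likewise an assumption.

```python
def mean_estimate(speech_value, silence_value, num_speech, num_silence):
    """Mean of one cepstral dimension over a mix of speech and non-speech frames."""
    frames = [speech_value] * num_speech + [silence_value] * num_silence
    return sum(frames) / len(frames)

# Same channel and same speech, but different amounts of silence.
mostly_speech = mean_estimate(10.0, 2.0, num_speech=80, num_silence=20)
mostly_silence = mean_estimate(10.0, 2.0, num_speech=20, num_silence=80)
# The estimated mean moves from about 8.4 to about 3.6,
# although the channel itself has not changed.
```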