Speech recognition technology allows a user of a communications network to access a computer or a hand-held electronic device without, for example, using a keyboard to type in words. In particular, a spoken language system provides user-computer interaction, enabling natural conversations between people and machines.
A speech recognition system is roughly divided into a feature extractor (front-end) and a recognizer (back-end). The front-end algorithm converts the input speech waveform signal into feature parameters, which provide a compact representation of the input speech, while retaining the information essential for speech recognition. The back-end algorithm performs the actual recognition task, taking the feature parameters as input and performing a template-matching operation to compare the features with reference templates of the possible words, or other units of speech, to be recognized.
Typically, in a speech recognition system, the front-end is used to convey feature parameters, instead of the encoded speech waveform, to a speech recognition back-end. In particular, when speech recognition processing is carried out in a Distributed Speech Recognition (DSR) system, feature parameters require less bandwidth for radio transmission than the encoded speech waveform and, therefore, can be sent to an automatic speech recognition (ASR) server using a data channel. This eliminates the need for a high bit-rate speech channel. In embedded systems like mobile terminals, the front-end provides the speech features to the back-end in a form that is better suited for recognition than the original sampled speech.
The European Telecommunications Standards Institute (ETSI) has established the standard for DSR signal processing. In ETSI ES 201 108 V1.1.2, a standard algorithm for front-end feature extraction and the transmission of the extracted features is published. The standard algorithm calculates feature vectors with fourteen components for each 10 ms frame of speech. In particular, this ETSI publication covers the algorithm for front-end feature extraction to create Mel-Frequency Cepstral Coefficients (MFCC). While the standard algorithm, as disclosed in the ETSI publication, is designed for wireless transmission, the basic methodology is applicable to a speech recognition system embedded in a hand-held electronic device, for example. Cepstrum is a term for the Discrete Cosine Transform of the logarithm of the power spectrum of a signal, and mel-frequency warping is a process of non-linearly modifying the scale of the Fourier transform representation of the spectrum. From the mel-frequency warped Fourier transform representation of the log-magnitude spectrum, a set of cepstral coefficients or parameters is calculated to represent the speech signals. The extracted cepstral coefficients or parameters are known as feature vectors. They are conveyed to the back-end recognizer, which performs the actual probability estimation and classification in order to recognize the spoken words. Because different speakers have different voices, talking speeds, accents and other characteristics that can affect a speech recognition system, good quality feature vectors are important to ensure good speech recognition performance. Furthermore, environmental noise and distortion can also degrade the quality of the feature vectors and impair the performance of the speech recognition system.
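The mel-frequency warping referred to above is commonly realized with a logarithmic frequency mapping. The following is a minimal sketch of one widely used formula; it is offered only as an illustration, and the ETSI standard defines its own exact filterbank layout:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Commonly used mel-scale mapping (illustrative; the ETSI
    standard specifies its own exact filterbank layout)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping from mel back to linear frequency."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The mapping is roughly linear below 1 kHz and logarithmic above,
# mirroring the non-linear frequency resolution of human hearing.
```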
Currently, the performance of a speech recognition system is improved by training the acoustic models with relatively noise-free speech data to maximize the performance in clean speech conditions. FIG. 1 shows a standard MFCC front-end. As shown, the input speech is transformed by spectral conversion (FFT) into a set of spectral coefficients. The spectral coefficients are scaled by a Mel-scaling module. Typically, the front-end produces a feature vector (frame) every 10 ms. After Mel-scaling, the speech signal is represented as an N-dimensional vector (N=22), where each component corresponds to the spectral energy of one frequency band. A non-linear transform (Logarithm) is then applied to the Mel-vector components, and a Discrete Cosine Transform (DCT) is used to de-correlate the signal. A differentiator is used to obtain information between consecutive frames by taking the first and second derivatives of the vector. Finally, cepstral domain feature vector normalization is applied to reduce the mismatch between training and testing conditions.
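The per-frame chain of FIG. 1 (spectral conversion, Mel-scaling, logarithm, DCT) can be sketched as follows. This is a simplified illustration with function and parameter names of our own choosing, not the bit-exact ETSI ES 201 108 algorithm; the differentiator and cepstral normalization stages operate across frames and are omitted from this single-frame sketch:

```python
import numpy as np

def mel_filterbank(spectrum, sample_rate, n_mel):
    """Triangular filters spaced evenly on the mel scale (illustrative)."""
    n_bins = len(spectrum)
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz2mel(sample_rate / 2.0), n_mel + 2)
    bins = np.floor(mel2hz(mel_points) / (sample_rate / 2.0)
                    * (n_bins - 1)).astype(int)
    energies = np.zeros(n_mel)
    for i in range(n_mel):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, hi):
            if b < mid:
                w = (b - lo) / max(mid - lo, 1)   # rising edge
            else:
                w = (hi - b) / max(hi - mid, 1)   # falling edge
            energies[i] += w * spectrum[b]
    return energies

def dct_ii(x):
    """DCT-II, used to de-correlate the log-mel energies."""
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * i / n))
                     for i in range(n)])

def mfcc_frame(frame, sample_rate=8000, n_mel=22, n_ceps=13):
    """Static MFCCs for one frame: FFT -> Mel-scaling -> log -> DCT."""
    spectrum = np.abs(np.fft.rfft(frame))               # spectral conversion
    mel_energies = mel_filterbank(spectrum, sample_rate, n_mel)
    log_mel = np.log(np.maximum(mel_energies, 1e-10))   # non-linear transform
    return dct_ii(log_mel)[:n_ceps]                     # de-correlation

# Example: one 25 ms frame (200 samples at 8 kHz) of a 440 Hz tone.
frame = np.hanning(200) * np.sin(2 * np.pi * 440.0 * np.arange(200) / 8000.0)
ceps = mfcc_frame(frame)
```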
When this type of speech recognition system is used in a high-noise environment, such as in a car, the background noise may cause a mismatch between the acoustic models and the speech data. Currently, histogram normalization techniques are used to reduce the mismatch. In a histogram of spectral coefficients, the abscissa corresponds to the spectral values, and the ordinate values correspond to the likelihood of the corresponding spectral value. In a noisy environment, such as in a fast-moving car, the feature vectors may be changed due to noise and become different from those obtained in a quiet environment. Consequently, the shape and position of the histogram of the testing spectral signals are significantly different from those of the training spectral signals. In a front-end, as shown in FIG. 1, the changes in the features are compensated for in the cepstral domain by feature vector normalization. This method, known as cepstral domain feature vector normalization, is effective in improving noise robustness. However, it has disadvantages. When the DCT is applied to the distorted (noisy) spectral signals, the distortion spreads over all cepstral parameters. Even if the environmental noise is localized in a certain frequency band, the noise will affect all of the cepstral coefficients after the DCT process. Thus, even if cepstral domain feature vector normalization effectively removes the mismatch between different environments, the normalized signal will always retain residues of noise in all of the cepstral coefficients.
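The spreading effect described above can be demonstrated numerically: a distortion confined to a single mel band perturbs every cepstral coefficient once the DCT is applied. The following toy sketch (our own illustration with synthetic numbers, not real speech data) shows this:

```python
import numpy as np

def dct_ii(x):
    """DCT-II of a vector, as applied after the logarithm stage."""
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * i / n))
                     for i in range(n)])

n_bands = 22                 # mel bands, as in the FIG. 1 front-end
clean = np.zeros(n_bands)    # flat log-mel spectrum (toy baseline)
noisy = clean.copy()
noisy[3] += 1.0              # distortion localized in a single band

# Difference between noisy and clean cepstra: the localized
# distortion has spread over every cepstral coefficient.
delta = dct_ii(noisy) - dct_ii(clean)
print(np.count_nonzero(np.abs(delta) > 1e-9))   # → 22
```

Even after subsequent cepstral domain normalization, these residues remain distributed across all coefficients, which motivates compensating for localized noise before the DCT rather than after it.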
Mammone et al. (U.S. Pat. No. 6,038,528) discloses a speech processing method, wherein cepstral parameter normalization is based on an affine transformation of the cepstral coefficients. This method is concerned with the coefficients after the cepstral transformation and, therefore, is also susceptible to the spreading of noise energy to the components of the cepstrum.
Molau et al. (“Histogram based Normalization in the Acoustic Feature Space”, ASRU 2001 Workshop on Automatic Speech Recognition and Understanding, 2001) and Hilger et al. (“Quantile Based Histogram Equalization for Noise Robust Recognition”, EUROSPEECH 2001, pp. 1135–1138) disclose two off-line histogram normalization techniques, wherein the histogram of the training data and the histogram of the test data are required to be sent to the back-end in advance. These techniques are impractical in that additional data describing the histogram distributions must be provided. Furthermore, the method according to Hilger et al. requires a delay (between speech input and speech recognition) of one utterance, typically lasting several seconds. The method according to Molau et al. is also impractical because it requires all the data from the same test speaker.
It is advantageous and desirable to provide a speech recognition front-end with improved performance, wherein the problems associated with the spreading of noise energy can be minimized, and the delay between speech input and speech recognition is reasonably short.