A variety of speech recognition systems have been developed. These systems enable computers to understand speech. This ability is useful for inputting commands or data into computers. Speech recognition generally involves two phases. The first phase is known as training. During training, the system "learns" speech by taking a large sample of speech as input and generating models of that speech. The second phase is known as recognition. During recognition, the system attempts to recognize input speech by comparing the speech to the models generated during training and finding an exact match or a best match. Most speech recognition systems have a front end that extracts features from the input speech in the form of feature vectors. These feature vectors are used to generate the models during training and are compared to the generated models during recognition.
One problem with such speech recognition systems arises when there are changes in the acoustical environment during and between training and recognition. Such changes could result, for example, from changes in the microphone used, the background noise, the distance between the speaker's mouth and the microphone, and the room acoustics. If such changes occur, recognition accuracy may degrade because the acoustical environment affects the feature vectors extracted from speech. Thus, different feature vectors may be extracted from the same speech if it is spoken in different acoustical environments. Since the acoustical environment will rarely remain constant, it is desirable for a speech recognition system to be robust to changes in the acoustical environment. A particular word or sentence should always be recognized as that word or sentence, regardless of the acoustical environment in which the word or sentence is spoken. Some attempts to solve the problem of changes in the acoustical environment have focused on normalizing the input speech feature vectors to reduce the effect of such changes.
One attempt to solve this problem is known as mean normalization. Using mean normalization, the input speech feature vector is normalized by computing the mean of all the feature vectors extracted from the entire speech and subtracting the mean from the input speech feature vector using the function:

$$\hat{x}(t) = x(t) - \frac{1}{n}\sum_{\tau=1}^{n} x(\tau)$$

where $\hat{x}(t)$ is the normalized input speech feature vector, $x(t)$ is the raw input speech feature vector, and $n$ is the number of feature vectors extracted from the entire speech.
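The mean normalization described above can be illustrated with a minimal sketch. The function name `mean_normalize`, the two-dimensional feature vectors, and the constant offset used as a simplified stand-in for an environment change are all assumptions made for illustration; they do not come from any particular recognizer.

```python
from typing import List

Vector = List[float]


def mean_normalize(features: List[Vector]) -> List[Vector]:
    """Subtract the mean over the entire speech from every feature vector."""
    if not features:
        return []
    n = len(features)
    dim = len(features[0])
    # A single mean vector is computed from all n feature vectors.
    mean = [sum(vec[d] for vec in features) / n for d in range(dim)]
    return [[vec[d] - mean[d] for d in range(dim)] for vec in features]


# Hypothetical feature vectors extracted from a three-frame utterance.
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = mean_normalize(utterance)

# The same utterance with a constant offset added to every vector
# (an assumed, simplified model of a change in acoustical environment).
shifted = [[v[0] + 5.0, v[1] - 3.0] for v in utterance]
normalized_shifted = mean_normalize(shifted)
```

In this sketch both `normalized` and `normalized_shifted` come out identical, because a constant offset shifts the mean by the same amount and is cancelled by the subtraction. Note that the mean can only be computed once all feature vectors of the speech are available.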
Another attempt to solve this problem is known as signal-to-noise-ratio-dependent ("SNR-dependent") normalization. Using SNR-dependent normalization, the input speech feature vector is normalized by computing the instantaneous SNR of the input speech and subtracting a correction vector that depends on the SNR from the input speech feature vector using the function:

$$\hat{x}(t) = x(t) - y(\mathrm{SNR})$$

where $\hat{x}(t)$ is the normalized input speech feature vector, $x(t)$ is the raw input speech feature vector, and $y(\mathrm{SNR})$ is the correction vector. The correction vectors are precomputed and stored in a look-up table with the corresponding SNRs.
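A minimal sketch of SNR-dependent normalization follows. The table contents, the decibel scale, the nearest-entry lookup rule, and the names `CORRECTION_TABLE`, `lookup_correction`, and `snr_normalize` are hypothetical choices made for illustration; the text above states only that precomputed correction vectors are stored in a look-up table with their corresponding SNRs.

```python
from typing import List, Tuple

Vector = List[float]

# Hypothetical precomputed look-up table: (SNR in dB, correction vector).
CORRECTION_TABLE: List[Tuple[float, Vector]] = [
    (0.0, [4.0, 2.0]),
    (10.0, [2.0, 1.0]),
    (20.0, [0.5, 0.25]),
]


def lookup_correction(snr_db: float, table: List[Tuple[float, Vector]]) -> Vector:
    """Return the correction vector whose stored SNR is closest to snr_db.

    Nearest-entry selection is an assumption; other lookup rules are possible.
    """
    _, correction = min(table, key=lambda entry: abs(entry[0] - snr_db))
    return correction


def snr_normalize(x: Vector, snr_db: float,
                  table: List[Tuple[float, Vector]] = CORRECTION_TABLE) -> Vector:
    """Subtract the SNR-dependent correction vector from a raw feature vector."""
    correction = lookup_correction(snr_db, table)
    return [xi - yi for xi, yi in zip(x, correction)]


frame = [5.0, 3.0]
low_snr_result = snr_normalize(frame, 2.0)    # nearest table entry: 0 dB
high_snr_result = snr_normalize(frame, 18.0)  # nearest table entry: 20 dB
```

In this sketch the correction applied to a frame varies with its instantaneous SNR, but the table itself is fixed once it has been precomputed and is never modified while speech is being processed.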
None of the prior attempts to solve the problem of changes in the acoustical environment during and between training and recognition has been very successful. Mean normalization allows the input speech feature vectors to be dynamically adjusted but is not very accurate because it computes only a single mean for all of the feature vectors extracted from the entire speech. SNR-dependent normalization is more accurate than mean normalization because it computes varying correction vectors depending on the SNR of the input speech, but it does not dynamically update the values of the correction vectors. Therefore, a solution is needed that is both accurate and able to dynamically update the values used to normalize the input speech feature vectors.