The present invention relates to adaptive training for speech recognition systems. In particular, the present invention relates to unsupervised adaptive training.
Speech recognition systems identify words in speech signals. To do this, most speech recognition systems compare the speech signal to models associated with small acoustic units that form all speech. Each comparison generates a likelihood that a particular segment of speech corresponds to a particular acoustic unit.
The acoustic models found in most speech recognition systems are trained using speech signals collected in an environment that differs from the environment in which the speech recognition system is later used. In particular, the speakers, microphones, and noise levels used during training are almost always different from the speaker, microphone, and noise level that are present when the speech recognition system is actually used.
It has been recognized that the differences between the training data and the actual data (usually referred to as test data) used during recognition degrade the performance of the speech recognition system.
One technique that has been used to address the differences between the training data and the test data is to adaptively change the acoustic models based on a collection of test data. Thus, a model that is initially trained on training data is modified based on actual speech signals generated while the speech recognition system is being used in the field.
Two types of adaptation have been used in the past: supervised adaptation and unsupervised adaptation. In supervised adaptation, the user reads from a script during an enrollment session. The system then uses the user's speech signal to adjust the models for the various acoustic units represented in the script. Although supervised adaptation is generally considered more accurate than unsupervised adaptation, the enrollment session is tedious for users.
In unsupervised adaptation, the system adapts the acoustic model based on the user's normal use of the speech recognition system. Because the system has no way to predict what the user will say, it does not have an exact transcript of the speech signal. Instead, the system uses the acoustic model to decode the speech signal and thereby form the transcript. This decoded transcript is then used to update the model.
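The decode-then-update loop described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the "acoustic model" is reduced to one mean value per acoustic unit, each "frame" is a single number, decoding is a nearest-mean lookup, and the names `decode`, `adapt`, and `rate` are hypothetical. It is not the method of any particular recognizer; it only shows that the transcript used for the update comes from the model itself.

```python
from typing import Dict, List


def decode(model: Dict[str, float], frames: List[float]) -> List[str]:
    """Label each frame with the unit whose mean is closest.

    This nearest-mean lookup is a stand-in (assumption) for real decoding.
    """
    return [min(model, key=lambda unit: abs(model[unit] - x)) for x in frames]


def adapt(model: Dict[str, float], frames: List[float], rate: float = 0.5) -> Dict[str, float]:
    """Move each unit's mean toward the frames the model itself assigned to it."""
    transcript = decode(model, frames)  # no reference script is available
    adapted = dict(model)
    for unit in model:
        assigned = [x for x, label in zip(frames, transcript) if label == unit]
        if assigned:
            target = sum(assigned) / len(assigned)
            # Interpolate between the old mean and the mean of the assigned frames.
            adapted[unit] = (1 - rate) * model[unit] + rate * target
    return adapted


model = {"a": 0.0, "b": 10.0}    # hypothetical initial means from training data
frames = [1.0, 2.0, 9.0, 12.0]   # hypothetical frames from normal use
adapted = adapt(model, frames)
print(decode(model, frames))
print(adapted)
```

Because the labels come from the same model that is being updated, any decoding error feeds directly into the update, which is one way to see why supervised adaptation is generally considered more accurate.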
One major problem with unsupervised adaptation is that it requires a significant amount of time and data. In particular, in most prior art systems, the digital input speech signal, or features derived from the speech signal, must be stored until there is enough speech for adaptive training. Because it is difficult to predict the length of an utterance, it is difficult to estimate the size of the digitized speech signal, and the systems therefore cannot accurately predict how much storage space will be needed for the speech data. As a result, the system must either be equipped to handle a disk-full error at any time during the speech storage stage or reserve enough disk space to handle the worst-case size of the .WAV files that hold the digitized speech. Since it is undesirable for applications to reserve more disk space than they absolutely need, such an overestimation of the space needed for the digitized speech signal should be avoided.
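To make the storage problem concrete, the following sketch estimates the size of uncompressed speech audio as a function of utterance length. The sample rate, sample width, channel count, and utterance lengths are assumptions chosen for illustration, not values taken from any particular system, and the function name `pcm_bytes` is hypothetical. File header overhead is ignored.

```python
def pcm_bytes(seconds: int,
              sample_rate_hz: int = 16_000,   # assumed sample rate
              bytes_per_sample: int = 2,      # assumed 16-bit samples
              channels: int = 1) -> int:      # assumed mono input
    """Approximate size in bytes of uncompressed PCM audio of the given length."""
    return seconds * sample_rate_hz * bytes_per_sample * channels


# Hypothetical utterance lengths: the required space scales directly with
# a duration the system cannot know in advance.
for seconds in (2, 30, 300):
    print(f"{seconds:>4} s -> {pcm_bytes(seconds):,} bytes")
```

Under these assumptions every second of speech costs 32,000 bytes, so a reservation sized for the longest plausible utterance is far larger than what a typical short utterance needs.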
The time required to perform the training is dominated by the step of aligning each frame of speech with a particular acoustic unit found in the transcription. The time needed to perform this alignment typically grows with the square of the number of frames that need to be aligned. Thus, a system is needed that reduces the time needed to align frames of speech data.
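The effect of the quadratic relationship stated above can be shown with simple arithmetic. The sketch below assumes the alignment cost is exactly proportional to the square of the frame count, with the constant of proportionality dropped; the function name `relative_alignment_cost` and the frame counts are hypothetical.

```python
def relative_alignment_cost(num_frames: int) -> int:
    """Relative cost of aligning num_frames frames, assuming cost ~ num_frames squared."""
    return num_frames * num_frames


# Each doubling of the number of frames quadruples the relative cost.
for num_frames in (100, 200, 400):
    print(num_frames, relative_alignment_cost(num_frames))
```

Under this assumption, the alignment time rises much faster than the amount of stored speech, which is why the alignment step dominates the training time.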