The present invention relates to speech recognition. More specifically, the present invention relates to generating an acoustic model for a speech recognizer from one or more different corpora, such as supervised and/or unsupervised training corpora. Conventional speech recognition engines compare an input signal, representative of an utterance of speech to be recognized, against speech and language related models. The speech recognizers then output a recognition result indicative of recognized speech (recognized from the input signal) based on the comparison against the models.
Most state-of-the-art speech recognition systems include two major components in their modeling techniques. Those components include a language model and an acoustic model.
The language model models the linguistic context of lexical units, which are usually words. A popular language model for dictation is an n-gram model. In the n-gram model, the likelihood of the next word, given a history of n−1 previous words, is predicted. Another type of language model is typically used on limited domain applications. That model is a context-free grammar, and is used where the input utterance is expected to follow a more strict sequence of words than is required for a general dictation system.
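By way of illustration only, the n-gram prediction described above can be sketched for the case n=2 (a bigram model), in which the likelihood of the next word is estimated from counts of word pairs in a training text. The function names, the `<s>` start marker and the toy corpus below are assumptions made for this sketch and are not drawn from any particular recognizer; practical language models also apply smoothing, which is omitted here.

```python
from collections import Counter


def train_bigram(sentences):
    """Count one-word histories and bigrams from tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + words  # "<s>" is an assumed sentence-start marker
        for prev, nxt in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1
    return history_counts, bigram_counts


def next_word_probability(history_counts, bigram_counts, prev, nxt):
    """Unsmoothed maximum-likelihood estimate of P(nxt | prev)."""
    if history_counts[prev] == 0:
        return 0.0  # unseen history; a real model would back off or smooth
    return bigram_counts[(prev, nxt)] / history_counts[prev]


# Hypothetical toy corpus, used only to exercise the functions.
corpus = [["i", "am", "ten"], ["i", "am", "here"], ["i", "was", "there"]]
history_counts, bigram_counts = train_bigram(corpus)
p_am = next_word_probability(history_counts, bigram_counts, "i", "am")
```

In this toy corpus the word "i" is followed by "am" in two of its three occurrences, so the estimate for "am" given "i" is two thirds.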
For example, in a system where a user is expected to answer the question “how old are you?”, the system may use a context-free grammar which begins with optional words “I am” followed by a number, and then followed by optional words “years old”. Such a stricter model constrains the search space and makes the recognition task both easier and faster.
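The age grammar in this example can be sketched as follows. Because this particular grammar contains no recursion, a regular expression is sufficient to express it; a general context-free grammar would require a parser. The pattern, the function name `parse_age` and the restriction of the number to one to three digits are assumptions made for illustration, and the sketch operates on text rather than on audio.

```python
import re

# Assumed grammar: optional "i am", then a number, then optional "years old".
# The number is restricted to one to three digits purely for illustration.
AGE_GRAMMAR = re.compile(r"^(?:i am )?(\d{1,3})(?: years old)?$")


def parse_age(utterance):
    """Return the age as an int if the utterance fits the grammar, else None."""
    match = AGE_GRAMMAR.match(utterance.strip().lower())
    return int(match.group(1)) if match else None
```

An utterance such as "I am 42 years old" or simply "42" is accepted, whereas "I am fine" falls outside the grammar and is rejected. This is how the stricter model narrows the search space.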
An acoustic model models the sound produced by a human speaker. The acoustics vary partly based on the characteristics of the speaker. For example, the acoustics can vary with the identity of the speaker, the accent of the speaker, or the speaking style. However, the acoustics can vary based on other criteria as well, such as the particular microphone being used on the input end to the speech recognizer, the environment in which the speech recognizer is being used, and the application domain in which the speech recognizer is operating.
In order to generate a general acoustic model which is to be used in an application that is both speaker-independent and task-independent, a wide variety of data is used. For example, speech training data gathered from different speakers, different tasks, different microphones, etc., is simply pooled together and the parameters of the acoustic model are estimated without bias. The training corpus typically includes a plurality of different utterances represented by WAV files. Corresponding to each WAV file is a manual transcription of the words represented by the WAV file. Such a training corpus is referred to as supervised data, in that a laborious manual transcription has been performed which corresponds exactly to the words spoken in the WAV file.
However, it is well known that a speaker-dependent acoustic model (one in which the acoustic model is trained on a single speaker and used by the same speaker only) produces a word error rate that is two to three times lower than that of a speaker-independent acoustic model. Therefore, conventional dictation systems usually encourage the user to spend varying amounts of time “enrolling” himself or herself in the system. This often entails reading some pre-selected texts to the system for at least several minutes, and in many cases much longer.
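The text above does not define word error rate. A common definition is the number of word substitutions, deletions and insertions needed to turn the recognized text into the reference transcription, divided by the number of words in the reference. The sketch below computes that quantity with a standard edit-distance recurrence; the function name and the sample strings in the note that follows are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "i am ten years old" and the recognized text "i am tin years", there is one substitution and one deletion against five reference words, giving a word error rate of 0.4.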
Similarly, a task-dependent acoustic model (one in which the acoustic model is trained on only those utterances that are related to the task for which the acoustic model will be used) performs significantly better than a task-independent acoustic model. Such a system is discussed in F. Lefevre, J-L Gauvain and L. Lamel, Towards Task Independent Speech Recognition, ICASSP-2001.
In order to adapt a task-independent acoustic model to become a task-dependent acoustic model, one proposed solution has been to collect a task-dependent acoustic corpus and transcribe the acoustic corpus manually. However, sparse data presents a problem, in that collecting a sufficient amount of task-dependent data and manually transcribing it is a tedious and costly process.
Another way to adapt an acoustic model, which has been proposed in the past, is to use an existing body of closed-captioned data. Such data is referred to as “lightly supervised data” in L. Lamel, J-L Gauvain and G. Adda, Investigating Lightly Supervised Acoustic Model Training, ICASSP-2001, because the transcription generated during closed-captioning is error prone and is generally not of good quality. In addition, the closed-captioned data must be sorted through to obtain data that is task-dependent as well. A further problem with using lightly supervised data is that phrase segmentation information may not be available from the closed-captioning.
Yet another proposed solution is to collect a huge amount of task-independent data, and simply hope that enough of the data is relevant to the task at hand that the acoustic model can be adequately trained. Of course, this is uncertain and can be costly and time consuming as well.
Still a further proposed solution is to use unsupervised training data (data which has no manual transcription) and feed that data into a speech recognizer to obtain the associated transcription. However, a primary problem with using unsupervised training data is that the resulting transcription is never manually verified. Therefore, errors in the first-pass speech recognition cause incorrect parameters in the acoustic model to be updated, and render this proposed solution inefficient.
The present invention addresses one or more of the problems discussed above.