The present invention relates to speech recognition. In particular, the present invention relates to modeling the performance of speech recognition systems.
In speech recognition, an acoustic signal is converted into a sequence of words using an acoustic model and a language model. The acoustic model converts features of the acoustic signal into possible sequences of sub-word speech units, such as phones, each with an associated probability. The language model provides probability distributions for the various sequences of words that can be formed from the sequences of phones identified by the acoustic model.
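By way of illustration only, the way the two models are combined is commonly written as a single maximization. The symbols below are assumptions introduced for this sketch and do not appear elsewhere in this description: A denotes the features of the acoustic signal, W denotes a candidate word sequence, P(A | W) is the score contributed by the acoustic model, and P(W) is the score contributed by the language model.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} P(A \mid W)\, P(W)
```

The second equality follows from Bayes' rule, because P(A) does not depend on W and can be dropped from the maximization.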
Acoustic models are typically trained by having a speaker read from a known text and then crafting the acoustic model so that it predicts the training text from the training speech. Language models are typically trained from large corpora of text by simply identifying the probability of various word sequences in the corpora.
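As a minimal sketch of this kind of language model training, the following example estimates the probability of each word given the preceding word by counting word pairs in a small text corpus. The use of word pairs (a bigram model), the function name train_bigram_model, the sentence boundary markers and the three-sentence corpus are all assumptions made for illustration; practical language models use far larger corpora and typically apply smoothing, which is omitted here.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency.

    Hypothetical helper for illustration only: no smoothing is applied,
    so word pairs that never occur in the corpus get no entry at all.
    """
    context_counts = Counter()  # how often each word appears as a predecessor
    bigram_counts = Counter()   # how often each (previous, word) pair appears
    for sentence in sentences:
        # "<s>" and "</s>" are assumed sentence-start and sentence-end markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


# Toy corpus, invented for this example.
corpus = ["call home", "call the office", "send the message"]
model = train_bigram_model(corpus)
```

In this toy corpus, "call" is followed once by "home" and once by "the", so the estimate stored under ("call", "home") is one half. Nothing in this procedure consults the acoustic model, which is the limitation discussed below.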
The performance of the resulting speech recognition system depends in part on the training text used to train the acoustic model and the language model. As a result, the speech recognition system will perform better in some task domains than in others. To determine how a speech recognition system will work in a particular task domain, someone must speak the words that a user is expected to use when performing the task, thereby generating acoustic data that can be decoded by the system. Hiring people to generate a sufficient amount of acoustic data to determine the performance of the speech recognition system is expensive and forms a barrier to developing speech-enabled computer applications.
In addition, because it is expensive to produce acoustic data, such data has not been generated for all of the corpora used to train the language model. As a result, the language model has typically been trained without examining how the acoustic model will perform on the language model corpora. Thus, it would be beneficial to have a system that allows a corpus of text to be used in measuring the performance of the combination of an acoustic model and a language model without the need for acoustic data. This would allow for discriminative training of language models in combination with acoustic models.