The present invention relates to speech recognition. More specifically, the present invention relates to word-specific acoustic models in a speech recognition system.
A speech recognition system receives a speech signal and attempts to decode the speech signal to identify a string of words represented by the speech signal. Conventional speech recognizers include, among other things, an acoustic model and a language model. The acoustic model models the acoustic features of speech units (such as phonemes). The language model models word order in the training data.
When the speech signal is received, acoustic features are extracted from the speech signal and compared against the models in the acoustic model to identify speech units contained in the speech signal. Once candidate words are identified from those speech units, the words are evaluated against the language model to determine the probability that a given word was spoken, given its history (or context).
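By way of illustration only, the probability of a word given its history can be sketched with a simple bigram estimate, in which the history is limited to the single preceding word. This is a minimal sketch under stated assumptions and not a description of any particular recognizer: the function names, the `<s>` sentence-start marker, and the toy corpus are all hypothetical and do not come from the present description.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count each one-word history and each (history, word) pair in a toy corpus."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical sentence-start marker used as the first history.
        words = ["<s>"] + sentence.split()
        for prev, word in zip(words, words[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts


def bigram_probability(history_counts, bigram_counts, prev, word):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    if history_counts[prev] == 0:
        return 0.0  # history never seen in the training data
    return bigram_counts[(prev, word)] / history_counts[prev]


# Hypothetical training text, for illustration only.
toy_corpus = [
    "recognize speech",
    "recognize speech today",
    "wreck a nice beach",
]
history_counts, bigram_counts = train_bigram_counts(toy_corpus)

p_speech = bigram_probability(history_counts, bigram_counts, "recognize", "speech")
p_beach = bigram_probability(history_counts, bigram_counts, "recognize", "beach")
```

In this toy corpus, "speech" follows "recognize" every time that history occurs, while "beach" never does, so the first candidate scores higher than the second. A practical language model would typically use longer histories and some form of smoothing so that unseen word sequences do not receive a probability of exactly zero.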
Conventional acoustic models, which model sub-word speech units (such as phonemes), have proven to be relatively accurate. However, it is widely known that acoustic models which model entire words, rather than only sub-word units, are more accurate in recognizing the modeled words, assuming sufficient training data. This is sometimes referred to as whole word modeling. Whole word modeling nevertheless presents significant disadvantages of its own. Perhaps one of the largest is model size. There are many thousands of words in the English language. In order to obtain a broad coverage whole word acoustic model, at least one acoustic model would need to be trained for each word. This would result in an undesirably large model, and would consume an undesirably large amount of resources during training.
Another significant difficulty presented by whole word acoustic modeling relates to training data sparseness. For example, it is widely held that in order to accurately train an acoustic model, the training data must include several hundred instances of the utterance being modeled. Given the large number of words in the English language, the amount of training data required to accurately model each word would be extremely large, and it is very doubtful that a sufficient amount of training data could be obtained to model each word.
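The scale of the training data problem can be illustrated with simple arithmetic. The following is a rough sketch only: the vocabulary size of 60,000 words and the figure of 300 instances per word are hypothetical values chosen to stand in for "a large number of words" and "several hundred instances", and neither number is taken from the present description.

```python
def required_word_instances(vocabulary_size, instances_per_word):
    """Minimum number of spoken word instances needed if every word in the
    vocabulary must appear at least instances_per_word times."""
    return vocabulary_size * instances_per_word


# Hypothetical values, for illustration only.
ASSUMED_VOCABULARY_SIZE = 60_000
ASSUMED_INSTANCES_PER_WORD = 300

total_instances = required_word_instances(
    ASSUMED_VOCABULARY_SIZE, ASSUMED_INSTANCES_PER_WORD
)
print(f"{total_instances:,} word instances required")
```

Under these assumptions, 18,000,000 word instances would be needed, and this is only a lower bound. Because words do not occur equally often in natural speech, collecting several hundred instances of each rare word would require considerably more recorded speech than this figure suggests.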
Hence, acoustic models which model sub-word speech units were developed. There are only approximately 40–50 phonemes in the English language. Therefore, the number of acoustic models required to cover the English language is relatively small. Context-dependent phones (such as triphones) have also been developed to improve accuracy. Even the number of triphone models required is drastically lower than the number of models that would be required for a broad coverage whole word acoustic model. However, as mentioned above, modeling sub-word speech units sacrifices accuracy relative to whole word modeling.