Referring to FIG. 8, in a typical speech recognition engine 1000, a signal 1002 corresponding to speech 1004 is fed into a front end module 1006. The front end module 1006 extracts feature data 1008 from the signal 1002. The feature data 1008 is input to a decoder 1010, which outputs recognized speech 1012. The recognized speech 1012 is then available as an input to an application 1014.
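By way of illustration only, the data flow of FIG. 8 can be sketched in Python as follows. The function names, the frame size, and the choice of per-frame log energy as the feature data are assumptions made for this sketch and are not taken from the engine described above; the toy decoder is a hypothetical stand-in for the decoder 1010.

```python
import math
from typing import Callable, List, Sequence


def front_end(signal: Sequence[float], frame_size: int = 4) -> List[float]:
    """Toy front end: split the signal into non-overlapping frames and
    return one log-energy value per frame (assumed feature type)."""
    features = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        energy = sum(sample * sample for sample in frame)
        # A small constant avoids log(0) on silent frames.
        features.append(math.log(energy + 1e-10))
    return features


def toy_decoder(features: Sequence[float]) -> str:
    """Hypothetical stand-in for a decoder: labels the input 'speech'
    if any frame has positive log energy, otherwise 'silence'."""
    return "speech" if any(f > 0.0 for f in features) else "silence"


def recognize(
    signal: Sequence[float],
    decoder: Callable[[Sequence[float]], str],
    application: Callable[[str], None],
    frame_size: int = 4,
) -> str:
    """Signal -> front end -> decoder -> application."""
    features = front_end(signal, frame_size)
    recognized = decoder(features)
    application(recognized)  # recognized speech is handed to the application
    return recognized
```

In this sketch the application is any callable that accepts the recognized text, for example the `append` method of a list that collects results.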
An acoustic model 1018 and a language model 1020 also supply inputs to the decoder 1010. Generally, the acoustic model 1018, also called a voice model, identifies the phonemes to which the feature data 1008 most likely correlate. The language model 1020 consists of certain context-dependent text elements, such as words, phrases and sentences, based on assumptions about what a user is likely to say. The language model 1020 cooperates with the acoustic model 1018 to assist the decoder 1010 in further constraining the possible options for the recognized speech 1012.
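One common way for a decoder to combine the two models, shown here only as a minimal sketch, is to add an acoustic log score to a weighted language model log probability for each candidate and keep the highest total. The scoring scheme, the function name `pick_best_hypothesis`, and the `lm_weight` parameter are assumptions for illustration; the description above does not specify how the decoder 1010 combines its inputs.

```python
import math
from typing import Dict, Optional


def pick_best_hypothesis(
    acoustic_log_scores: Dict[str, float],
    language_log_probs: Dict[str, float],
    lm_weight: float = 1.0,
) -> Optional[str]:
    """Return the candidate text with the highest combined score.

    Candidates absent from the language model are skipped, which is how
    this sketch models the language model constraining the options.
    """
    best_text = None
    best_score = -math.inf
    for text, acoustic in acoustic_log_scores.items():
        lm = language_log_probs.get(text)
        if lm is None:
            continue  # not an expected utterance; rule it out
        total = acoustic + lm_weight * lm
        if total > best_score:
            best_text, best_score = text, total
    return best_text
```

With a weight of zero the language model only filters candidates and the acoustic score alone decides among those that remain.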
Referring to FIG. 9, by methods known in the art, the acoustic model 1018 is trained by a training system 1110. The training system 1110 includes a training module 1112 using a phoneme set 1114, a dictionary 1116 and a training data set. The dictionary 1116 includes a plurality of text elements, such as words and/or phrases, and their phonetic spellings using phonemes from the phoneme set 1114. The training data set includes an audio file set 1118 including a plurality of audio files, such as wave files of recorded speech, and a transcription set 1120 including a plurality of transcriptions corresponding to the recorded speech in the audio files. Typically, the transcriptions are grouped into a single transcription file including a plurality of lines of transcribed text, each line of transcribed text including the name of the corresponding audio file.
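The relationship between the transcription file and the dictionary can be illustrated with the following sketch. The line layout (transcribed text followed by the audio file name in parentheses), the function names, and the representation of the dictionary as a mapping from words to phoneme lists are all assumptions chosen for illustration; actual training systems may use different layouts.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed line layout: "<transcribed text> (<audio file name>)"
LINE_PATTERN = re.compile(r"^(?P<text>.*\S)\s*\((?P<audio>[^()]+)\)\s*$")


def parse_transcription_file(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (audio file name, transcribed text) pairs, one per line."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"unrecognized transcription line: {line!r}")
        pairs.append((match.group("audio"), match.group("text")))
    return pairs


def missing_dictionary_entries(
    pairs: Iterable[Tuple[str, str]],
    dictionary: Dict[str, List[str]],
) -> List[str]:
    """List transcript words that have no phonetic spelling in the dictionary."""
    missing = set()
    for _audio, text in pairs:
        for word in text.lower().split():
            if word not in dictionary:
                missing.add(word)
    return sorted(missing)
```

A check of this kind is useful before training because a transcript word with no phonetic spelling cannot be related to phonemes from the phoneme set.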
In practice, the textual content of the training data set, represented by the transcriptions, is generally selected to cover a wide range of related applications. The resultant acoustic model can then be “shared” by all the related applications. While this approach saves the expense of training a separate acoustic model for each individual application, there is a corresponding loss in accuracy for the individual applications.