Current speech recognition systems typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which subword units of sound (such as phonemes) correspond to the speech, based on the acoustic features of the speech. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the speech, based on lexical features of the language in which the speech is spoken. The acoustic model and the language model are typically generated and adapted using training data, including transcriptions known to be correct.
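The division of labor between the two models can be illustrated with a minimal sketch. It assumes the acoustic model has already produced a handful of candidate transcriptions with log-likelihoods, and that the language model assigns each candidate a log-probability. The candidate strings, the scores, the `lm_weight` parameter, and the function name `best_transcription` are hypothetical values chosen for illustration and are not taken from any particular recognizer.

```python
import math

# Hypothetical acoustic-model output: log-likelihood of the audio given each
# candidate transcription. The values are illustrative only.
acoustic_scores = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
    "recognise peach": math.log(0.15),
}

# Hypothetical language-model output: log-probability of each word sequence.
language_scores = {
    "recognize speech": math.log(0.60),
    "wreck a nice beach": math.log(0.05),
    "recognise peach": math.log(0.01),
}


def best_transcription(acoustic, language, lm_weight=1.0, floor=math.log(1e-9)):
    """Return the hypothesis with the highest combined log score.

    Hypotheses missing from the language model receive a small floor
    log-probability instead of being dropped.
    """
    def combined(hypothesis):
        return acoustic[hypothesis] + lm_weight * language.get(hypothesis, floor)

    return max(acoustic, key=combined)


if __name__ == "__main__":
    print(best_transcription(acoustic_scores, language_scores))
```

With these illustrative scores, the acoustically strongest candidate is "wreck a nice beach", but the combined score selects "recognize speech" because the language model considers that word sequence far more probable.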
Acoustic models are typically created by comparing audio recordings of speech with their corresponding textual transcriptions and then generating, based on the comparison, statistical representations of the possible sounds of the subword units in a language. Acoustic models are generally more accurate and effective in recognizing sounds when they are generated from a very large number of samples obtained through an acoustic model training process. In one approach, acoustic model training requires a human speaker to speak a specific sequence of known text. In another approach, the speaker's speech may be captured, transcribed, and then corrected manually by the speaker. One drawback of acoustic model training, among others, is that it can be difficult, time-consuming, and expensive to acquire training data (e.g., human speech) and transcriptions known to be correct. For example, a human may be required both to produce the speech and to perform the transcription.
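As a rough illustration of what "statistical representations" of subword units might look like, the sketch below fits one Gaussian per phoneme from feature values that have already been aligned to a known-correct transcription. It is a deliberate simplification made under stated assumptions: each frame is reduced to a single scalar feature, the alignment is given, and the names `train_acoustic_model`, `log_likelihood`, and `classify_frame` are hypothetical. Practical systems use multidimensional features and considerably richer models.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance

VARIANCE_FLOOR = 1e-6  # assumed floor so a phoneme with identical samples stays usable


def train_acoustic_model(aligned_frames):
    """Estimate a (mean, variance) pair per phoneme.

    aligned_frames: iterable of (phoneme_label, feature_value) pairs, where
    the labels come from a transcription known to be correct.
    """
    samples = defaultdict(list)
    for phoneme, feature in aligned_frames:
        samples[phoneme].append(feature)
    return {
        phoneme: (mean(values), max(pvariance(values), VARIANCE_FLOOR))
        for phoneme, values in samples.items()
    }


def log_likelihood(feature, gaussian):
    """Log density of a scalar feature under a one-dimensional Gaussian."""
    mu, var = gaussian
    return -0.5 * (math.log(2.0 * math.pi * var) + (feature - mu) ** 2 / var)


def classify_frame(feature, model):
    """Return the phoneme whose Gaussian best explains the feature."""
    return max(model, key=lambda phoneme: log_likelihood(feature, model[phoneme]))


if __name__ == "__main__":
    # Hypothetical training data: scalar features aligned to phoneme labels.
    training = [("AH", 1.0), ("AH", 1.2), ("AH", 0.8),
                ("S", 5.0), ("S", 5.5), ("S", 4.5)]
    model = train_acoustic_model(training)
    print(classify_frame(1.1, model), classify_frame(4.8, model))
```

Each estimate rests on only three samples per phoneme here. The more correctly transcribed speech is available, the more reliable the per-phoneme statistics become, which is why acquiring such data is the costly part of training.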