Current speech recognition systems typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which subword sound units (e.g., phonemes) correspond to the speech, based on the acoustic features of the speech. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the speech, based on lexical features of the language in which the speech is spoken. The acoustic model and the language model are typically configured using training data, including transcriptions known to be correct.
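The division of labor between the two models can be illustrated with a minimal sketch. It assumes the acoustic model has already produced a short list of candidate transcriptions with log-probability scores, and it uses a toy unigram table in place of a real language model. The function names, the `lm_weight` parameter, and all numeric values are hypothetical and chosen only for illustration; they are not taken from any particular recognizer.

```python
import math

# Hypothetical toy unigram language model: per-word log-probabilities.
TOY_UNIGRAM_LOG_PROBS = {
    "recognize": math.log(0.02),
    "speech": math.log(0.03),
    "wreck": math.log(0.001),
    "a": math.log(0.05),
    "nice": math.log(0.01),
    "beach": math.log(0.005),
}
UNKNOWN_WORD_LOG_PROB = math.log(1e-6)  # assumed floor for unseen words


def toy_lm_log_prob(text):
    """Score a candidate transcription as the sum of its word log-probabilities."""
    return sum(
        TOY_UNIGRAM_LOG_PROBS.get(word, UNKNOWN_WORD_LOG_PROB)
        for word in text.split()
    )


def pick_transcription(hypotheses, lm_log_prob, lm_weight=1.0):
    """Return the hypothesis text with the highest combined score.

    hypotheses: iterable of (text, acoustic_log_prob) pairs.
    lm_log_prob: function mapping text to a language model log-probability.
    lm_weight: assumed scaling factor applied to the language model score.
    """
    best_text, best_score = None, -math.inf
    for text, acoustic_log_prob in hypotheses:
        score = acoustic_log_prob + lm_weight * lm_log_prob(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text


# Made-up acoustic scores: the acoustic model slightly prefers the first candidate.
hypotheses = [
    ("wreck a nice beach", -10.0),
    ("recognize speech", -10.5),
]
best = pick_transcription(hypotheses, toy_lm_log_prob)
```

With these made-up numbers, the acoustic scores alone slightly favor the first candidate, while the combined score favors the second, which is the kind of disambiguation the language model is meant to provide.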
In many current approaches, acoustic models are trained in a supervised manner. In one approach, supervised acoustic model training requires a human speaker to read aloud a specific sequence of known text. In another approach, the speaker's speech may be captured, transcribed, and then corrected manually by the speaker. One drawback of supervised acoustic model training, among others, is that it can be difficult, time-consuming, and expensive to acquire training data (e.g., human speech) and transcriptions known to be correct. For example, a human may be required both to produce the speech and to perform the transcription.