Computing devices can be used to process a user's spoken commands, requests, and other utterances into written transcriptions. In a common application, a user can speak into a microphone of a computing device, and an automated speech recognition module executing on the computing device can process the audio input and determine what the user said. Additional modules executing on the computing device can process the transcription of the utterance to determine what the user meant and/or perform some action based on the utterance.
Automated speech recognition systems typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance based on lexical features of the language in which the utterance is spoken.
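As an illustration of how the two models can work together, the following minimal sketch selects a transcription by combining a per-hypothesis acoustic score with a language model score. The function name `pick_transcription`, the `lm_weight` parameter, the dictionary keys, and all score values are hypothetical and chosen only for illustration; real systems typically search over lattices or n-best lists rather than a short list like this one.

```python
def pick_transcription(hypotheses, lm_weight=1.0):
    """Return the text of the hypothesis with the highest combined log score.

    Each hypothesis is a dict holding a log-probability from the acoustic
    model and one from the language model (both values are assumed inputs).
    """
    best = max(
        hypotheses,
        key=lambda h: h["acoustic_logprob"] + lm_weight * h["lm_logprob"],
    )
    return best["text"]


# Hypothetical scores: the second hypothesis fits the audio slightly better,
# but the language model considers it a much less likely word sequence.
hypotheses = [
    {"text": "recognize speech", "acoustic_logprob": -12.0, "lm_logprob": -4.0},
    {"text": "wreck a nice beach", "acoustic_logprob": -11.5, "lm_logprob": -9.0},
]

print(pick_transcription(hypotheses))
```

With the language model weighted in, the lexically more plausible hypothesis is selected even though its acoustic score is slightly lower.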
In some automated speech recognition systems, audio input of a user utterance is separated into time slices, referred to as frames (e.g., each frame may represent 10 milliseconds of the utterance). Each of the frames is processed using statistical methods such that the frames more closely correspond to portions of the acoustic model. This process may be referred to as normalization. In many cases, after a preliminary transcript is generated, a second speech recognition pass is performed using other statistical methods selected to maximize the likelihood of an accurate transcription. For example, a transform, such as a full covariance constrained maximum likelihood linear regression (“cMLLR”) transform, may be generated or updated based on statistics from the processing of multiple utterances. The transform is used to further process the frames such that they more closely correspond to portions of the acoustic model. This process may be referred to as speaker adaptation.
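The frame-level processing described above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the text does not name a specific normalization method, so per-dimension mean and variance normalization is used here as a stand-in; the speaker adaptation step is shown only as the application of an affine transform of the form x' = Ax + b to each feature vector, and the estimation of A and b from accumulated statistics is omitted. All function names and numeric values are hypothetical.

```python
import math


def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split audio samples into non-overlapping frames of frame_ms each."""
    frame_len = sample_rate_hz * frame_ms // 1000
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def mean_variance_normalize(features):
    """Shift and scale each feature dimension to zero mean, unit variance.

    Assumed stand-in for the normalization step; dimensions with zero
    variance are left unscaled to avoid division by zero.
    """
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in features) / n
        stds.append(math.sqrt(var) or 1.0)
    return [
        [(f[d] - means[d]) / stds[d] for d in range(dims)] for f in features
    ]


def apply_affine_transform(features, A, b):
    """Apply x' = A x + b to every feature vector (A and b are given)."""
    return [
        [sum(A[r][c] * x[c] for c in range(len(x))) + b[r] for r in range(len(A))]
        for x in features
    ]


# 30 ms of placeholder audio at 16 kHz yields three 10 ms frames.
frames = split_into_frames(list(range(480)), sample_rate_hz=16000)

# Toy two-dimensional feature vectors, one per frame (hypothetical values).
features = [[1.0, 2.0], [3.0, 2.0]]
normalized = mean_variance_normalize(features)

# Hypothetical transform parameters standing in for an estimated transform.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
adapted = apply_affine_transform(normalized, A, b)
```

In a two-pass arrangement of the kind described, the normalized features would feed the first recognition pass, and the transformed features would feed the second.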