Computing devices can be used to process a user's spoken commands, requests, and other utterances into written transcriptions. In a common application, a user can speak into a microphone of a computing device, and an automatic speech recognition module executing on the computing device can process the audio input and determine what the user said. Additional modules executing on the computing device can process the transcription of the utterance to determine what the user meant and/or perform some action based on the utterance.
Automatic speech recognition systems typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance based on lexical features of the language in which the utterance is spoken.
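The division of labor between the two models can be illustrated with a minimal sketch. It is a simplification under stated assumptions: the acoustic model's output is represented as word-level candidate transcriptions rather than subword units, the scores and the bigram table are made-up values, and the names `ACOUSTIC_LOG_SCORES`, `BIGRAM_LOG_PROBS`, `language_model_score`, and `best_transcription` are hypothetical.

```python
import math

# Hypothetical acoustic-model output: candidate transcriptions with the
# log-likelihood of the audio given each candidate (made-up values).
ACOUSTIC_LOG_SCORES = {
    "recognize speech": -12.1,
    "wreck a nice beach": -11.8,
}

# Hypothetical bigram language model: log P(word | previous word).
BIGRAM_LOG_PROBS = {
    ("<s>", "recognize"): math.log(0.02),
    ("recognize", "speech"): math.log(0.30),
    ("<s>", "wreck"): math.log(0.005),
    ("wreck", "a"): math.log(0.10),
    ("a", "nice"): math.log(0.02),
    ("nice", "beach"): math.log(0.01),
}

# Fallback log-probability for word pairs missing from the table (assumption).
UNSEEN_LOG_PROB = math.log(1e-6)


def language_model_score(text):
    """Sum bigram log-probabilities over the words of a candidate."""
    words = ["<s>"] + text.split()
    return sum(
        BIGRAM_LOG_PROBS.get(pair, UNSEEN_LOG_PROB)
        for pair in zip(words, words[1:])
    )


def best_transcription(acoustic_scores, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def combined(hypothesis):
        return acoustic_scores[hypothesis] + lm_weight * language_model_score(hypothesis)

    return max(acoustic_scores, key=combined)


print(best_transcription(ACOUSTIC_LOG_SCORES))
```

With these illustrative numbers, the acoustic scores alone slightly favor "wreck a nice beach", but adding the language model score makes "recognize speech" the selected transcription. Setting `lm_weight=0.0` disables the language model and shows the acoustic-only choice.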
In some automatic speech recognition systems, users can be identified from spoken utterances. In a simple case, a user may identify himself or herself by name or by using some other identifier, and the automatic speech recognition process generates a transcript that is then used to determine the speaker's identity. In other cases, a user may be identified by building and using customized acoustic models for speaker identification. Such models are trained to maximize the likelihood scores for specific users when processing utterances made by those users.
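Likelihood-based speaker identification can be sketched as follows. This is an illustrative assumption, not a description of any particular system: each speaker's customized model is reduced to a single diagonal Gaussian fitted by maximum likelihood to that speaker's feature frames, the two-dimensional feature values are made up, and the names `train_speaker_model`, `log_likelihood`, and `identify_speaker` are hypothetical. Practical systems generally use much richer models and features.

```python
import math


def train_speaker_model(frames):
    """Fit a per-dimension Gaussian (maximum likelihood) to one speaker's frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [
        # Floor the variance so a tiny sample cannot produce a zero variance.
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-3)
        for d in range(dims)
    ]
    return means, variances


def log_likelihood(model, frames):
    """Total log-likelihood of the frames under a speaker's Gaussian model."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, mu, var in zip(frame, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def identify_speaker(models, frames):
    """Return the speaker whose model assigns the utterance the highest score."""
    return max(models, key=lambda name: log_likelihood(models[name], frames))


# Made-up enrollment features for two hypothetical speakers.
enrollment = {
    "alice": [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]],
    "bob": [[3.0, 0.5], [3.2, 0.4], [2.9, 0.6]],
}
models = {name: train_speaker_model(frames) for name, frames in enrollment.items()}

print(identify_speaker(models, [[1.1, 1.9], [1.0, 2.0]]))
```

Because each model is fitted to one speaker's own data, it tends to score that speaker's utterances higher than the other speakers' models do, which is what the final comparison relies on.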