Computing devices can be used to process a user's spoken commands, requests, and other utterances into written transcriptions. In a common application, a user can speak into a microphone of a computing device, and an automatic speech recognition (“ASR”) module executing on the computing device can process the audio input and determine what the user said. Additional modules executing on the computing device can process the transcription of the utterance to determine what the user meant and/or perform some action based on the utterance. ASR modules typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance based on lexical features of the language in which the utterance is spoken.
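By way of illustration only, the interplay between the acoustic model and the language model described above can be sketched as a rescoring step: each hypothesis carries an acoustic log score, the language model contributes a log probability for the hypothesized text, and the hypothesis with the highest combined score is selected. The function name `pick_transcription`, the `lm_weight` parameter, the toy language model, and all scores below are hypothetical and are not drawn from any particular ASR module.

```python
import math


def pick_transcription(hypotheses, lm_logprob, lm_weight=1.0):
    """Return the hypothesis text with the highest combined log score.

    hypotheses: list of (text, acoustic_logprob) pairs (hypothetical format).
    lm_logprob: callable mapping text to a language-model log probability.
    lm_weight:  assumed scaling factor applied to the language-model score.
    """
    def combined(hypothesis):
        text, acoustic_logprob = hypothesis
        return acoustic_logprob + lm_weight * lm_logprob(text)

    return max(hypotheses, key=combined)[0]


# Toy language model: made-up probabilities for two candidate transcriptions.
TOY_LM = {
    "recognize speech": math.log(0.9),
    "wreck a nice beach": math.log(0.1),
}


def toy_lm_logprob(text):
    # Unseen text gets a small floor probability (an assumption of this sketch).
    return TOY_LM.get(text, math.log(1e-6))


# The acoustic scores slightly favor the wrong hypothesis.
hypotheses = [("wreck a nice beach", -10.0), ("recognize speech", -10.5)]

best = pick_transcription(hypotheses, toy_lm_logprob)
acoustic_only = pick_transcription(hypotheses, toy_lm_logprob, lm_weight=0.0)
```

In this toy example the acoustic scores alone favor one hypothesis, while adding the language-model score favors the other, which is the role the language model plays in the paragraph above.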
ASR modules are generally of two types: speaker-independent, and speaker-specific or environment-specific. In speaker-independent ASR modules, models are trained with data from multiple speakers. In speaker-specific or environment-specific ASR modules, models are trained with data from individual users or environments. Such systems identify individual users or environments as part of their operation. For example, individual users or environments can be identified from spoken utterances. In a simple case, a user can identify himself or herself by name or by using some other identifier. In other cases, the ASR process generates a transcript of the user's utterance that is used to determine the speaker's identity. A user can also be identified using acoustic models customized for speaker identification. Such speaker-specific models are trained to maximize the likelihood scores for specific users when processing utterances made by those users. The likelihood scores indicate the probabilities that particular utterances were actually made by the user. ASR modules that use such speaker-specific models commonly utilize hidden Markov models combined with Gaussian mixture models (“HMM-GMM”) for vocabulary tasks. In some cases, instead of Gaussian mixture models (“GMMs”), artificial neural networks (“NNs”), including deep neural networks, may be used with HMMs to perform such tasks. A neural network used with an HMM is referred to as an HMM-NN. A GMM or NN acoustic model can be trained as a speaker-independent acoustic model by using data from a multitude of speakers. A speaker-specific acoustic model may then be adapted or derived from the speaker-independent acoustic model. Adapting or deriving a speaker-specific acoustic model from a speaker-independent acoustic model requires less data and training time than training a speaker-specific acoustic model from scratch.
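As a minimal sketch of the two ideas above, the following example derives speaker-specific models from a speaker-independent GMM and then identifies a speaker by comparing likelihood scores. Several simplifying assumptions apply: the features are one-dimensional, only the component means are adapted (a means-only MAP-style update, chosen here for illustration and not stated in the text above), no HMM is involved, and the `relevance` parameter, speaker names, and data values are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def component_log_terms(x, gmm):
    """Per-component log(weight * density) for one frame."""
    return [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]


def gmm_loglik(frames, gmm):
    """Average per-frame log-likelihood of frames under a GMM."""
    total = 0.0
    for x in frames:
        terms = component_log_terms(x, gmm)
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)


def adapt_means(frames, gmm, relevance=2.0):
    """Shift each component mean toward the speaker's data (means-only sketch)."""
    counts = [0.0] * len(gmm)
    sums = [0.0] * len(gmm)
    for x in frames:
        terms = component_log_terms(x, gmm)
        peak = max(terms)
        post = [math.exp(t - peak) for t in terms]
        norm = sum(post)
        for k, p in enumerate(post):
            counts[k] += p / norm      # soft count for component k
            sums[k] += p / norm * x    # soft sum of frames for component k
    adapted = []
    for (w, m, v), n, s in zip(gmm, counts, sums):
        alpha = n / (n + relevance)    # more data moves the mean further
        data_mean = s / n if n > 0 else m
        adapted.append((w, alpha * data_mean + (1 - alpha) * m, v))
    return adapted


def identify_speaker(frames, speaker_models):
    """Return the speaker whose model gives the highest likelihood score."""
    return max(speaker_models, key=lambda name: gmm_loglik(frames, speaker_models[name]))


# Speaker-independent model as (weight, mean, variance) components: toy values.
speaker_independent = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]

# A handful of enrollment frames per (hypothetical) speaker.
enrollment = {
    "alice": [-2.1, -1.9, -2.0, 0.4, 0.6],
    "bob": [0.9, 1.1, 3.0, 2.9, 3.1],
}

speaker_models = {
    name: adapt_means(frames, speaker_independent)
    for name, frames in enrollment.items()
}

guess_a = identify_speaker([-2.0, 0.5, -1.8], speaker_models)
guess_b = identify_speaker([1.0, 3.0, 2.8], speaker_models)
```

Only five frames per speaker are used here because the adapted model starts from the speaker-independent parameters rather than from nothing, which mirrors the data and training-time advantage noted above. A practical system would use multidimensional features and far more data.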