Automatic speech recognition (ASR) systems are used in a variety of applications to automatically recognize the content of speech and, typically, to provide a textual representation of the recognized speech content. ASR systems typically use one or more statistical models (e.g., acoustic models, language models, etc.) that are trained using a corpus of training data. For example, speech training data obtained from multiple speakers may be used to train one or more acoustic models. Via training, an acoustic model “learns” acoustic characteristics of the training data so as to be able to accurately identify sequences of speech units in speech data received when the trained ASR system is subsequently deployed. To achieve adequate training, relatively large amounts of training data are generally needed.
Acoustic models are implemented using a variety of techniques. For example, an acoustic model may be implemented using a generative statistical model such as a Gaussian mixture model (GMM). As another example, an acoustic model may be implemented using a discriminative model such as a neural network having an input layer, an output layer, and one or more hidden layers between the input and output layers. A neural network having multiple hidden layers (i.e., two or more hidden layers) between its input and output layers is referred to herein as a “deep” neural network.
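As a purely illustrative sketch, the layered structure described above, and the distinction between a neural network with one hidden layer and a “deep” neural network, may be expressed as follows. All function names, the tanh hidden activation, the softmax output, and the random weight initialization are assumptions made for the example and are not taken from any particular acoustic model implementation.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]  # one row per output unit; last column holds the bias


def build_network(layer_sizes: Sequence[int], seed: int = 0) -> List[Matrix]:
    """Create one weight matrix per pair of adjacent layers.

    layer_sizes lists the input size, each hidden layer size, and the
    output size, e.g. [4, 8, 8, 3] has two hidden layers of 8 units.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def num_hidden_layers(network: List[Matrix]) -> int:
    # N weight matrices connect N + 1 layers: input, N - 1 hidden, output.
    return len(network) - 1


def is_deep(network: List[Matrix]) -> bool:
    # "Deep" in the sense used above: two or more hidden layers.
    return num_hidden_layers(network) >= 2


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(network: List[Matrix], features: Sequence[float]) -> List[float]:
    """Map one acoustic feature vector to a distribution over output classes."""
    act = list(features)
    for i, layer in enumerate(network):
        z = [sum(w * x for w, x in zip(row[:-1], act)) + row[-1] for row in layer]
        is_output = i == len(network) - 1
        act = softmax(z) if is_output else [math.tanh(v) for v in z]
    return act
```

In this sketch, `build_network([4, 8, 8, 3])` yields a network for which `is_deep` returns true, whereas `build_network([4, 8, 3])` has a single hidden layer and is not “deep” under the definition given above.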
A speaker-independent acoustic model may be trained using speech training data obtained from multiple speakers and, as such, may not be tailored to recognizing the acoustic characteristics of speech of any one speaker. To improve speech recognition performance on speech produced by a speaker, however, a speaker-independent acoustic model may be adapted to the speaker prior to recognition by using speech data obtained from the speaker. For example, a speaker-independent GMM acoustic model may be adapted to a speaker by adjusting the values of the GMM parameters based, at least in part, on speech data obtained from the speaker. The manner in which the values of the GMM parameters are adjusted during adaptation may be determined using techniques such as maximum likelihood linear regression (MLLR) adaptation, constrained MLLR (CMLLR) adaptation, and maximum-a-posteriori (MAP) adaptation.
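By way of a non-limiting illustration, one common formulation of MAP adaptation updates only the GMM mean vectors, interpolating each speaker-independent mean with the speaker's data according to how many frames are softly assigned to that component. The sketch below assumes diagonal covariances, a single relevance factor `tau`, and adaptation of means only; the function names and the default value of `tau` are hypothetical, and other MAP variants, as well as MLLR and CMLLR, adjust the parameters differently.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def log_gauss_diag(x: Vector, mean: Vector, var: Vector) -> float:
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x: Vector, weights: Vector,
                     means: Sequence[Vector], variances: Sequence[Vector]) -> List[float]:
    """Posterior probability of each mixture component given frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(l - mx) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames: Sequence[Vector], weights: Vector,
                    means: Sequence[Vector], variances: Sequence[Vector],
                    tau: float = 10.0) -> List[List[float]]:
    """Return speaker-adapted means; weights and variances are left unchanged.

    new_mean_k = (tau * old_mean_k + sum_t gamma_k(t) * x_t) / (tau + sum_t gamma_k(t))
    """
    num_comp, dim = len(means), len(means[0])
    occupancy = [0.0] * num_comp                       # sum_t gamma_k(t)
    first_order = [[0.0] * dim for _ in range(num_comp)]  # sum_t gamma_k(t) * x_t
    for x in frames:
        gamma = responsibilities(x, weights, means, variances)
        for k in range(num_comp):
            occupancy[k] += gamma[k]
            for d in range(dim):
                first_order[k][d] += gamma[k] * x[d]
    return [
        [(tau * means[k][d] + first_order[k][d]) / (tau + occupancy[k])
         for d in range(dim)]
        for k in range(num_comp)
    ]
```

Under this formulation, a component that receives little speaker data stays close to its speaker-independent mean, while a component that receives many frames moves toward the mean of the speaker's data; with no frames at all, the means are returned unchanged.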
The data used for adapting an acoustic model to a speaker is referred to herein as “enrollment data.” Enrollment data may include speech data obtained from the speaker, for example, by recording the speaker speaking one or more utterances from a text. Enrollment data may also include information indicating the content of the speech data such as, for example, the text of the utterance(s) spoken by the speaker and/or a sequence of hidden Markov model output states corresponding to the content of the spoken utterances.
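For illustration only, enrollment data of the kind described above might be represented by a simple record holding the speech data together with the optional content information. The class name, field names, and the assumption that any hidden Markov model state sequence supplies exactly one state index per speech frame are hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnrollmentData:
    # Speech data obtained from the speaker, here as per-frame feature vectors.
    frames: List[List[float]]
    # Optional text of the utterance(s) spoken by the speaker.
    transcript: Optional[str] = None
    # Optional HMM output state indices; assumed here to be one per frame.
    state_sequence: Optional[List[int]] = None

    def has_content_information(self) -> bool:
        """True if the text and/or a state sequence accompanies the speech data."""
        return self.transcript is not None or self.state_sequence is not None

    def validate(self) -> None:
        """Check the one-state-per-frame assumption made in this sketch."""
        if self.state_sequence is not None and len(self.state_sequence) != len(self.frames):
            raise ValueError("state_sequence must have one entry per frame")
```

A record with only `frames` populated corresponds to enrollment data that includes speech data alone, while a record that also carries `transcript` and/or `state_sequence` corresponds to enrollment data that additionally indicates the content of the speech.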