Automatic speech recognition (ASR) systems are utilized in a variety of applications to automatically recognize the content of speech and, typically, to provide a textual representation of the recognized speech content. ASR systems typically utilize one or more statistical models (e.g., acoustic models, language models, etc.) that are trained using a relatively large corpus of training data. For example, speech/acoustic training data may be utilized to train one or more acoustic models. Via training, an acoustic model “learns” acoustic characteristics of the training data so as to be able to accurately identify sequences of speech units in speech data received when the trained ASR system is subsequently deployed. To achieve adequate training, relatively large amounts of training data are generally needed.
Due in part to the widespread adoption and use of ASR technology, ASR systems are frequently utilized in a variety of environments and by a wide variety of users using different audio capture devices and channels. As a result, an ASR system may be utilized in an acoustic environment wherein received speech data is ill-matched, from an acoustic characteristic perspective, to the training data on which the ASR system was trained. That is, the speech/acoustic training data used to train the corresponding acoustic model(s) may insufficiently or poorly represent acoustic characteristics of speech data received from users during deployment in a given acoustic environment. In such circumstances, the accuracy of the ASR system in recognizing such speech data will suffer, which may result in unsatisfactory performance.
Generally speaking, it is not feasible to train an acoustic model with training data that sufficiently represents or captures the acoustic characteristics of any and all arbitrary speech data that may be received by an ASR system in the variety of environments in which the ASR system may be deployed. In particular, the variety of training data that would be needed to do so is not likely available in sufficient quantity, if available at all. Indeed, it is often not possible to sufficiently train an acoustic model even for a single specific acoustic environment, because the relatively large amounts of training data representative of that environment that are conventionally needed to train an acoustic model are lacking.