Automatic speech recognition (ASR) systems decode a stream of acoustic speech and transcribe it into a sequence of words or text. ASR systems are generally built on classifiers that use a combination of acoustic models and language models to perform that transcription. For an ASR system to achieve improved performance, these acoustic and language models must typically be generated from training data that closely matches the operational environment or scenario in which the ASR system will be used. Relevant factors may include, for example, the speaker's acoustic profile, the context of the conversation, and the subject matter or content domain of the conversation. Unfortunately, these factors typically vary dynamically over time, while existing ASR systems are generally limited to static acoustic and language models, resulting in relatively poor recognition performance. This is particularly true when recognizing conversational speech between humans, as opposed to human-to-machine speech (e.g., speech directed to a smartphone digital assistant).
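The combination of acoustic and language models described above can be sketched as follows. This is a minimal, hypothetical illustration (the candidate phrases and probability values are invented for this example, not taken from any real system): a decoder picks the transcription W that maximizes the combined score log P(audio | W) + log P(W), where the first term comes from the acoustic model and the second from the language model.

```python
import math

# Hypothetical log-probability scores, for illustration only.
# Acoustic model: how well each candidate matches the audio, P(audio | W).
acoustic_log_probs = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.45),
}
# Language model: prior likelihood of each word sequence, P(W).
language_log_probs = {
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.001),
}

def decode(candidates, lm_weight=1.0):
    """Return the candidate maximizing the combined score:
    log P(audio | W) + lm_weight * log P(W)."""
    return max(
        candidates,
        key=lambda w: acoustic_log_probs[w] + lm_weight * language_log_probs[w],
    )

best = decode(acoustic_log_probs.keys())
print(best)  # the language model prior overrides the acoustically closer candidate
```

Note how the language model resolves the ambiguity: although "wreck a nice beach" scores higher acoustically, the language model's prior favors "recognize speech". This also illustrates the problem raised above: if the language model were trained on mismatched content (e.g., a coastal-engineering domain), its static priors would push the decoder toward the wrong hypothesis.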
Although the following Detailed Description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art.