A conversational agent is a software program that interprets and responds to statements made by users in ordinary natural language. Examples of conversational agents include Microsoft® Cortana®, Apple® Siri®, Amazon® Alexa® and Google® Assistant®. A traditional conversational agent includes an automatic speech recognition (ASR) system that receives an audio waveform and performs feature extraction to convert the audio waveform into sequences of acoustic features. The traditional ASR system includes an acoustic model (AM) and a language model (LM). The AM determines the likelihood of a senone from these acoustic features, where each senone is a tied sub-phonetic state of a context-dependent phone (e.g., a state of a triphone), while the LM determines the a priori likelihood of a sequence of words. The ASR system uses the AM and the LM together with a pronunciation lexicon to select a maximally likely sequence of words given the input (i.e., the ASR system acts as a speech transcription engine). The sequences of text output by the ASR system are input into a natural language understanding (NLU) system, which determines a speaker's intent based on that text. The speaker's determined intent is then input into a dialog management system that determines one or more actions to perform to satisfy the determined intent.
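As a rough illustration of the pipeline described above, the following Python sketch combines AM and LM log-scores to pick a transcription, maps that text to an intent, and selects an action. Every name, score, and rule in it (for example `AM_LOG_SCORES`, `transcribe`, and the keyword-based `understand`) is a hypothetical assumption made for illustration; it is a toy stand-in, not how any particular ASR, NLU, or dialog management system is implemented.

```python
from typing import Dict, List

# Hypothetical AM log-likelihoods of the audio given each candidate word sequence.
AM_LOG_SCORES: Dict[str, float] = {
    "turn on the light": -12.0,
    "turn on the lied": -11.5,
}

# Hypothetical LM a priori log-probabilities of each candidate word sequence.
LM_LOG_SCORES: Dict[str, float] = {
    "turn on the light": -4.0,
    "turn on the lied": -9.0,
}

# Hypothetical mapping from a determined intent to an action description.
ACTIONS: Dict[str, str] = {
    "lights_on": "send_command(light, on)",
    "unknown": "ask_user_to_repeat()",
}


def transcribe(candidates: List[str]) -> str:
    """ASR step: return the candidate with the highest combined AM + LM log-score."""
    return max(candidates, key=lambda words: AM_LOG_SCORES[words] + LM_LOG_SCORES[words])


def understand(text: str) -> str:
    """NLU step: a toy keyword rule that maps transcribed text to an intent."""
    tokens = text.split()
    if "light" in tokens and "on" in tokens:
        return "lights_on"
    return "unknown"


def manage_dialog(intent: str) -> str:
    """Dialog management step: choose an action for the determined intent."""
    return ACTIONS.get(intent, ACTIONS["unknown"])


def run_pipeline(candidates: List[str]) -> str:
    """Chain the three stages: ASR, then NLU, then dialog management."""
    return manage_dialog(understand(transcribe(candidates)))
```

In this sketch the AM alone slightly prefers the acoustically similar "turn on the lied", and the LM prior is what tips the combined score toward "turn on the light", which is the role of the a priori word-sequence likelihood described above.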
Often there is insufficient real world data to properly train an ASR system and/or NLU system. Accordingly, synthetic training data is in some instances generated to train the ASR system and/or the NLU system. For example, the LM of an ASR system may be trained on a combination of real data and simulated training data. However, synthetic training data generated by a simulator is often substantially different from the real world data that the ASR system and NLU system will operate on. Such a mismatch between training data and real world data (e.g., data used in testing and/or field application) degrades performance of the ASR system and/or NLU system. Such mismatches can be caused, for example, by variability in noise, reverberation, speaker gender, age, accent, and so on. Additionally, in many situations people naturally use non-standard grammar when they speak, and make performance errors such as frequent stops, restarts, incomplete utterances, corrections, “ums”, “ands”, and so on. If the design of the NLU system is based on clear, grammatically correct, error-free speech, these errors make it very challenging for the NLU system to determine the correct speaker intent. These phenomena often cause conversational agents to incorrectly determine speaker intent or fail to determine speaker intent.
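The effect of such performance errors can be illustrated with a deliberately simple Python sketch. It assumes a hypothetical NLU whose intent templates were written only for clean, grammatical text; the template set, the `determine_intent` function, and both example utterances are invented for illustration and do not represent any real NLU system.

```python
import re
from typing import Optional

# Hypothetical intent templates designed around clean, error-free input.
TEMPLATES = {
    "set_alarm": re.compile(r"^set an alarm for (\w+)$"),
    "play_music": re.compile(r"^play some music$"),
}


def determine_intent(text: str) -> Optional[str]:
    """Return the first intent whose template matches the whole utterance, else None."""
    normalized = text.lower().strip()
    for intent, pattern in TEMPLATES.items():
        if pattern.match(normalized):
            return intent
    return None


# A clean utterance of the kind a simulator might generate.
clean = "set an alarm for seven"

# The same request with a filler, a restart, and a correction.
disfluent = "um set an set an alarm for uh no wait for seven"

clean_intent = determine_intent(clean)
disfluent_intent = determine_intent(disfluent)
```

Here `clean_intent` is `"set_alarm"`, while `disfluent_intent` is `None`: the request is the same, but the filler and restart keep the utterance from matching a template built for error-free speech.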