A conversational agent is a software program that interprets and responds to statements made by users in ordinary natural language. Examples of conversational agents include Microsoft® Cortana®, Apple® Siri®, Amazon® Alexa® and Google® Assistant®. A traditional conversational agent receives an audio waveform, performs feature extraction to convert the audio waveform into sequences of acoustic features, and inputs the sequences of acoustic features into an automatic speech recognition (ASR) system that includes an acoustic model (AM) and a language model (LM). The AM determines the likelihood of the mapping from these acoustic features to various hypothesized sequences of phonemes, while the LM determines the a priori likelihood of sequences of words. A decoder uses these two models together with a pronunciation lexicon to select a maximally likely sequence of words given the input (i.e., the decoder acts as a speech transcription engine). The sequences of text output by the ASR system are then input into a natural language understanding (NLU) system, which determines a speaker's intent based on that text. The speaker's determined intent is in turn input into a dialog management system that determines one or more actions to perform to satisfy the determined intent.
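By way of illustration only, the decoder's selection of a maximally likely word sequence can be sketched as choosing the hypothesis that maximizes the sum of the AM log-likelihood and the LM log-prior. The following is a minimal sketch under stated assumptions, not a description of any particular ASR system: the hypotheses, the scores, the function name `decode`, and the `lm_weight` parameter are all hypothetical, and the pronunciation lexicon and the search over phoneme sequences are omitted.

```python
# Hypothetical toy scores, chosen for illustration; not from any real system.
# Acoustic model (AM): log P(acoustic features | word sequence).
AM_LOG_LIKELIHOOD = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}

# Language model (LM): log P(word sequence), the a priori likelihood.
LM_LOG_PRIOR = {
    "recognize speech": -4.0,
    "wreck a nice beach": -9.0,
}


def decode(hypotheses, am_scores, lm_scores, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score.

    The combined score is am_scores[h] + lm_weight * lm_scores[h].
    lm_weight is an assumed tuning knob for how much the LM counts.
    """
    def combined(hypothesis):
        return am_scores[hypothesis] + lm_weight * lm_scores[hypothesis]

    return max(hypotheses, key=combined)


hypotheses = list(AM_LOG_LIKELIHOOD)
best = decode(hypotheses, AM_LOG_LIKELIHOOD, LM_LOG_PRIOR)
acoustic_only = decode(hypotheses, AM_LOG_LIKELIHOOD, LM_LOG_PRIOR, lm_weight=0.0)
print(best)
print(acoustic_only)
```

In this toy example the acoustically closer hypothesis is not the one selected once the LM prior is taken into account, which is the reason the decoder consults both models rather than the AM alone.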
Traditional conversational agents are designed to work in an open-ended domain in which the conversational agents receive inputs about a wide range of topics, determine a wide range of user intents based on the inputs, and produce a large range of outcomes based on the determined user intents. However, the ASR systems of traditional conversational agents are often error-prone and produce word-level errors that are then propagated through the NLU system, which can ultimately cause the conversational agent to incorrectly determine speaker intent or fail to determine speaker intent. For example, acoustic distortions can make it very difficult to transcribe speaker utterances correctly. Accordingly, the accuracy of conversational agents degrades when there is noise (e.g., in real-world conditions with background acoustic noise) or any other acoustic mismatch between training data and real-world data (e.g., data used in testing and/or field application) that can degrade performance of the ASR system. Such mismatches can be caused, for example, by variability in noise, reverberation, speaker gender, age, accent, and so on. Additionally, in many situations people naturally use non-standard grammar when they speak and make performance errors such as frequent stops, restarts, incomplete utterances, corrections, “ums”, “ands”, and so on, which make it very challenging for the NLU system to determine the correct speaker intent. These phenomena often cause conversational agents to incorrectly determine speaker intent or fail to determine speaker intent.