Spoken language processing systems include various modules and components for receiving speech input from a user, determining what the user said, and responding to what the user said. In some implementations, a spoken language processing system includes an automatic speech recognition (“ASR”) module that receives audio input of a user utterance and generates one or more likely transcriptions of the utterance. Spoken language processing systems may also include a natural language understanding (“NLU”) module that receives textual input, such as a transcription of a user utterance, and determines the meaning of the text in a way that can be acted upon, such as by a computer application. Spoken language processing systems may also include an output generator (“OG”) that manages interaction of a user with the system, prompts the user for information that may be required to execute various applications or perform various functions, generates responses, and provides the user with outputs corresponding to the responses or to other user input. Mistakes in speech processing can lead to erroneous responses.
A client device with an output generator may facilitate the playback and/or display of content, such as audio books, electronic books (also referred to as e-books), songs, videos, television programs, computer and video games, multi-media content, and the like. For example, a user of a client device may make a spoken utterance requesting, “Play ‘Fly Me to the Moon.’” Audio of the spoken command can be transcribed by the ASR module. The NLU module can determine the user's intent (e.g., that the user wants a certain song played) from the transcription. The output generator may then generate a response to the user's request, which may include initiating various applications or performing various functions.
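The flow described above, from utterance to transcription to intent to response, can be illustrated with a minimal sketch. Everything named here (`Intent`, `asr_transcribe`, `nlu_interpret`, `og_respond`, `handle_utterance`) is hypothetical and chosen only for illustration. The ASR step is a stand-in that treats the audio bytes as already-transcribed text, and the NLU step is a simple pattern match rather than a trained model.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intent:
    name: str                      # e.g. "play_content" or "unknown"
    content_title: Optional[str]   # e.g. "Fly Me to the Moon"


def asr_transcribe(audio: bytes) -> str:
    """Stand-in for an ASR module (assumption: the bytes are UTF-8 text)."""
    return audio.decode("utf-8")


def nlu_interpret(transcription: str) -> Intent:
    """Stand-in for an NLU module: recognizes 'play [me] <title>' commands."""
    match = re.match(r"\s*play\s+(?:me\s+)?(.+)", transcription, re.IGNORECASE)
    if not match:
        return Intent("unknown", None)
    # Drop surrounding quotation marks, spaces, and a trailing period.
    title = match.group(1).strip(" '\"“”‘’.")
    if not title:
        return Intent("unknown", None)
    return Intent("play_content", title)


def og_respond(intent: Intent) -> str:
    """Stand-in for an output generator: turns an intent into a response."""
    if intent.name == "play_content":
        return f"Now playing '{intent.content_title}'."
    return "Sorry, I did not understand that."


def handle_utterance(audio: bytes) -> str:
    """Chains the three stand-in modules: ASR, then NLU, then OG."""
    return og_respond(nlu_interpret(asr_transcribe(audio)))


if __name__ == "__main__":
    print(handle_utterance("Play 'Fly Me to the Moon.'".encode("utf-8")))
```

Because each stage consumes the previous stage's output, an error in the transcription carries through to the intent and to the response, which is why the stages are kept as separate functions in this sketch.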
The generated response can include a user interface element. The user interface element can include the name of a content item. For example, the name of a content item can be a song title, artist name, movie title, etc. The user interface element may take the form of a media output, such as audio output, spoken output, written output, visual output, etc. The output generator may utilize the user interface element to prompt the user for additional information or for confirmation of the correct output. For example, when the user would like a song played, the output generator may prompt the user for confirmation of the correct song (e.g., User: “Play me ‘Fly Me to the Moon.’” OG: “You'd like to play ‘Fly Me to the Moon,’ correct?”), or present the user with a user interface element as part of the generated response (e.g., User: “Play me ‘Fly Me to the Moon.’” OG: “Now playing ‘Fly Me to the Moon.’” Client device begins playing “Fly Me to the Moon.”). While output generators may be used to manage interactions between users and spoken language processing systems, these output generators can still encounter difficulties when trying to resolve recognition errors made by the spoken language processing system.
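The two response styles in the example above, a confirmation prompt and a direct announcement, can be sketched as follows. The text does not say how an output generator chooses between them. This sketch assumes, purely for illustration, that the choice depends on a recognition confidence score compared against a threshold. The names `UIElement` and `build_ui_element`, the `confidence` parameter, and the default threshold of 0.8 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    kind: str          # "confirmation" or "announcement"
    content_name: str  # name of the content item, e.g. a song title
    text: str          # text to be spoken or displayed to the user


def build_ui_element(content_name: str, confidence: float,
                     threshold: float = 0.8) -> UIElement:
    """Builds a user interface element naming the content item.

    Assumption: below the threshold the user is asked to confirm;
    at or above it, playback is simply announced.
    """
    if confidence < threshold:
        text = f"You'd like to play '{content_name},' correct?"
        return UIElement("confirmation", content_name, text)
    text = f"Now playing '{content_name}.'"
    return UIElement("announcement", content_name, text)


if __name__ == "__main__":
    print(build_ui_element("Fly Me to the Moon", confidence=0.55).text)
    print(build_ui_element("Fly Me to the Moon", confidence=0.95).text)
```

In both cases the element carries the name of the content item, so the user can notice a misrecognized title whether or not the system asks for confirmation.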