Voice interfaces determine whether an audible utterance includes a command and how to behave in response. Typically, a response to an uttered command is an execution of that command, whether in the form of a verbal response, an action executing the command, or both. For instance, a response to the uttered command “What is the temperature outside?” could be the audible and/or displayed textual verbalization “seventy-five degrees”. As another example, a response to the uttered command “Play ‘Yesterday’ by the Beatles” could be an action executing the command, i.e., playing the song “Yesterday” with a media playing device, and/or an audible and/or textual confirmation that the command is being executed, e.g., “playing ‘Yesterday’ by the Beatles.”
Whether or not the command is executed (it may not be if, e.g., the command is misinterpreted or unintelligible), the quality of the response can be deficient. Deficiencies in response quality include, for example, insensitivity of the response to one or more emotions detectable in the utterance that includes the command, and/or a time delay between the utterance and the fulfillment of the command. Systems that process both language and emotion can take longer to respond to uttered commands than systems that process language but not emotion, which can frustrate the user issuing the command. Another deficiency can be a response that fails to take into account emotion detectable from past interactions when responding to a present utterance.
WO2017044260A1 describes receiving an audio input containing a media search request, determining a primary user intent corresponding to the media search request, and determining one or more secondary user intents based on one or more previous user intents.
WO2017218243A2 describes a system for adapting an emotion text-to-speech model. A processor receives training examples comprising speech input and labeling data comprising emotion information associated with the speech input. Audio signal vectors are extracted from the training examples, and an emotion-adapted voice font model is generated based on the audio signal vectors and the labeling data.
CN106251871A describes combining emotion recognition and voice recognition to improve home music playing.
US20160019915A1 describes recognizing emotion in audio signals in real time. If for a given audio signal a threshold confidence score for one or more particular emotions is exceeded, the particular emotion or emotions are associated with that audio signal.
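By way of illustration only, the thresholding described above can be sketched as follows. This is a minimal sketch and not the implementation of the cited publication: the function name `emotions_above_threshold`, the score dictionary, and the threshold value of 0.7 are hypothetical and chosen solely for the example.

```python
from typing import Dict, List


def emotions_above_threshold(scores: Dict[str, float],
                             threshold: float = 0.7) -> List[str]:
    """Return the emotions whose confidence score exceeds the threshold.

    `scores` maps an emotion label to a confidence in [0, 1] for one
    audio signal. Both the labels and the default threshold are
    assumptions made for this sketch.
    """
    selected = [emotion for emotion, score in scores.items()
                if score > threshold]
    # Most confident emotion first.
    return sorted(selected, key=lambda emotion: -scores[emotion])


# Hypothetical confidence scores for a single audio signal.
example_scores = {"anger": 0.82, "joy": 0.10, "sadness": 0.74}
associated = emotions_above_threshold(example_scores)
```

In this sketch, an emotion is associated with the audio signal only when its score strictly exceeds the threshold, so a score exactly equal to the threshold is not associated.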
US20140172431A1 describes playing music based on speech emotion recognition.
US20140112556A1 describes using sensors and a processor to analyze acoustic, visual, linguistic, and physical features from signals with machine learning algorithms and extracting an emotional state of a user by analyzing the features.
WO2007098560A1 describes extracting an emotional state from an input data stream from a user.
Davletcharova et al., “Detection and Analysis of Emotion from Speech Signals,” Procedia Computer Science (2015) (https://arxiv.org/ftp/arxiv/papers/1506/1506.06832.pdf) describes experiments relating to detecting the emotional state of a person by speech processing techniques.
U.S. Pat. No. 7,590,538B2 describes recognizing voice commands for manipulating data on the internet, including detecting the emotion of a person based on a voice analysis.
US20020194002A1 describes detecting emotion states in speech signals using statistics. Statistics or features from samples of the voice are calculated from extracted speech parameters. A neural network classifier assigns at least one emotional state from a finite number of possible emotional states to the speech signal.
U.S. Pat. No. 9,788,777B1 describes identifying an emotion that is evoked by media using a mood model.
De Pessemier et al., “Intuitive Human-Device Interaction for Video Control and Feedback,” (https://biblio.ugent.be/publication/8536887/file/8536893.pdf) describes speech and emotion recognition using machine learning.