It has been argued that users' positive or negative reactions to a speech user interface can be affected by the extent to which they “self-identify” with the persona (voice and human characteristics) of the system. It is generally agreed in the human-computer interaction literature that callers can recognize and react to the emotive content of a speech sample in speech recognition systems.
The converse question then arises: can computers recognize and react to the emotive content of what a caller says in a speech user interface? The key problem in addressing this question has been how to develop an algorithm with enough “intelligence” to detect the emotion (or persona) of the caller and then adjust its dialog accordingly.
One current solution to this problem is to capture the voice features (pitch/tone or intonation) of the user and run this information through a pitch-synthesis system to determine the user's emotion (or persona). One of the biggest problems with this approach is its inconclusiveness: the resulting categories of emotion are obtained by matching pitch characteristics (loud, low, normal) with emotional values such as “happy” or “sad,” as well as the indeterminate “neutral.”
The problem with using pitch to determine emotion is that emotional values cannot always be tied to absolute pitch values. For example, a user may be “happy” but speak in a “neutral” voice, or may be sad and yet speak in a happy voice. In addition, it is not clear in this existing approach what constitutes a “neutral” voice, or how it would be measured across a wide range of users who differ in demography, age, and so on.
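The weakness of the pitch-matching approach can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the threshold values, the function names, and the use of a single mean pitch in Hz are hypothetical and are not taken from any existing system. The category labels (low, normal, loud) and the emotion labels follow the description above.

```python
# Hypothetical fixed thresholds in Hz (illustrative values only).
LOW_PITCH_HZ = 120.0
LOUD_PITCH_HZ = 220.0

# Assumed one-to-one mapping from pitch category to emotional value.
EMOTION_BY_CATEGORY = {
    "low": "sad",
    "normal": "neutral",
    "loud": "happy",
}


def pitch_category(mean_pitch_hz: float) -> str:
    """Bucket a mean pitch value into one of three absolute categories."""
    if mean_pitch_hz < LOW_PITCH_HZ:
        return "low"
    if mean_pitch_hz > LOUD_PITCH_HZ:
        return "loud"
    return "normal"


def classify_emotion(mean_pitch_hz: float) -> str:
    """Map the pitch category directly to an emotion label."""
    return EMOTION_BY_CATEGORY[pitch_category(mean_pitch_hz)]


if __name__ == "__main__":
    # A caller who is actually happy but speaks at 170 Hz is labeled
    # "neutral", because only the absolute pitch value is consulted.
    print(classify_emotion(170.0))
```

Because the thresholds are absolute, the same 170 Hz value could be a raised voice for one speaker and a subdued voice for another, and the sketch has no way to tell the two apart.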