In the new, connected economy, it has become increasingly important for companies or service providers to become more in tune with their clients and customers. Such contact can be facilitated with automated telephonic transaction systems, in which interactively-selected prompts are played in the context of a telephone transaction, and the replies of a human user are recognized by an automatic speech recognition system. The answers given by the respondent are processed by the system in order to convert the spoken words to meaning, which can then be utilized interactively, or stored in a database. One example of such a system is described in U.S. Pat. No. 6,990,179, issued in the names of Lucas Merrow et al. on Jan. 24, 2006, and assigned to the present assignee, further discussed below.
In order for a computer system to recognize the words that are spoken and convert these words to text, the system must be programmed to phonetically break down the spoken words and convert portions of the words to their textual equivalents. Such a conversion requires an understanding of the components of speech and the formation of the spoken word. The production of speech generates a complex series of rapidly changing acoustic pressure waveforms. These waveforms comprise the basic building blocks of speech, known as phonemes. Vowels and consonants are phonemes and have many different characteristics, depending on which components of human speech are used. The position of a phoneme in a word has a significant effect on the ultimate sound generated. A spoken word can have several meanings, depending on how it is said. Linguists have identified allophones as acoustic variants of phonemes and use them to more explicitly describe how a particular word is formed.
To successfully interact on the telephone with people of all ages and from all geographic regions, it is essential to be as clear as possible when directing spoken voice prompts to such people. The people receiving an automated call should preferably know immediately when they are being asked questions that require an answer and when they are being presented with important new information. It is sometimes desirable that the speaker follow up with certain information, in which case the information presented should be salient to them in order to maximize the chances of a desired follow-up action.
Prior art has dealt with adjusting a voice user interface based on a user's previous interactions with the system. Other prior art has proposed digital enhancements or adjustments to speech to make it more comprehensible to the hearing impaired. However, little or no attention has been paid to controlling the parameters of the spoken prompts in an optimal way so as to achieve naturalness of speech in the prompts.
A prior art system for transcribing the intonation patterns and other aspects of the prosody of English utterances is the ToBI system (standing for “Tones” and “Break Indices”), as described by Mary E. Beckman and Gayle Ayers Elam of the Ohio State University in 1993. See Guidelines for ToBI Labelling, The Ohio State University Research Foundation, Mary E. Beckman & Gayle Ayers Elam, Ohio State University (ver. 3, March 1997).
The ToBI system was devised by a group of speech scientists from various disciplines (electrical engineering, psychology, linguistics, etc.) who wanted a common standard for transcribing an agreed-upon set of prosodic elements, so that prosodically transcribed databases could be shared across research sites in the pursuit of diverse research purposes and varied technological goals. See, also, Ayers et al., Modelling Dialogue Intonation, International Congress of Phonetic Sciences 1995, Stockholm, August 1995.
While prior art systems and techniques may be sufficient for their respective intended purposes, there exists a need for phone-based automated interactive informational systems and techniques that encourage participation and promote comprehension among a wide range of patients or listeners. Additionally, it is desirable to have systems and techniques that provide more natural and effective speech prompts for telephony-based interaction.