For a considerable time in the history of speech synthesis, the speech produced has been mostly `neutral` in tone, or in the worst case, monotone, i.e., it has sounded disinterested, or deficient, in vocal emotionality. This is why the synthesized intonation produced by prior art systems frequently sounded robotic, wooden and otherwise unnatural. Furthermore, synthetic speech research has been directed primarily towards maximizing intelligibility rather than including naturalness or variety. Recent investigations into techniques for adding emotional affect to synthesized speech have produced mixed results, and have concentrated on parametric synthesizers which generate speech through mathematical manipulations rather than on concatenative systems which combine segments of stored natural speech.
Text-to-speech systems usually incorporate rules for the application of intonational attributes for the text submitted for synthetic output. However, these rule systems generate generally neutral tones and, further, are not well suited for authoring or editing emotional prose at a high level. The problem lies not only in the terminology, for example "baseline-pitch", but also in the difficulty of quantifying these terms. If given the task of entering a stage play into a synthetic speech environment, it would be unbearable (or, at the very least, highly challenging for the layperson) to have to choose numerical values for the various speech parameters in order to incorporate vocal emotion into each word spoken.
For example, prior art speech synthesizers have provided for the customization of the prosody or intonation of synthetic speech, generally using either high-level or low-level controls. The high-level controls generally include text mark-up symbols, such as a pause indicator or pitch modifier. An example of prior art high-level text mark-up phonetic controls is taken from the Digital Equipment Corporation DECtalk DTC03 (a commercial text-to-speech system) Owner's Manual where the input text string:
It's a mad mad mad mad world. PA1 It's a /!mad .backslash.!mad /!mad .backslash.!mad /.backslash.!world. PA1 ow&lt;1000&gt;! PA1 ow&lt;,90&gt;! PA1 ow&lt;1000,90&gt;!
can have its prosody customized as follows:
where /! indicates pitch rise, and .backslash.! indicates pitch fall.
Some prior art synthesizers also provide the user with direct control over the output duration and pitch of phonetic symbols. These are the low-level controls. Again, examples from DECtalk:
causes the sound ow! (as in "over") to receive a duration specification of 1000 milliseconds (ms); while
causes ow! to receive its default duration, but it will achieve a pitch value of 90 Hertz (Hz) at the end; while
causes ow! to be 1000 ms long, and to be 90 Hz at the end.
So, on the one hand, the disadvantage of the high-level controls is that they give only a very approximate effect and lack intuitiveness or direct connection between the control specification and the resulting or desired vocal emotion of the synthetic speech. Further, it may be impossible to achieve the desired intonational or vocal emotion effect with such a coarse control mechanism.
And on the other hand, the disadvantage of the low-level controls is that even the intonational or vocal emotion specification for a single utterance can take many hours of expert analysis and testing (trial and error), including measuring and entering detailed Hertz and milliseconds specifications by hand. Further, this is clearly not a task an average user can tackle without considerable knowledge and training in the various speech parameters available.
What is needed, therefore, is an intuitive graphical interface for specification and modification of vocal emotion of synthetic speech. Of course, other graphical interfaces for modification of sound currently exist. For example, commercial products such as SoundEdit.RTM., by Farallon Computing, Inc., provide for manipulation of raw sound waveforms. However, SoundEdit.RTM. does not provide for direct user manipulation of the waveform (instead, the portion of the waveform to be modified is selected and then a menu selection is made for the particular modification desired).
Further, manipulation of raw waveforms does not provide a clear intuitive means to specify vocal emotion in the synthetic speech because of the lack of clear connection between the displayed waveform and the desired vocal emotion. Simply put, by looking at a waveform of human speech, a user cannot easily ascertain how it (or modifications to it) will sound when played through a loudspeaker, particularly if the user is attempting to provide some sort of vocal emotion to the speech.
By contrast, the present invention is completely intuitive. The present invention provides for authoring, direct manipulation and visual representation of emotional synthetic speech in a simplified format with a high level of abstraction. A user can easily predict how the text authored with the graphical editor of the present invention will sound because of the power of the explicit and intuitive visual representation of vocal parameters.
Further, the present invention provides for the automatic specification of prosodic controls which create vocal emotional affect in synthetic speech produced with a concatenative speech synthesizer.
First of all, it is important to understand that speech has two main components: verbal (the words themselves), and vocal (intonation and voice quality). The importance of vocal components in speech may be indicated by the fact that children can understand emotions in speech before they can understand words. Intonation is effected by changes in the pitch, duration and amplitude of speech segments. Voice quality (e.g. nasal, breathy, or hoarse) is intrasegmental, depending on the individual vocal tract. Note that a glossary has been included as Appendix A for further clarification of some of the terms used herein.
Along a sliding scale of `affect`, voices may be heard to contain personalities, moods, and emotions. Personality has been defined as the characteristic emotional tone of a person over time. A mood may be considered a maintained attitude; whereas an emotion is a more sudden and more subtle response to a particular stimulus, lasting for seconds or minutes. The personality of a voice may therefore be regarded as its largest effect, and an emotion its smallest. The term `vocal emotion` will be used herein to encompass the full range of `affect` in a voice.
The full range of attributes may be created in synthesized speech. Voice parameters affected by emotion are the pitch envelope (a combination of the speaking fundamental frequency, the pitch range, the shape and timing of the pitch contour), overall speech rate, utterance timing (duration of segments and pauses), voice quality, and intensity (loudness).
If computer memory and processing speed were unlimited, one method for creating vocal emotions would be to simply store words spoken in varying emotional ways by a human being. In the present state of the art, this approach is impractical. Rather than being stored, emotions have to synthesized on-line and in real-time. In parametric synthesizers (of which DECtalk is the most well-known and most successful), there may be as many as thirty basic acoustic controls available for altering pitch, duration and voice quality. These include e.g., separate control of formants' values and bandwidths; pitch movements on, and duration of, individual segments; breathiness; smoothness; richness; assertiveness; etc. Precision of articulation of individual segments (e.g., fully released stops, degree of vowel reduction), which is controllable in DECtalk, can also contribute to the perception of emotions such as tenderness and irony. These parameters may be manipulated to create voice personalities; DECtalk is supplied with nine different `Voices` or personalities. It should be noted that intensity (volume) is not controllable within an utterance in DECtalk.
With a concatenative speech synthesizer, the type used in the preferred embodiment of the present invention, the range of acoustic controls is severely limited. Firstly, it is not possible to alter the voice quality of the speaker, since the speech is created from the recording of only one live speaker (who has their individual voice quality) speaking in one (neutral) vocal mode, and parameters for manipulating positions of the vocal folds are not possible in this type of synthesizer. Secondly, precision of articulation of individual segments is not controllable with concatenative synthesizers. It is nonetheless possible with the speech synthesizer used in the preferred embodiment of the present invention to control the parameters listed below:
TABLe 1 ______________________________________ Parameter Speech Synthesizer Commands ______________________________________ 1. Average speaking pitch Baseline Pitch (pbas) 2. Pitch range Pitch Modulation (pmod) 3. Speech rate Speaking rate (rate) 4. Volume Volume (volm) 5. Silence Silence (slnc) 6. Pitch movements Pitch rise (/), pitch fall (.backslash.) 7. Duration Lengthen (&gt;), shorten (&lt;) ______________________________________
Although there are seven parameters listed in the table above, the present invention claims that for concatenative synthesizers, it is possible to produce a wide range of emotional affect using the interplay of only five parameters--since Speech rate and Duration, and Pitch range and Pitch movements are, respectively, effected by the same acoustic controls. In other words, the present invention is capable of providing an automatic application of vocal emotion to synthetic speech through the interplay of only the first five elements listed in the table above.
Further, the present invention is not concerned with the details of how emotions are perceived in speech (since this is known to be idiosyncratic and varies among users), but rather with the optimal means of producing synthesized emotions from a restricted number of parameters, while still maintaining optimal quality in the visual interface and synthetic speech domains.