The present invention relates generally to the field of text-to-speech conversion (i.e., speech synthesis) and more particularly to a method and apparatus for capturing personal speaking styles and for driving a text-to-speech system so as to convey such specific speaking styles.
Although current state-of-the-art text-to-speech conversion systems are capable of providing reasonably high-quality, nearly human-sounding speech, they typically train the prosody attributes of the speech based on data from a specific speaker. In certain text-to-speech applications, however, it would be highly desirable to be able to capture a particular style, such as, for example, the style of a specifically identifiable person or of a particular class of people (e.g., a southern accent).
While the value of a style is subjective and involves personal, social and cultural preferences, the existence of style itself is objective and implies that there is a set of consistent features. These features, especially those of a distinctive, recognizable style, lend themselves to quantitative studies and modeling. A human impressionist, for example, can deliver a stunning performance by dramatizing the most salient feature of an intended style. Similarly, at least in theory, it should be possible for a text-to-speech system to successfully convey the impression of a style when a few distinctive prosodic features are properly modeled. However, to date, no such text-to-speech system has been able to achieve such a result in a flexible way.
In accordance with the present invention, a novel method and apparatus for synthesizing speech from text is provided, whereby the speech may be generated in a manner so as to effectively convey a particular, selectable style. In particular, repeated patterns of one or more prosodic features, such as, for example, pitch (also referred to herein as "f0", the fundamental frequency of the speech waveform, since pitch is merely the perceptual effect of f0), amplitude, spectral tilt, and/or duration, occurring at characteristic locations in the synthesized speech, are advantageously used to convey a particular chosen style. In accordance with one illustrative embodiment of the present invention, for example, one or more of such feature patterns may be used to define a particular speaking style, and an illustrative text-to-speech system then makes use of such a defined style to adjust the specified parameter or parameters of the synthesized speech in a non-uniform manner (i.e., in accordance with the defined feature pattern or patterns).
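By way of illustration only, the following minimal Python sketch shows what a non-uniform adjustment of f0 in accordance with a feature pattern might look like. The function names (resample, apply_f0_pattern), the representation of a feature pattern as a short list of f0 scale factors, and the use of linear interpolation are all assumptions made for this example and are not taken from the present disclosure.

```python
from typing import List


def resample(template: List[float], n: int) -> List[float]:
    """Linearly interpolate a template of scale factors to n points."""
    if n <= 0:
        return []
    if n == 1 or len(template) == 1:
        return [template[-1]] * n
    out = []
    for i in range(n):
        # Position of output point i along the template's index axis.
        pos = i * (len(template) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(template) - 1)
        frac = pos - lo
        out.append(template[lo] * (1 - frac) + template[hi] * frac)
    return out


def apply_f0_pattern(f0: List[float], start: int, end: int,
                     template: List[float]) -> List[float]:
    """Scale f0[start:end] by the template; leave other frames unchanged.

    The adjustment is non-uniform: each frame in the span receives its own
    scale factor, and frames outside the span receive none.
    """
    scales = resample(template, end - start)
    span = [value * scale for value, scale in zip(f0[start:end], scales)]
    return f0[:start] + span + f0[end:]


# Hypothetical example: a flat 100 Hz contour with a rise-fall pattern
# applied to the last three frames only.
flat = [100.0] * 6
styled = apply_f0_pattern(flat, 3, 6, [1.0, 1.5, 1.0])
```

In this sketch, a uniform adjustment would correspond to multiplying every frame by the same factor, whereas the pattern shapes only the frames at the chosen location.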
More specifically, the present invention provides a method and apparatus for synthesizing a voice signal based on a predetermined voice control information stream (which, illustratively, may comprise text, annotated text, or a musical score), where the voice signal is selectively synthesized to have a particular desired prosodic style. In particular, the method and apparatus of the present invention comprise steps or means for analyzing the predetermined voice control information stream to identify one or more portions thereof for prosody control; selecting one or more prosody control templates based on the particular prosodic style which has been selected for the voice signal synthesis; applying the one or more selected prosody control templates to the one or more identified portions of the predetermined voice control information stream, thereby generating a stylized voice control information stream; and synthesizing the voice signal based on this stylized voice control information stream so that the synthesized voice signal advantageously has the particular desired prosodic style.
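The four steps recited above (analyzing, selecting, applying, and synthesizing) may be sketched, purely illustratively, as the following Python pipeline. Every name here is hypothetical: the Unit record, the "emph" annotation tag, the STYLE_TEMPLATES table with its style names and scale factors, and the placeholder synthesize function, which returns control data rather than audio, are assumptions introduced for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Unit:
    """One element of the voice control information stream (hypothetical)."""
    text: str
    tag: Optional[str] = None  # assumed annotation, e.g. "emph"
    f0_scale: List[float] = field(default_factory=lambda: [1.0])


# Assumed style table: style name -> annotation tag -> f0 scale template.
STYLE_TEMPLATES: Dict[str, Dict[str, List[float]]] = {
    "dramatic": {"emph": [1.0, 1.4, 0.9]},
    "neutral": {},
}


def analyze(stream: List[Unit]) -> List[int]:
    """Step 1: identify the portions of the stream marked for prosody control."""
    return [i for i, unit in enumerate(stream) if unit.tag is not None]


def select_templates(style: str) -> Dict[str, List[float]]:
    """Step 2: select the prosody control templates for the chosen style."""
    return STYLE_TEMPLATES[style]


def apply_templates(stream: List[Unit], portions: List[int],
                    templates: Dict[str, List[float]]) -> List[Unit]:
    """Step 3: apply templates to the identified portions only."""
    marked = set(portions)
    stylized = []
    for i, unit in enumerate(stream):
        template = templates.get(unit.tag) if i in marked else None
        scale = list(template) if template else [1.0]
        stylized.append(Unit(unit.text, unit.tag, scale))
    return stylized


def synthesize(stylized: List[Unit]) -> List[Tuple[str, List[float]]]:
    """Step 4 (placeholder): emit per-unit control data instead of audio."""
    return [(unit.text, unit.f0_scale) for unit in stylized]


def stylized_synthesis(stream: List[Unit],
                       style: str) -> List[Tuple[str, List[float]]]:
    portions = analyze(stream)
    templates = select_templates(style)
    return synthesize(apply_templates(stream, portions, templates))


stream = [Unit("I"), Unit("really", "emph"), Unit("mean")]
dramatic = stylized_synthesis(stream, "dramatic")
neutral = stylized_synthesis(stream, "neutral")
```

In this sketch the same annotated input yields different stylized streams depending only on the selected style, which is the sense in which the style is selectable; a real system would pass the stylized stream to an actual synthesizer in place of the placeholder.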