1. Technical Field
The present invention relates to speech synthesis and, more particularly, to methods for generating pitch and duration contours in a text to speech system.
2. Discussion of Related Prior Art
Speech generation is the process that transforms a string of phonetic and prosodic symbols into a synthetic speech signal. Text to speech (TtS) systems create synthetic speech directly from text input. Generally, two criteria are required of TtS systems: the first is intelligibility, and the second is pleasantness or naturalness. Most current TtS systems produce an acceptable level of intelligibility, but they lack naturalness, that is, the ability to let a listener attribute a synthetic voice to some pseudo-speaker and to perceive some kind of expressivity, as well as cues characterizing the speaking style and the particular situation of elocution. However, certain fields of application, such as telephonic information retrieval, require maximal realism and naturalness. As such, it would be valuable to provide a method for instilling a high degree of naturalness in text to speech synthesis.
For synthesis of natural-sounding speech, it is essential to control prosody. Prosody refers to the set of speech attributes which do not alter the segmental identity of speech segments, but instead affect the quality of the speech. An example of a prosodic element is lexical stress. It is to be appreciated that the lexical stress pattern within a word plays a key role in determining the way that word is synthesized, as stress in natural speech is typically realized physically by an increase in pitch and phoneme duration. Thus, acoustic attributes such as pitch and segmental duration patterns indicate much about prosodic structure. Therefore, modeling them greatly improves the naturalness of synthetic speech.
However, conventional speech synthesis systems do not supply an appropriate pitch to synthesized speech. Instead, they use flat pitch contours, each corresponding to a constant value of pitch, so that the resulting speech waveforms sound unnatural, monotone, and boring to listeners.
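The contrast between a flat contour and a stress-sensitive one can be illustrated with a minimal sketch. It is not the method of any particular system: the base pitch, base duration, and stress scaling factors below are assumed values chosen only for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

# All constants are illustrative assumptions, not values from the text.
BASE_PITCH_HZ = 120.0          # assumed constant pitch of a flat contour
BASE_DURATION_MS = 80.0        # assumed default phoneme duration
STRESS_PITCH_SCALE = 1.2       # assumed pitch increase under lexical stress
STRESS_DURATION_SCALE = 1.3    # assumed duration increase under lexical stress

Phoneme = Tuple[str, bool]     # (phoneme symbol, is_stressed)
Target = Tuple[float, float]   # (pitch in Hz, duration in ms)


def flat_contour(phonemes: List[Phoneme]) -> List[Target]:
    """Assign the same pitch and duration to every phoneme (monotone output)."""
    return [(BASE_PITCH_HZ, BASE_DURATION_MS) for _ in phonemes]


def stressed_contour(phonemes: List[Phoneme]) -> List[Target]:
    """Raise pitch and lengthen duration on stressed phonemes only."""
    targets = []
    for _symbol, is_stressed in phonemes:
        if is_stressed:
            targets.append((BASE_PITCH_HZ * STRESS_PITCH_SCALE,
                            BASE_DURATION_MS * STRESS_DURATION_SCALE))
        else:
            targets.append((BASE_PITCH_HZ, BASE_DURATION_MS))
    return targets


# Hypothetical two-vowel word with stress on the first vowel.
word = [("EH", True), ("L", False), ("OW", False)]
flat = flat_contour(word)
varied = stressed_contour(word)
```

In this sketch the flat contour gives every phoneme identical targets, whereas the stressed vowel in the varied contour receives a higher pitch and a longer duration than its neighbors.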
Early attempts to provide a speech synthesis system with pitch typically involved the use of rules derived from phonetic theories and acoustic analysis. These non-statistical, rule-based approaches cannot learn from training data and therefore yield rigid systems that are unable to adapt to a specific style of speech or speaker characteristic without a complete rewrite of the rules by a speech expert. More recent work on prosody in speech synthesis has taken a statistical approach (e.g., linear regression analysis and tree regression analysis).
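As a rough illustration of what a statistical approach involves, the following sketch fits a one-variable least-squares line that predicts pitch from a binary stress indicator. The training values are invented toy numbers, not measurements, and the function names are hypothetical; a practical system would use many more features and far more data.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept: float, slope: float, x: float) -> float:
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Toy training data (invented for illustration): 1.0 marks a stressed vowel.
stress = [0.0, 1.0, 0.0, 1.0]
pitch_hz = [110.0, 140.0, 114.0, 144.0]

intercept, slope = fit_line(stress, pitch_hz)
stressed_pitch = predict(intercept, slope, 1.0)
unstressed_pitch = predict(intercept, slope, 0.0)
```

Unlike a hand-written rule, the relationship between stress and pitch here is estimated from the data, so refitting on recordings of a different speaker or speaking style would change the predictions without any rule being rewritten.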
Implementing a non-constant pitch contour and varying the durations of individual phonemes have the potential to dramatically increase the quality of synthesized speech. Accordingly, it would be desirable and highly advantageous to provide methods for generating pitch and duration contours in a text to speech system.