Speech synthesis techniques generate speech-like waveforms from textual words or symbols. Speech synthesis systems have been used for various applications, including speech-to-speech translation applications, where a spoken phrase is translated from a source language into one or more target languages. In a speech-to-speech translation application, a speech recognition system translates the acoustic signal into a computer-readable format, and the speech synthesis system reproduces the spoken phrase in the desired language.
FIG. 1 is a schematic block diagram illustrating a typical conventional speech synthesis system 100. As shown in FIG. 1, the speech synthesis system 100 includes a text analyzer 110 and a speech generator 120. The text analyzer 110 analyzes input text and generates a symbolic representation 115 containing linguistic information required by the speech generator 120, such as phonemes, word pronunciations, phrase boundaries, relative word emphasis, and pitch patterns. The speech generator 120 produces the speech waveform 130. For a general discussion of speech synthesis principles, see, for example, S. R. Hertz, “The Technology of Text-to-Speech,” Speech Technology, 18-21 (April/May, 1997), incorporated by reference herein.
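By way of illustration only, the two-stage arrangement of FIG. 1 can be sketched as follows. All names here (SymbolicRepresentation, analyze_text, generate_speech, samples_per_word) are hypothetical and do not appear in the referenced system. The analysis is reduced to word splitting and punctuation-based phrase boundaries, and the generator emits placeholder silence rather than real speech.

```python
from dataclasses import dataclass


@dataclass
class SymbolicRepresentation:
    """Hypothetical stand-in for the symbolic representation 115."""
    words: list              # normalized words (a real system would carry phonemes)
    phrase_boundaries: list  # indices of words that end a phrase
    emphasis: list           # relative emphasis per word (uniform in this toy)


def analyze_text(text):
    """Toy text analyzer 110: split words, mark boundaries at punctuation."""
    words, boundaries = [], []
    for token in text.split():
        stripped = token.strip(".,;:!?")
        if not stripped:
            continue
        words.append(stripped.lower())
        if token[-1] in ".,;:!?":
            boundaries.append(len(words) - 1)
    return SymbolicRepresentation(words, boundaries, [1.0] * len(words))


def generate_speech(rep, samples_per_word=4800):
    """Toy speech generator 120: returns silence of a plausible length.

    A real generator would retrieve, concatenate, and modify speech segments.
    """
    return [0.0] * (len(rep.words) * samples_per_word)


rep = analyze_text("Hello, world.")
waveform = generate_speech(rep)
```

The point of the sketch is the interface: the generator sees only the symbolic representation and never the raw text.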
In a concatenative speech synthesis system, stored segments of human speech are typically pieced together to produce the speech output. When an utterance is synthesized by the speech generator 120, the corresponding speech segments are retrieved, concatenated, and modified to reflect prosodic properties of the utterance, such as intonation and duration. Each of the concatenated speech segments has an inherent natural pitch contour that was uttered by the speaker. However, when small portions of natural speech arising from different utterances in the segment database are concatenated, the resulting synthetic speech does not have a natural-sounding pitch contour.
To produce natural-sounding speech, the speech generator 120 must produce acoustic values, durations, and pitch patterns that simulate properties of human speech. The acoustic values and durations of a speech segment depend on the neighboring segments, degree of syllable stress and position in the syllable. Pitch patterns are a function of linguistic properties of the utterance as a whole. Prediction of the pitch patterns is an important aspect of generating natural-sounding speech.
Typically, the pitch contour of the concatenated segments is modified using a predefined pitch contour, obtained by either a statistical or a rule-based method, that is imposed on the synthetic speech using digital signal processing techniques. The desired contour is typically specified as one or more values per vowel or syllable. Thereafter, the pitch contour values associated with each syllable are connected, for example, using a piecewise linear function, resulting in a continuous function of pitch versus time throughout the synthetic utterance.
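For illustration, the piecewise linear connection of per-syllable pitch values can be sketched as below. This is a minimal sketch under stated assumptions: the function name pitch_contour, the (time in seconds, pitch in Hz) target format, and the example values are all hypothetical, and the contour is simply held constant before the first target and after the last.

```python
from bisect import bisect_right


def pitch_contour(targets, t):
    """Return the interpolated pitch (Hz) at time t (seconds).

    targets: list of (time_s, pitch_hz) pairs sorted by time, e.g. one
    value per syllable (an assumed format for this sketch).
    """
    times = [time for time, _ in targets]
    # Hold the end values outside the specified range.
    if t <= times[0]:
        return targets[0][1]
    if t >= times[-1]:
        return targets[-1][1]
    # Find the two targets that bracket t and interpolate linearly.
    i = bisect_right(times, t)
    (t0, f0), (t1, f1) = targets[i - 1], targets[i]
    return f0 + (f1 - f0) * (t - t0) / (t1 - t0)


# Hypothetical targets: three syllables at 0.0 s, 0.2 s, and 0.5 s.
targets = [(0.0, 120.0), (0.2, 140.0), (0.5, 100.0)]
midpoint_pitch = pitch_contour(targets, 0.1)  # halfway between 120 and 140 Hz
```

The resulting function is continuous in time, but its slope changes abruptly at each target, which is one way such contours can differ from the pitch movement of natural speech.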
While speech synthesis systems employing such pitch contour techniques perform effectively for a number of applications, they suffer from a number of limitations which, if overcome, could greatly expand the performance and utility of such speech synthesis systems. Specifically, currently available speech synthesis systems 100 fail to produce speech that sounds like a natural human speaker. A need therefore exists for a speech synthesis system that utilizes a pitch contour resulting in more natural-sounding speech.