Effective text-to-speech (TTS) conversion requires not only that the acoustic TTS output be phonetically correct, but also that it faithfully reproduce the sound and prosody of human speech. When the range of phrases and sentences to be reproduced is fixed, and the TTS converter has sufficient memory resources, it is possible simply to record a collection of all of the phrases and sentences that will be used, and to recall them as required. This approach is not practical, however, when the text input is arbitrarily variable, or when speech is to be synthesized by a device having only limited memory resources, such as an embedded speech synthesizer in a handheld computing or communication device.
TTS systems for synthesis of arbitrary speech typically perform three essential functions:
1. Division of text into synthesis units, or segments, such as phonemes or other subdivisions.
2. Determination of prosodic parameters, such as segment duration, pitch and energy.
3. Conversion of the synthesis units and prosodic parameters into a speech stream.
A useful survey of these functions and of different approaches to their implementation is presented by Robert Edward Donovan in Trainable Speech Synthesis (Ph.D. dissertation, University of Cambridge, 1996), which is incorporated herein by reference. The present invention is concerned primarily with the third function, i.e., generation of a natural, intelligible speech stream from a sequence of phonetic and prosodic parameters.
In order to synthesize high-quality speech from an arbitrary text input, a large database is created, containing speech segments in a variety of different phonetic contexts. For any given text input, the synthesizer then selects the optimal segments from the database. Typically, the selection is based on a feature representation of the speech, such as mel-frequency cepstral coefficients (MFCCs). These coefficients are computed by integration of the spectrum of the recorded speech segments over triangular bins on a mel-frequency axis, followed by log and discrete cosine transform operations. Computation of MFCCs is described, for example, by Davis et al. in “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-28 (1980), pp. 357–366, which is incorporated herein by reference. Other types of feature representations are also known in the art.
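By way of illustration only, the MFCC computation outlined above (integration of the spectrum over triangular mel-spaced bins, followed by log and discrete cosine transform operations) may be sketched as follows. This is a minimal sketch and not the procedure of Davis et al.: the function names, the mel-scale formula (2595 log10(1 + f/700)), the use of a power spectrum, the naive DFT, and the default filter and coefficient counts are all assumptions made for the example.

```python
import cmath
import math


def hz_to_mel(hz):
    # One commonly used mel-scale formula (an assumption, not taken from the text).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel axis."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # num_filters + 2 edge points: filter i spans edges[i]..edges[i + 2].
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bank = []
    for i in range(num_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        weights = []
        for b in range(num_bins):
            f = nyquist * b / (num_bins - 1)  # centre frequency of DFT bin b
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate, num_filters=12, num_coeffs=8):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spec), sample_rate)
    # Integrate the spectrum over each triangular bin, then take the log.
    # The small constant guards against log(0) for empty bins.
    log_energies = [
        math.log(sum(w * p for w, p in zip(weights, spec)) + 1e-10)
        for weights in bank
    ]
    # A DCT-II of the log energies yields the cepstral coefficients.
    return [
        sum(e * math.cos(math.pi * k * (m + 0.5) / num_filters)
            for m, e in enumerate(log_energies))
        for k in range(num_coeffs)
    ]
```

For example, calling mfcc(frame, 8000) on a 128-sample frame returns a list of eight coefficients that can serve as the feature vector for that frame. A practical implementation would typically window the frame and use a fast Fourier transform in place of the naive DFT shown here.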
In order to dynamically choose the optimal segments from the database in real time, the synthesizer applies a cost function to the feature vectors of the speech segments, based on a measure of vector distance. The synthesizer then concatenates the selected segments, while adjusting their prosody and pitch to provide a smooth, natural speech output. Typically, Pitch Synchronous Overlap and Add (PSOLA) algorithms are used for this purpose, such as the Time Domain PSOLA (TD-PSOLA) algorithm described in the above-mentioned thesis by Donovan. This algorithm breaks speech segments into many short-term (ST) signals by Hanning windowing. The ST signals are altered to adjust their pitch and duration, and are then recombined using an overlap-add scheme to generate the speech output.
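The distance-based segment selection described above may be sketched, under simplifying assumptions, as a dynamic-programming search. The names distance and select_segments, the Euclidean metric, the join_weight parameter, and the use of a single feature vector per segment for both the target cost and the join cost are hypothetical choices made for this example; practical systems commonly compare boundary frames and use more elaborate cost functions.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_segments(targets, candidates, join_weight=1.0):
    """Pick one candidate per position so that the total cost is minimal.

    targets[i]    : desired feature vector at position i
    candidates[i] : list of (segment_id, feature_vector) for position i
    Returns (total_cost, list_of_segment_ids).
    """
    # best[j] = (accumulated cost, chosen ids) for paths ending at candidate j.
    best = [(distance(targets[0], feat), [seg_id]) for seg_id, feat in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for seg_id, feat in candidates[i]:
            target_cost = distance(targets[i], feat)
            # Cheapest predecessor, including the join cost between neighbours.
            options = [
                (prev_cost + join_weight * distance(prev_feat, feat), prev_path)
                for (prev_cost, prev_path), (_, prev_feat) in zip(best, candidates[i - 1])
            ]
            cost, path = min(options, key=lambda item: item[0])
            new_best.append((cost + target_cost, path + [seg_id]))
        best = new_best
    return min(best, key=lambda item: item[0])


targets = [[0.0, 0.0], [1.0, 1.0]]
candidates = [
    [("a1", [0.1, 0.0]), ("a2", [2.0, 2.0])],
    [("b1", [5.0, 5.0]), ("b2", [1.0, 0.9])],
]
total_cost, chosen = select_segments(targets, candidates)
```

Because each candidate is scored against every candidate at the preceding position, the search cost grows with the product of adjacent candidate counts, which is one reason that a richer cost function increases the computational load of selection.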
Although PSOLA schemes generally give good speech quality, they require a large database of carefully chosen speech segments. One of the reasons for this requirement is that PSOLA is very sensitive to prosody changes, especially pitch modification. Therefore, in order to minimize the prosody modifications at synthesis time, the database must contain segments with a large variety of pitch and duration values. Other problems with PSOLA schemes include:
- Frequent mismatch between the selection process, which is based on spectral features extracted from the speech, and the concatenation process, which is applied to the ST signals. The result is audible discontinuities in the synthesized signal (typically resulting from phase mismatches).
- High computational complexity of the segment selection process, caused by a complex cost function usually introduced to overcome the limitations mentioned above.
- Large additional overhead to the speech data in the database (for example, pitch marking and features for segment selection) and a complex database generation (training) process.
There is therefore a need for a speech synthesis technique that can provide high-quality speech output without the large memory requirements and computational cost that are associated with PSOLA and other concatenative methods known in the art.
Various methods of concatenative speech synthesis are described in the patent literature. For example, U.S. Pat. No. 4,896,359, to Yamamoto et al., whose disclosure is incorporated herein by reference, describes a speech synthesizer that operates by actuating a voice source and a filter, which processes the voice source output based on a succession of short-interval feature vectors. U.S. Pat. No. 5,165,008, to Hermansky et al., whose disclosure is likewise incorporated herein by reference, describes a method for speech synthesis using perceptual linear prediction parameters, based on a speaker-independent set of cepstral coefficients. U.S. Pat. No. 5,740,320, to Itoh, whose disclosure is also incorporated herein by reference, describes a method of text-to-speech synthesis by concatenation of representative phoneme waveforms selected from a memory. The representative waveforms are chosen by clustering phoneme waveforms recorded in natural speech, and selecting the waveform closest to the centroid of each cluster as the representative waveform for the cluster.
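The selection of a representative waveform as the cluster member closest to the centroid may be illustrated by the following minimal sketch. It is not the method disclosed by Itoh: it assumes that each waveform has already been reduced to a fixed-length feature vector and assigned to a cluster, and the names centroid and representative, the labels, and the Euclidean metric are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def representative(cluster):
    """Return the (label, vector) member lying closest to the cluster centroid."""
    centre = centroid([vec for _, vec in cluster])
    return min(cluster, key=lambda item: math.dist(item[1], centre))


# Hypothetical cluster of three recorded instances of the same phoneme.
cluster = [("w1", [0.0, 0.0]), ("w2", [1.0, 1.0]), ("w3", [4.0, 4.0])]
label, vector = representative(cluster)
```

Choosing an actual member of the cluster, rather than the centroid itself, means that the stored representative is always a waveform that was really recorded.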
Similarly, U.S. Pat. No. 5,751,907, to Moebius et al., whose disclosure is incorporated herein by reference, describes a speech synthesizer having an acoustic element database that is established from phonetic sequences occurring in an interval of natural speech. The sequences are chosen so that perceptible discontinuities at junction phonemes between acoustic elements are minimized in the synthesized speech. U.S. Pat. No. 5,913,193, to Huang et al., whose disclosure is also incorporated herein by reference, describes a concatenative speech synthesis system that stores multiple instances of each acoustic unit during a training phase. The synthesizer chooses the instance that most closely resembles a desired instance, so that the need to alter the stored instance is reduced, while also reducing spectral distortion between the boundaries of adjacent instances.
U.S. Pat. No. 6,041,300, to Ittycheriah et al., whose disclosure is incorporated herein by reference, describes a speech recognition system that synthesizes and replays words that are spoken into the system so that the speaker can confirm that the word is correct. The system uses a waveform database, from which appropriate waveforms are selected, followed by acoustic adjustment and concatenation of the waveforms. For the purpose of speech recognition, the component phonemes in the spoken words are divided into sub-units, known as lefemes, which are the beginning, middle and ending portions of the phoneme. The lefemes are modeled and analyzed using Hidden Markov Models (HMMs). HMM-modeling of lefemes can also be used in speech synthesis, as described in the above-mentioned U.S. Pat. No. 5,913,193 and in Donovan's thesis.