The function of a text-to-speech (TTS) synthesis system is to synthesize speech from a generic text in a given language. Nowadays, TTS systems have been put into practical operation for many applications, such as access to databases through the telephone network or aid to handicapped people. One method to synthesize speech is by concatenating elements of a recorded set of subunits of speech such as demisyllables or polyphones. The majority of successful commercial systems employ the concatenation of polyphones. The polyphones comprise groups of two (diphones), three (triphones) or more phones and may be determined from nonsense words, by segmenting the desired grouping of phones at stable spectral regions. In a concatenation based synthesis, the conversation of the transition between two adjacent phones is crucial to assure the quality of the synthesized speech. With the choice of polyphones as the basic subunits, the transition between two adjacent phones is preserved in the recorded subunits, and the concatenation is carried out between similar phones.
Before the synthesis, however, the phones must have their duration and pitch modified in order to fulfill the prosodic constraints of the new words containing those phones. This processing is necessary to avoid the production of a monotonous sounding synthesized speech. In a TTS system, this function is performed by a prosodic module. To allow the duration and pitch modifications in the recorded subunits, many concatenation based TTS systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) (E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, pp. 453467, 1990) model of synthesis.
In the TD-PSOLA model, the speech signal is first submitted to a pitch marking algorithm. This algorithm assigns marks at the peaks of the signal in the voiced segments and assigns marks 10 ms apart in the unvoiced segments. The synthesis is made by a superposition of Hanning windowed segments centered at the pitch marks and extending from the previous pitch mark to the next one. The duration modification is provided by deleting or replicating some of the windowed segments. The pitch period modification, on the other hand, if provided by increasing or decreasing the superposition between windowed segments.
Despite the success achieved in many commercial TTS systems, the synthetic speech produced by using the TD-PSOLA model of synthesis can present some drawbacks, mainly under large prosodic variations, outlined as follows.                1. The pitch modifications introduce a duration modification that needs to be appropriately compensated.        2. The duration modification can only be implemented in a quantized manner, with a one pitch period resolution (α= . . . ,1/2,2/3,3/4, . . . ,4/3,3/2,2/1, . . . ).        3. When performing a duration enlargement in unvoiced portions, the repetition of the segments can introduce “metallic” artifacts (metallic-like sounding of the synthesized speech).        
In IEEE transactions on speech and audio processing, vol. 6, No. 5, September 1998, “A Hybrid Model for Text-to-Speech Synthesis”, Fábio Violaro and Olivier Böeffard, a hybrid model for concatenation-based, text-to-speech synthesis is described.
The speech signal is submitted to a pitch-synchronous analysis and decomposed into a harmonic component, with a variable maximum frequency, plus a noise component. The harmonic component is modeled as a sum of sinusoids with frequencies multiple of the pitch. The noise component is modeled as a random excitation applied to an LPC filter. In unvoiced segments, the harmonic component is made equal to zero. In the presence of pitch modifications, a new set of harmonic parameters is evaluated by resampling the spectrum envelope at the new harmonic frequencies. For the synthesis of the harmonic component in the presence of duration and/or pitch modifications, a phase correction is introduced into the harmonic parameters.
A variety of other so called “overlap and add” methods are known from the prior art, such as PIOLA (Pitch Inflected OverLap and Add) [P. Meyer, H. W. Rüh, R. Krüger, M. Kugler L. L. M. Vogten, A. Dirksen, and K. Belhoula. PHRITTS: A text-to-speech synthesizer for the German language. In Eurospeech '93, pages 877-890, Berlin, 1993], or PICOLA (Pointer Interval Controlled OverLap and Add) [Morita: “A study on speech expansion and contraction on time axis”, Master thesis, Nagoya University (1987), in Japanese.] These methods differ from each other in the way they mark the pitch period locations.
None of these methods give satisfactory results when applied as a mixer for two different waveforms. The problem is phase mismatches. The phases of harmonics are affected by the recording equipment, room acoustics, distance to the microphone, vowel color, co-articulation effects etc. Some of these factors can be kept unchanged like the recording environment but others like the co-articulation effects are very difficult (if not, impossible) to control. The result is that when pitch period locations are marked without taken into account the phase information, the synthesis quality will suffer from phase mismatches.
Other methods like MBR-PSOLA (Multi Band Resynthesis Pitch Synchronous OverLap Add) [T. Dutoit and H. Leich. MBR-PSOLA: Text-to-speech synthesis based on an MBE re-synthesis of the segments database. Speech Communication, 1993] regenerate the phase information to avoid phase mismatches. But this involves an extra analysis-synthesis operation that reduces the naturalness of the generated speech. The synthesis often sounds mechanic.
U.S. Pat. No. 5,787,398 shows an apparatus for synthesizing speech by varying pitch. One of the disadvantages of this approach is that since the pitch marks are centered on the excitation peaks and the measured excitation peak does not necessarily have synchronous phase, phase distortion results.
The pitch of synthesized speech signals is varied by separating the speech signals into a spectral component and an excitation component. The latter is multiplied by a series of overlapping window functions synchronous, in the case of voiced speech, with pitch timing mark information corresponding at least approximately to instants of vocal excitation, to separate it into windowed speech segments which are added together again after the application of a controllable time-shift. The spectral and excitation components are then recombined. The multiplication employs at least two windows per pitch period, each having a duration of less than one pitch period.
U.S. Pat. No. 5,081,681 shows a class of methods and related technology for determining the phase of each harmonic from the fundamental frequency of voiced speech.
Applications include speech coding, speech enhancement, and time scale modification of speech. The basic approach is to include recreating phase signals from fundamental frequency and voiced/unvoiced information, and adding a random component to the recreated phase signal to improve the quality of the synthesized speech.
U.S. Pat. No. 5,081,681 describes a method for phase synthesis for speech processing. Since the phase is synthetic the result of the synthesis does not sound natural as many aspects of the human voice and the acoustics of the surround are ignored by the synthesis.