Many current text-to-speech conversion systems are based on the concatenation of acoustic units taken from prerecorded speech. This approach made possible the leap in quality needed to use text-to-speech converters in many commercial applications (mainly the generation of spoken information from text in voice-accessed interactive voice response systems).
Although the concatenation of acoustic units avoids the difficult problem of completely modeling the production of human speech, it has to handle another basic problem: how to concatenate pieces of speech taken from different source files, which may differ considerably at the concatenation points.
The possible causes of discontinuity and defects in the synthetic speech are of various types:
    1. Differences in the spectral characteristics of the signal at the concatenation points: frequencies and bandwidths of the formants, shape and amplitude of the spectral envelope.
    2. Loss of phase coherence between the speech frames which are concatenated. This can also be seen as inconsistent relative shifts of the position of the speech frames (windows) on both sides of a concatenation point. The concatenation of incoherent frames causes a disintegration or dispersion of the waveform which is perceived as a significant loss of quality. The resulting speech is unnatural: mixed and confused.
    3. Prosodic differences (intonation and duration) between the prerecorded units and the target (desired) prosody for the synthesis of an utterance.
For this reason, text-to-speech converters normally apply various speech signal processing techniques which, after the units are concatenated, join them smoothly at the concatenation points and modify their prosody so that it is continuous and natural. All of this must be done while degrading the original signal as little as possible.
The most traditional text-to-speech conversion systems had a relatively small inventory of units (for example, diphones or demisyllables), in which there was normally only one candidate for each of the possible combinations of sounds considered. In these systems the need to modify the units is very high.
The most recent text-to-speech conversion systems are based on selecting units from a much wider inventory (corpus-based synthesis). This wide inventory contains many alternatives for the different combinations of sounds, which differ in their phonetic context, prosody, and position within the word and the utterance. The optimal selection of those units according to a minimum cost criterion (unit and concatenation costs) reduces the need to modify the units, and greatly improves the quality and naturalness of the resulting synthetic speech. But it is not possible to completely eliminate the need to process prerecorded units, because speech corpora are finite and cannot guarantee complete coverage for naturally synthesizing any utterance, and there will always be concatenation points.
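The minimum cost selection mentioned above can be illustrated with a short dynamic programming sketch. This is only an illustrative outline, not the procedure of any particular system: the function names `select_units`, `target_cost` and `join_cost`, the representation of a unit as a (pitch, source file) pair, and the cost values are all assumptions made for the example.

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position minimizing unit cost plus concatenation cost.

    candidates[i] is the list of candidate units for target position i.
    """
    # best[i][j] = (cumulative cost, index of best predecessor) for candidate j
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min] + target_cost(i, u), k_min))
        best.append(row)
    # Backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical toy data: each unit is (pitch in Hz, source file id)
targets = [100, 110, 120]
candidates = [
    [(100, "a"), (130, "b")],
    [(112, "a"), (110, "c")],
    [(120, "c"), (121, "a")],
]
# Assumed costs: pitch distance to the target, and a fixed penalty of 5
# whenever two consecutive units come from different source files
path, total = select_units(
    candidates,
    target_cost=lambda i, u: abs(u[0] - targets[i]),
    join_cost=lambda p, u: 0 if p[1] == u[1] else 5,
)
```

In this toy case the search prefers three slightly off-target units from the same source file over exact matches that would require a join between different files, which is the trade-off the concatenation cost is meant to capture.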
There are different methods for speech signal representation and modification which have been used within text-to-speech converters.
The methods based on the overlap and add of speech signal windows in the time domain (PSOLA, “Pitch Synchronous Overlap and Add”, methods) are well accepted and widespread. The most classic of these methods is described in “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones” (E. Moulines and F. Charpentier, Speech Communication, vol. 9, pp. 453-467, December 1990). Speech signal frames (windows) are obtained in a manner synchronous with the fundamental period (pitch). The analysis windows must be centered on the glottal closure instants (GCIs) or other identifiable points within each period of the signal, which must be carefully found and coherently labeled, to prevent phase mismatches at the concatenation points. The marking of these points is a laborious task which cannot be performed in a completely automatic manner (it requires adjustments), and the good operation of the system depends on it. The modification of duration and fundamental frequency (F0) is performed by inserting or deleting frames and by widening or narrowing the spacing between them (each synthesis frame corresponds to a period of the signal, and the shift between two successive frames is the inverse of the fundamental frequency). Since PSOLA methods do not include an explicit speech signal model, it is difficult to interpolate the spectral characteristics of the signal at the concatenation points.
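The frame repetition and re-spacing described above can be sketched as follows. This is a simplified illustration rather than the method of the cited paper: it assumes a constant pitch period, pitch marks that are already known, and a Hann window two periods long; the names `hann` and `psola_change_pitch` and the toy pulse train are assumptions made for the example.

```python
import math


def hann(n):
    """Symmetric Hann window of n samples (n > 1)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def psola_change_pitch(signal, marks, period, new_period):
    """Overlap-add two-period frames centered on `marks` at a new spacing."""
    win = hann(2 * period + 1)
    out = [0.0] * len(signal)
    norm = [0.0] * len(signal)  # accumulated window weight per sample
    synth_marks = []
    t = marks[0]
    while t <= marks[-1]:
        # Use the analysis frame whose mark is closest in time, so frames
        # are repeated (or skipped) as the spacing changes
        src = min(marks, key=lambda m: abs(m - t))
        for k in range(-period, period + 1):
            if 0 <= src + k < len(signal) and 0 <= t + k < len(out):
                out[t + k] += win[k + period] * signal[src + k]
                norm[t + k] += win[k + period]
        synth_marks.append(t)
        t += new_period
    # Compensate for the varying amount of window overlap
    out = [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]
    return out, synth_marks


# Toy input: a pulse train with a period of 80 samples, marks on the pulses
period = 80
signal = [1.0 if n > 0 and n % period == 0 else 0.0 for n in range(800)]
marks = list(range(period, 800, period))
# Raise the pitch by placing the frames 64 samples apart instead of 80
out, synth_marks = psola_change_pitch(signal, marks, period, 64)
```

Because the duration is kept while the spacing shrinks, some analysis frames are used more than once. If the marks were placed inconsistently within the period, the overlapped frames would no longer add coherently, which is the phase mismatch problem noted in the text.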
The MBROLA (Multi-Band Resynthesis Overlap and Add) method described in “Text-to-Speech Synthesis based on a MBE re-synthesis of the segments database” (T. Dutoit and H. Leich, Speech Communication, vol. 13, pp. 435-440, 1993) deals with the problem of the lack of phase coherence in the concatenations by synthesizing a modified version of the voiced parts of the speech database, forcing them to have a determined F0 and phase (identical in all the cases). But this process affects the naturalness of the speech.
LPC (Linear Predictive Coding) type methods have also been proposed to perform speech synthesis, such as the one described in “An approach to Text-to-Speech synthesis” (R. Sproat and J. Olive, Speech Coding and Synthesis, pp. 611-633, Elsevier, 1995). These methods limit the quality of the speech since they involve an all-pole model. The result depends greatly on how well the original reference speech fits the assumptions of the model. This usually gives rise to problems, especially with female or child voices.
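As an illustration of what fitting an all-pole model involves, the following sketch estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion. It is a generic textbook formulation, not taken from the cited work; the function name `lpc` and the test signal are assumptions made for the example.

```python
def lpc(signal, order):
    """Estimate all-pole coefficients a[0..order] (with a[0] = 1).

    The model is A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, so the
    predicted sample is -(a[1] x[n-1] + ... + a[order] x[n-order]).
    """
    n = len(signal)
    # Autocorrelation for lags 0..order
    r = [sum(signal[i] * signal[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]  # prediction error energy
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= (1.0 - k * k)
    return a, err


# Toy signal that fits the model exactly: x[n] = 0.9 * x[n-1]
signal = [0.9 ** n for n in range(200)]
coeffs, residual_energy = lpc(signal, 2)
```

For this signal the recursion recovers a first coefficient close to -0.9 and a second close to zero. Real speech fits the all-pole assumption only approximately, which is where the quality limitation mentioned above comes from.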
Sinusoidal type models have also been proposed, in which the speech signal is represented by means of a sum of sinusoidal components. The parameters of the sinusoidal models allow performing, in quite a direct and independent manner, both the interpolation of parameters and the prosodic modifications. To assure phase coherence at the concatenation points, some models have chosen to use an estimator of the glottal closure instants (a process which does not always provide good results), as for example in “Speech Synthesis based on Sinusoidal Modeling” (M. W. Macon, PhD Thesis, Georgia Institute of Technology, October 1996). In other cases, the simplifying hypothesis of minimum phase has been assumed (which affects the naturalness of the speech in some cases, making it be perceived as more hollow and damped), as in a work published by some of the inventors of this proposal: “On the Use of a Sinusoidal Model for Speech Synthesis in Text-to-Speech” (M. Á. Rodríguez, P. Sanz, L. Monzón and J. G. Escalada, Progress in Speech Synthesis, pp. 57-70, Springer, 1996).
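The following sketch illustrates why such parameters are convenient to manipulate. It assumes a purely harmonic frame described by amplitudes, phases and F0, which is a simplification of the models cited; the names `synthesize` and `interpolate_frames` and the dictionary layout of a frame are assumptions made for the example.

```python
import math


def synthesize(amps, phases, f0, fs, n_samples):
    """Sum of harmonics: component k has frequency (k + 1) * f0."""
    out = []
    for n in range(n_samples):
        out.append(sum(
            a * math.cos(2 * math.pi * (k + 1) * f0 * n / fs + p)
            for k, (a, p) in enumerate(zip(amps, phases))))
    return out


def interpolate_frames(frame_a, frame_b, w):
    """Blend amplitudes and F0 of two frames (w = 0 gives frame_a).

    Phases are left out on purpose: blending them is only meaningful
    once the frames are phase coherent.
    """
    amps = [(1 - w) * a + w * b
            for a, b in zip(frame_a["amps"], frame_b["amps"])]
    f0 = (1 - w) * frame_a["f0"] + w * frame_b["f0"]
    return {"amps": amps, "f0": f0}


# Two hypothetical frames on either side of a concatenation point
left = {"amps": [1.0, 0.5], "f0": 100.0}
right = {"amps": [0.5, 0.5], "f0": 120.0}
mid = interpolate_frames(left, right, 0.5)

# Prosodic modification: the same amplitudes rendered at a different F0
original = synthesize([1.0], [0.0], 100.0, 8000.0, 81)
raised = synthesize([1.0], [0.0], 200.0, 8000.0, 81)
```

Reusing the same amplitudes at a new F0 is a deliberate simplification; a fuller treatment would resample the spectral envelope at the new harmonic frequencies.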
Sinusoidal models have gradually incorporated different approaches for solving the problem of phase coherence. “Removing Linear Phase Mismatches in Concatenative Speech Synthesis” (Y. Stylianou, IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 232-239, March 2001) proposes a method for analyzing speech with windows which shift according to the F0 of the signal, but without the need for them to be centered on the GCIs. Those frames are later synchronized at a common point based on the information of the phase spectrum of the signal, without affecting the quality of the speech. The method applies the property of the Fourier Transform whereby adding a linear component to the phase spectrum is equivalent to shifting the waveform in the time domain. The first harmonic of the signal is forced to have a resulting phase of 0, and the result is that all the speech windows are coherently centered with respect to the waveform, regardless of the specific point of a period of the signal on which each was originally centered. The corrected frames can thus be coherently combined in the synthesis.
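The effect of forcing the first harmonic to zero phase can be sketched under the assumption of a purely harmonic frame, where a time shift adds to harmonic k a phase proportional to k. The names `wrap` and `remove_linear_phase` and the phase values are assumptions made for the example, and the sketch follows the description above rather than the details of the cited paper.

```python
import math


def wrap(p):
    """Reduce a phase to its principal value in [-pi, pi]."""
    return math.atan2(math.sin(p), math.cos(p))


def remove_linear_phase(phases):
    """phases[k - 1] is the phase of harmonic k.

    Subtracting k times the phase of the first harmonic is equivalent to
    shifting the frame in time until the first harmonic has phase 0.
    """
    phi1 = phases[0]
    return [wrap(p - (k + 1) * phi1) for k, p in enumerate(phases)]


# The same hypothetical waveform analyzed at two window positions
w0 = 2 * math.pi / 80          # fundamental frequency in radians per sample
base = [0.4, 1.0, -2.0]        # harmonic phases seen from the first window
shift = 13                     # second window placed 13 samples later
shifted = [p + (k + 1) * w0 * shift for k, p in enumerate(base)]

aligned_a = remove_linear_phase(base)
aligned_b = remove_linear_phase(shifted)
```

After the correction both frames carry the same harmonic phases, so the window position within the period no longer matters and the frames can be overlapped without prior labeling of the glottal closure instants.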
For the extraction of parameters, analysis-by-synthesis processes are performed such as those set forth in “An Analysis-by-Synthesis Approach to Sinusoidal Modelling Applied to Speech and Music Signal Processing” (E. Bryan George, PhD Thesis, Georgia Institute of Technology, November 1991) or in “Speech Analysis/Synthesis and Modification Using an Analysis-by-Synthesis/Overlap-Add Sinusoidal Model” (E. Bryan George, Mark J. T. Smith, IEEE Transactions on Speech and Audio Processing, vol. 5, no. 5, pp. 389-406, September 1997).
In summary, the most usual technical problems faced by text-to-speech conversion systems based on the concatenation of units are derived from the lack of phase coherence at the concatenation points between units.