This invention relates to speech synthesis.
In the context of speech synthesis that is based on Concatenation of acoustic units, speech signals may be encoded by speech models. These models are required if one wishes to ensure that the concatenation of selected acoustic units results in a smooth transition from one acoustic unit to the next. Discontinuities in the prosody (e.g., pitch period, energy), in the formant frequencies and in their bandwidths, and in phase (inter-frame incoherence) would result in unnatural-sounding speech.
In, “Time-Domain and Frequency-Domain Techniques for Prosodic Modifications of Speech,” chapter 15 in “Speech Coding and Synthesis,” edited by W. B. Kleijn and K. K. Paliwal, Elsevier Science, 1995 pp, 519–555, E. Moulines et al, describe an approach which they call Time-Domain Pitch Synchronous Overlap Add (TD-PSOLA) that allows time-scale and pitch-scale modifications of speech from the time domain signal. In analysis, pitch marks are synchronously set on the pitch onset times, to create preselected, synchronized, segments of speech. On synthesis, the preselected segments of speech are weighted by a windowing function and recombined with overlap-and-add operations. Time scaling is achieved by selectively repeating or deleting speech segments, while pitch scaling is achieved by stretching the length and output spacing of the speech segments.
A similar approach is described in U.S. Pat. No. 5,327,498, issued Jul. 5, 1994.
Because TD-PSOLA does not model the speech signal in any explicit way, it is referred to as “null” model. Although it is very easy to modify the prosody of acoustic units with TD-PSOLA, its non-parametric structure makes their concatenation a difficult task.
T. Dutoit et al, in “Text-to-Speech Synthesis Based on a MBE Re-synthesis of the Segments Database,” Speech Communication, vol. 13, pp. 435–440, 1993, tried to overcome concatenation problems in the time domain by re-synthesizing voiced parts of the speech database with constant phase and constant pitch. During synthesis, speech frames are linearly smoothed between pitch periods at unit boundaries.
Sinusoidal model approaches have also been proposed also for synthesis. These approaches perform concatenation by making use of an estimator of glottal closure instants. Alas, it is a process that is not always successful. In order to assure inter-frame coherence, a minimum phase hypothesis has been used sometimes.
LPC-based methods, such as impulse driven LPC and Residual Excited LP (RELP), have been also proposed for speech synthesis. In LPC-based methods, modifications of the LP residuals have to be coupled with appropriate modifications of the vocal tract filter. If the interaction of the excitation signal and the vocal tract filter is not taken into account, the modified speech signal is degraded. This interaction seems to play a more dominant role in speakers with high pitch (e.g., female and child voice). However, these kinds of interactions are not fully understood yet and, perhaps consequently, LPC-based methods do not produce good quality speech for female and child speakers. An improvement of the synthesis quality in the context of LPC can be achieved with careful modification of the residual signal, and such a method has been proposed by Edgington et al in “Overview of current text-to-speech Techniques: Part II—Prosody and Speech Generation,” Speech Technology for Telecommunications, Ch 7, pp. 181–210, Chapman and Hall, 1998. The technique is based on pitch-synchronous re-sampling of the residual signal during the glottal open phase (a phase of the glottal cycle which is perceptually less important) while the characteristics of the residual signal near the glottal closure instants are retained.
Most of the previously reported speech models and concatenation methods have been proposed in the context of diphone-based concatenative speech synthesis. Recently, an approach for synthesizing speech by concatenating non-uniform units selected from large speech databases has been proposed by numerous artisans. The aim of these proposals is to reduce errors in modeling of the speech signal and to reduce degradations from prosodic modifications using signal-processing techniques. One such proposal is presented by Campbell, in “CHATR: A High-Definition Speech Re-Sequencing System,” Proc. 3rd ASA/ASJ Joint Meeting, (Hawaii), pp. 1223–1228, 1996. He describes a system that uses the natural variation of the acoustic units from a large speech database to reproduce the desired prosodic characteristics in the synthesized speech. This requires, of course, a process for selecting the appropriate acoustic unit, but a variety of methods for optimum selection of units have been proposed. See, for instance, Hunt et al, “Unit Selection in a Concatenative Speech Synthesis System Using Larger Speech Database,” Proc. IEEE int. Conf. Acoust., Speech, Signal Processing, pp. 373–376, 1996, where a target cost and a concatenation cost is attributed in each candidate unit, where the target cost is the weighted sum of the differences between elements such as prosody and phonetic context of the target candidate units. The concatenation cost is also determined by the weighted sum of cepstral distances at the point of concatenation and the absolute differences in log power and pitch. The total cost for a sequence of units is the sum of the target and concatenation coats. The optimum unit selection is performed with a Viterbi search. Even though a large speech database is used, it is still possible that a unit (or a sequence of units) with a large cost has to be selected because a better unit (e.g., with prosody closer to the target values) is not present in the database. This results in a degradation of the output synthetic speech. Moreover, searching large speech databases can slow down the speech synthesis process.
An improvement of CHATR has been proposed by Campbell in “Processing a Speech Corpus for CHATR Synthesis,” Proc. of ICSP'97, pp. 183–186, 1997 by using sub-phonemic waveform labeling with syllabic indexing (reducing, thus, the size of the waveform inventory in the database). Still, a problem exists when prosodic variations need to be performed in order to achieve natural-sounding speech.