Speech segment concatenation is often used as part of speech generation and modification algorithms. For example, many Text-To-Speech (TTS) applications concatenate pre-stored speech segments in order to produce synthesized speech. Also, some Time Scale Modification (TSM) systems fragment input speech into small segments and rejoin the segments after repositioning. Junctions between speech segments are a possible source of degradation in speech quality. Thus, signal discontinuities at each junction should be minimized.
Speech segments can be concatenated in the time domain, the frequency domain, or the time-frequency domain. The present invention concerns time-domain concatenation (TDC) of digital speech waveforms. High-quality joining of digital speech waveforms is important in a variety of acoustic processing applications, including concatenative text-to-speech (TTS) systems such as the one described in U.S. patent application Ser. No. 09/438,603 by G. Coorman et al.; broadcast message generation as described, for example, in L. F. Lamel, J. L. Gauvain, B. Prouts, C. Bouhier & R. Boesch, “Generation and Synthesis of Broadcast Messages,” Proc. ESCA-NATO Workshop on Applications of Speech Technology, Lautrach, Germany, September 1993; carrier-slot applications as described, for example, in U.S. Pat. No. 6,052,664 by S. Leys, B. Van Coile and S. Willems; and Time-Scale Modification (TSM) as described, for example, in U.S. patent application Ser. No. 09/776,018, G. Coorman, P. Rutten, J. De Moortel and B. Van Coile, “Time Scale Modification of Digitally Sampled Waveforms in the Time Domain,” filed Feb. 2, 2001; all of which are hereby incorporated herein by reference.
TDC avoids computationally expensive transformations to and from other domains, and has the further advantage of preserving intrinsic segmental information in the waveform. As a consequence, for longer speech segments, the natural prosodic information (including the micro-prosody, one of the key factors for highly natural sounding speech) is transferred to the synthesized speech. One major concern of TDC is to avoid audible waveform irregularities, such as discontinuities and transients, that may occur in the neighborhood of the join. These are commonly referred to as “concatenation artifacts”.
To avoid concatenation artifacts, two speech segments can be joined together by fading-out the trailing edge of the left segment and fading-in the leading edge of the right segment before overlapping and adding them. In other words, smooth concatenation is done by means of weighted overlap-and-add, a technique that is well known in the art of digital speech processing. Such a method has been disclosed in U.S. Pat. No. 5,490,234 by Narayan, incorporated herein by reference.
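The weighted overlap-and-add join described above can be illustrated with a minimal sketch. The function name `crossfade_join`, the linear fade shape, and the `overlap` parameter are assumptions made for illustration only; they are not taken from the cited patent, and other fade shapes (for example, raised-cosine) are equally possible.

```python
def crossfade_join(left, right, overlap):
    """Join two sample sequences with a linear weighted overlap-and-add.

    Hypothetical helper: the last `overlap` samples of `left` are faded out
    while the first `overlap` samples of `right` are faded in, and the two
    faded regions are summed.
    """
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be in 1..min(len(left), len(right))")

    start = len(left) - overlap
    head = list(left[:start])        # part of the left segment left untouched
    tail = list(right[overlap:])     # part of the right segment left untouched

    blended = []
    for i in range(overlap):
        w_in = (i + 1) / (overlap + 1)   # fade-in weight, rises toward 1
        w_out = 1.0 - w_in               # fade-out weight; the pair sums to 1
        blended.append(w_out * left[start + i] + w_in * right[i])

    return head + blended + tail
```

Because the two weights sum to one at every sample, a signal that is identical in both segments over the overlap region passes through the join unchanged; the output length is `len(left) + len(right) - overlap`.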
Weighted overlap-and-add yields a smooth join only when the two waveforms are properly aligned (synchronized) in the overlap region. Thus, rapid and efficient synchronization of waveforms helps achieve real-time, high-quality TDC. The length of the speech segments involved depends on the application. Small speech segments (e.g., speech frames) are typically used in time-scale modification applications, longer segments such as diphones are used in text-to-speech applications, and even longer segments can be used in domain-specific applications such as carrier-slot applications.
Some known waveform synchronization techniques address waveform similarity as described in W. Verhelst & M. Roelands, “An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech,” ICASSP-93. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 554–557, Vol. 2, 1993; incorporated herein by reference. In the following, waveform synchronization methods used in TDC that make use of the waveform shape will be described. This type of synchronization minimizes waveform discontinuities in voiced speech that could emerge when joining two speech waveform segments.
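By way of illustration, synchronization based on waveform similarity can be sketched as a search for the shift of the right segment that best matches the end of the left segment. The sketch below uses normalized cross-correlation as the similarity measure; this measure, the function name `best_offset`, and the `max_shift` tolerance are assumptions chosen for the example and are not presented as the exact procedure of the cited WSOLA paper.

```python
import math


def best_offset(left_tail, right, max_shift):
    """Return the shift in 0..max_shift at which `right` best matches `left_tail`.

    Hypothetical helper: similarity is measured by the normalized
    cross-correlation between `left_tail` and the equally long window of
    `right` that starts at the candidate shift.
    """
    n = len(left_tail)
    if n == 0 or len(right) < n + max_shift:
        raise ValueError("right must hold at least len(left_tail) + max_shift samples")

    left_energy = sum(a * a for a in left_tail)
    best_shift, best_score = 0, -math.inf
    for shift in range(max_shift + 1):
        window = right[shift:shift + n]
        dot = sum(a * b for a, b in zip(left_tail, window))
        norm = math.sqrt(left_energy * sum(b * b for b in window))
        score = dot / norm if norm > 0 else 0.0   # silent windows score zero
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift
```

The returned shift indicates how many leading samples of the right segment may be skipped before overlapping and adding, so that the blended waveforms are in phase. The exhaustive search costs on the order of `len(left_tail) * max_shift` multiplications, which is one reason faster feature-based synchronization methods are of interest.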
A common method of synthesizing speech in text-to-speech (TTS) systems is by combining digital speech waveform segments extracted from recorded speech that are stored in a database. These segments are often referred to in speech processing literature as “speech units”. A speech unit used in a text-to-speech synthesizer consists of a sequence of samples, or of parameters that can be converted to waveform samples, taken from a continuous chunk of sampled speech, together with some accompanying feature vectors (containing information such as prominence level, phonetic context, pitch, etc.) that guide, for example, the speech unit selection process. Some common and well described representations of speech units used in concatenative TTS systems are frames as described in R. Hoory & D. Chazan, “Speech synthesis for a specific speaker based on labeled speech database”, 12th International Conference On Pattern Recognition 1994, Vol. 3, pp. 146–148, phones as described in A. W. Black, N. Campbell, “Optimizing selection of units from speech databases for concatenative synthesis,” Proc. Eurospeech '95, Madrid, pp. 581–584, 1995, diphones as described in P. Rutten, G. Coorman, J. Fackrell & B. Van Coile, “Issues in Corpus-based Speech Synthesis”, Proc. IEE symposium on state-of-the-art in Speech Synthesis, Savoy Place, London, April 2000, demi-phones as described in M. Balestri, A. Pacchiotti, S. Quazza, P. L. Salza, S. Sandri, “Choose the best to modify the least: a new generation concatenative synthesis system,” Proc. Eurospeech '99, Budapest, pp. 2291–2294, September 1999 and longer segments such as syllables, words and phrases as described in E. Klabbers, “High-quality speech output generation through advanced phrase concatenation”, Proc. of the COST Workshop on Speech Technology in the Public Telephone Network: Where are we today?, Rhodes, Greece, pages 85–88, 1997, all of which are incorporated herein by reference.
A well known speech synthesis method that implicitly uses waveform concatenation is described in a paper by E. Moulines and F. Charpentier “Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones”, Speech Communication, Vol. 9, No. 5/6, December 1990, pages 453–467, incorporated herein by reference. That paper describes a technique known as TD-PSOLA (Time-Domain Pitch-Synchronous Over-Lap and Add) that is used for prosody manipulation of the speech waveform and concatenation of speech waveform segments. A TD-PSOLA synthesizer concatenates windowed speech segments centered on the instant of glottal closure (GCI) that have a typical duration of two pitch periods. Several techniques have been used to calculate the GCI. Amongst others:
    B. Yegnanarayana and R. N. J. Veldhuis, “Extraction Of Vocal-Tract System Characteristics From Speech Signals”, IEEE Transactions on Speech and Audio Processing, Vol. 6, pp. 313–327, 1998;
    C. Ma, Y. Kamp & L. Willems, “A Frobenius Norm Approach To Glottal Closure Detection From The Speech Signal”, IEEE Transactions on Speech and Audio Processing, 1994;
    S. Kadambe and G. F. Boudreaux-Bartels, “Application Of The Wavelet Transform For Pitch Detection Of Speech Signals”, IEEE Transactions on Information Theory, vol. 38, no 2, pp. 917–924, 1992;
    R. Di Francesco & E. Moulines, “Detection Of The Glottal Closure By Jumps In The Statistical Properties Of The Signal”, Proc. of Eurospeech '89, Paris, vol. 2, pp. 39–41, 1989;
all incorporated herein by reference.
In PSOLA synthesis, diphone concatenation is performed by means of overlap-and-add (i.e., waveform blending). The synchronization is based on a single feature, namely the instant of glottal closure (pitch markers, GCI). The PSOLA method is fast and lends itself to off-line calculation of the pitch markers, leading to very fast synchronization. A disadvantage of this technique is that phase differences between segment boundaries may cause waveform discontinuities and thus may lead to audible clicks. A technique that aims to avoid such problems is the MBROLA synthesis method described in T. Dutoit & H. Leich, “MBR-PSOLA: Text-to-Speech Synthesis Based on an MBE Re-Synthesis of the Segments Database”, Speech Communication, Vol. 13, pages 435–440, incorporated herein by reference. The MBROLA technique pre-processes the segments of the inventory by equalizing the pitch period over the complete segment database and by resetting the low frequency phase components to a pre-defined value. This technique facilitates spectral interpolation. MBROLA has the same computational efficiency as PSOLA and its concatenation is smoother. However, MBROLA makes the synthesized speech more metallic sounding because of the pitch-synchronous phase resets.
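The pitch-marker-based synchronization used in PSOLA-style concatenation can be illustrated with the following minimal sketch, which assumes that pitch marks have already been computed off-line and are supplied as lists of sample indices. The function name `join_at_pitch_marks`, the `half` parameter, and the linear blending weights are assumptions for illustration; the sketch performs only the synchronized join and omits the windowing and prosody manipulation of a full TD-PSOLA synthesizer.

```python
def join_at_pitch_marks(left, left_marks, right, right_marks, half):
    """Overlap-add two segments so that one pitch mark of each coincides.

    Hypothetical helper: `left_marks` and `right_marks` are precomputed
    sample indices of pitch marks; `half` is the number of samples blended
    on each side of the aligned marks.
    """
    # Last usable mark of the left segment and first usable mark of the
    # right segment, each with `half` samples available on both sides.
    # (max/min raise ValueError if no mark satisfies the constraint.)
    m_left = max(m for m in left_marks if half <= m <= len(left) - half)
    m_right = min(m for m in right_marks if half <= m <= len(right) - half)

    n = 2 * half
    out = list(left[:m_left - half])
    for i in range(n):
        w = (i + 0.5) / n   # fade-in weight for the right segment
        out.append((1.0 - w) * left[m_left - half + i]
                   + w * right[m_right - half + i])
    out.extend(right[m_right + half:])
    return out
```

Since the marks are known in advance, the synchronization itself reduces to an index lookup, which reflects the speed advantage noted above. The sketch does nothing to correct a phase mismatch between the two segments around the marks, which is the source of the discontinuities mentioned above.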
In the field of corpus-based synthesis another efficient segment concatenation method has been proposed recently in Y. Stylianou, “Synchronization of Speech Frames Based on Phase Data with Application to Concatenative Speech Synthesis,” Proceedings of 6th European Conference on Speech Communication and Technology, Sep. 5–9, 1999, Budapest, Hungary, Vol. 5, pp. 2343–2346, incorporated herein by reference. Stylianou's method is based on the calculation of the center of gravity. This method is somewhat similar to the epoch estimation method used for TD-PSOLA synthesis but is more robust since it does not rely on an accurate pitch estimate.
Another efficient waveform synchronization technique, described in S. Yim & B. I. Pawate, “Computationally Efficient Algorithm for Time Scale Modification (GLS-TSM)”, IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, pp. 1009–1012 Vol. 2, 1996, incorporated herein by reference (see also U.S. Pat. No. 5,749,064), is based on a cascade of a global synchronization and a local synchronization that uses a vector of signal features.
In the method described in B. Lawlor & A. D. Fagan, “A Novel High Quality Efficient Algorithm for Time-Scale Modification of Speech,” Proceedings of Eurospeech conference, Budapest, Vol. 6, pp. 2785–2788, 1999, incorporated herein by reference, the largest peaks or troughs are used as a synchronization criterion.
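A synchronization criterion based on the largest peak or trough can be sketched as follows. The function name `peak_sync_lag`, the `region` parameter, and the rule of matching a peak with a peak and a trough with a trough are assumptions made for this example; the sketch is not presented as the exact procedure of the cited paper.

```python
def peak_sync_lag(left, right, region):
    """Lag between the dominant extremum at the end of `left` and the
    matching extremum at the start of `right`.

    Hypothetical helper: the extremum of largest magnitude is located in the
    last `region` samples of `left`; an extremum of the same sign (peak or
    trough) is then located in the first `region` samples of `right`.
    """
    if region <= 0 or region > min(len(left), len(right)):
        raise ValueError("region must be in 1..min(len(left), len(right))")

    tail = left[len(left) - region:]
    head = right[:region]

    # Position of the largest-magnitude sample in the left search region.
    p_left = max(range(region), key=lambda i: abs(tail[i]))
    # Match like with like: a peak if that sample is positive, else a trough.
    sign = 1.0 if tail[p_left] >= 0 else -1.0
    p_right = max(range(region), key=lambda i: sign * head[i])

    return p_right - p_left
```

A positive result means that the extremum occurs later in the right search region than in the left one, so the right segment would be advanced by that many samples before overlapping; a negative result calls for a delay instead. Only comparisons are needed, which makes this criterion cheaper than a correlation-based search.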