The invention described herein relates to a method for synthesizing audio sounds. To simplify the description, the focus is mainly on vocal sounds, keeping in mind that the invention can be applied to the field of music synthesis as well.
In the framework of the so-called "concatenative" synthesis techniques, which are increasingly used, synthetic speech is produced from a database of speech segments. Segments may be diphones, for example, which begin in the middle of the stationary part of a phone (the phone being the acoustic realization of a phoneme) and end in the middle of the stationary part of the next phone. French, for instance, is composed of 36 phonemes, which corresponds to approximately 1240 diphones (as a matter of fact, some combinations of phonemes are impossible). Other types of segments can be used, like triphones, polyphones, half-syllables, etc. Concatenative synthesis techniques produce any sequence of phonemes by concatenating the appropriate segments. The segments are themselves obtained from the segmentation of a speech corpus read by a human speaker.
Two problems must be solved during the concatenation process in order to get a speech signal comparable to human speech.
The first problem arises from the disparities of the phonemic contexts from which the segments were extracted, which generally results in some spectral envelope mismatch at both ends of the segments to be concatenated. As a result, a mere concatenation of segments leads to sharp transitions between units, and to less fluid speech.
The second problem is to control the prosody of synthetic speech, i.e. its rhythm (phoneme and pause lengths) and its fundamental frequency (the vibration frequency of the vocal folds). The point is that the segments recorded in the corpus have their own prosody that does not necessarily correspond to the prosody imposed at synthesis time.
Hence there is a need to find a means of controlling prosodic parameters and of producing smooth transitions between segments, without affecting the naturalness of speech segments.
One distinguishes two families of methods to solve such problems: the ones that implement a spectral model of the vocal tract, and the ones that modify the segment waveforms directly in the time domain.
In the first family of synthesis methods, transitions between concatenated segments are smoothed out by computing the difference between the spectral envelopes on both sides of the concatenation point, and propagating this difference in the spectral domain on both segments. The way these methods control the pitch and the duration of segments depends on the particular model used for spectral envelope estimation. All these methods require a high computational load at synthesis time, which prevents them from being implemented in real time on low-cost processors.
By contrast, the second family of synthesis methods aims to perform concatenation and prosody modification directly in the time domain, with very limited computational load. All of them take advantage of the so-called "Poisson's Sum Theorem", well known among signal processing specialists, which demonstrates that it is possible to build, from any finite waveform with a given spectral envelope, an infinite periodic waveform with the same spectral envelope and an arbitrarily chosen (and constant) pitch. This theorem can be applied to the modification of the fundamental frequency of speech signals. Provided the spectrum of the elementary waveforms is close to the spectral envelope of the signal one wishes to modify, pitch can be imposed by setting the shift between elementary waveforms to the targeted pitch period, and by adding the resulting overlapping waveforms. In this second family, synthesis methods mainly differ in the way they derive elementary waveforms from the pre-recorded segments. However, in order to produce high-quality synthetic speech, the overlapping elementary waveforms they use must have a duration of at least twice the fundamental period of the original segments. Two classes of techniques in this second family of synthesis methods will be described hereafter.
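The overlap-add principle described above can be illustrated with a minimal sketch. The function name `overlap_add`, the Hann-shaped elementary waveform, and the numeric values are assumptions chosen for illustration only; they are not taken from any of the cited methods.

```python
import math

def overlap_add(waveform, target_period, n_copies):
    """Sum n_copies of `waveform`, each shifted by target_period samples."""
    out = [0.0] * (target_period * (n_copies - 1) + len(waveform))
    for k in range(n_copies):
        start = k * target_period
        for i, sample in enumerate(waveform):
            out[start + i] += sample
    return out

# Hypothetical elementary waveform: a Hann-shaped burst, 16 samples long.
elementary = [0.5 - 0.5 * math.cos(2 * math.pi * i / 16) for i in range(16)]

# Imposing a pitch period of 10 samples: away from the edges, the result
# repeats every 10 samples, whatever the length of the elementary waveform.
signal = overlap_add(elementary, target_period=10, n_copies=5)
```

Changing `target_period` changes the pitch of the summed signal, while the shape of the elementary waveform, and hence its spectrum, is left untouched.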
The first class refers to methods hereafter referred to as `PSOLA` methods (Pitch Synchronous Overlap Add), characterized by the direct extraction of waveforms from continuous audio signals. The audio signals used are either identical to the original signals (the segments), or obtained after some transformation of these original signals. Elementary waveforms are extracted from the audio signals by multiplying the signals with finite-duration weighting windows positioned synchronously with the fundamental frequency of the original signal. Since the size of the elementary waveforms must be at least twice the original period, and given that there is one waveform for each period of the original signal, the same speech samples are used in several successive waveforms: the weighting windows overlap in the audio signals.
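As an illustration of the window extraction just described, the following sketch cuts one two-period waveform around each pitch mark. The Hann window, the constant period, the pitch mark positions, and the helper names `hann` and `extract_waveforms` are all assumptions made for the example, not details of any cited PSOLA method.

```python
import math

def hann(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / length) for i in range(length)]

def extract_waveforms(signal, pitch_marks, period):
    """Cut one two-period windowed waveform around each pitch mark."""
    window = hann(2 * period)
    waveforms = []
    for mark in pitch_marks:
        start = mark - period
        if start < 0 or start + 2 * period > len(signal):
            continue  # skip marks too close to the signal edges
        frame = signal[start:start + 2 * period]
        waveforms.append([s * w for s, w in zip(frame, window)])
    return waveforms

# Hypothetical input: a sine with a constant period of 8 samples,
# and one pitch mark per period.
period = 8
signal = [math.sin(2 * math.pi * n / period) for n in range(64)]
marks = list(range(period, 64 - period + 1, period))
waveforms = extract_waveforms(signal, marks, period)
```

Each sample away from the edges falls inside two successive windows, which is the overlap referred to above.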
Examples of such PSOLA methods are those defined in documents EP-0363233, U.S. Pat. No. 5,479,564, EP-0706170. A specific example is also the MBR-PSOLA method as published by T. Dutoit and H. Leich, in Speech Communication, Elsevier Publisher, Vol. 13, No. 3-4, November 1993. The method described in document U.S. Pat. No. 5,479,564 suggests a means of modifying the frequency of an audio signal with constant fundamental frequency by overlap-adding short-term signals extracted from this signal. The length of the weighting windows used to obtain the short-term signals is approximately equal to two times the period of the audio signal and their position within the period can be set to any value (provided the time shift between successive windows is equal to the period of the audio signal). Document U.S. Pat. No. 5,479,564 also describes a means of interpolating waveforms between segments to concatenate, so as to smooth out discontinuities. This is achieved by modifying the periods corresponding to the end of the first segment and to the beginning of the second segment, in such a way as to propagate the difference between the last period of the first segment and the first period of the second segment.
The second class of techniques, hereafter referred to as `analytic`, is based on a time-domain modification of elementary waveforms that do not share, even partially, their samples. The synthesis step still uses shifting and overlap-adding of the weighted waveforms carrying the spectral envelope information. These waveforms are no longer extracted from a continuous speech signal by means of overlapping weighting windows. Examples of these techniques are those defined in documents S. Vajma U.S. Pat. No. 5,369,730 and C. R. Lee, et al. BG 2261350 (also U.S. Pat. No. 5,617,507), as well as by T. Yazu, K. Yamada, "The speech synthesis system for an unlimited Japanese vocabulary", in proceedings IEEE ICASSP 1986, Tokyo, pp. 2019-2022.
In all these `analytic` techniques, the elementary waveforms are impulse responses of the vocal tract computed from evenly spaced speech signal frames, and resynthesized via a spectral model. The present invention falls in this class of methods, except that it uses different elementary waveforms, obtained by reharmonizing the envelope spectrum.
An advantage of analytic methods over PSOLA methods is that the waveforms they use result from a true spectral model of the vocal tract. Therefore, they can intrinsically model the instantaneous spectral envelope information with more accuracy and precision than PSOLA techniques, which simply weight a time-domain signal with a weighting window. Moreover, it is possible with analytic methods to separate the periodic (voiced) and aperiodic (unvoiced) components of each waveform, and modify their balance during the resynthesis step in order to modify the speech quality (soft, harsh, whispered, etc).
In practice, this advantage is counterbalanced by an increase in the size of the resynthesized segment database (typically by a factor of 2, since the successive waveforms do not share any samples while their duration still has to be equal to at least two times the pitch period of the audio signal). The method described by Messrs. Yazu and Yamada precisely aims at reducing the number of samples to be stored, by resynthesizing impulse responses in which the phases of the spectral envelope are set to zero. Only half of the waveform needs to be stored in this case, since phase zeroing results in perfectly symmetrical waveforms. The main drawback of this method is that it greatly affects the naturalness of the synthetic speech. It is well known, indeed, that producing significant phase distortion has a strong effect on speech quality.
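The reason why phase zeroing halves the storage can be seen in a small sketch. It assumes a waveform built as a sum of zero-phase cosines with hypothetical amplitudes; the function names and values are illustrative and do not reproduce the cited method.

```python
import math

def zero_phase_half(amplitudes, n_points, half_length):
    """Samples h[0..half_length] of h[n] = sum_k A_k cos(2*pi*k*n / n_points)."""
    return [
        sum(a * math.cos(2 * math.pi * k * n / n_points)
            for k, a in enumerate(amplitudes))
        for n in range(half_length + 1)
    ]

def mirror(half):
    """Rebuild h[-M..M] from the stored half h[0..M], using h[-n] == h[n]."""
    return half[:0:-1] + half

# Hypothetical spectral envelope samples for k = 0..3.
amplitudes = [1.0, 0.8, 0.5, 0.2]
half = zero_phase_half(amplitudes, n_points=32, half_length=16)
full = mirror(half)  # 33 samples rebuilt from the 17 stored ones
```

Because every cosine term is even in n, the waveform is symmetric around n = 0, so storing one side is sufficient. The price, as noted above, is the loss of the natural phase.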
Aim of the Invention
The present invention aims to provide a method for audio synthesis that avoids the drawbacks presented in the state of the art and that requires limited storage for the waveforms while avoiding significant distortions of the natural phase of acoustic signals.
Main Characteristic Elements of the Invention
The present invention relates to a method for audio synthesis from waveforms stored in a dictionary characterized by the following points:
the waveforms are infinite and perfectly periodic, and are stored as one of their periods, itself represented as a sequence of sound samples of a priori any length;
synthesis is carried out by overlapping and adding the waveforms multiplied by a weighting window whose length is approximately two times the period of the original waveform, and whose position relative to the waveform can be set to any fixed value;
all the periods stored in the segment database have the same length, which leads to a very efficient period to period differential coding scheme;
the use of a spectral model for spectral envelope estimation allows the separation of harmonic and stochastic components of the waveforms. When the energy of the stochastic component is low enough compared to that of the harmonic component, it may be completely eliminated, in which case only the harmonic component is resynthesized. This results in waveforms that are more pure, noiseless, and exhibit more regularity than the original signal, which additionally enhances the efficiency of ADPCM coding techniques;
very efficient coding techniques which account for the fact that:
ability to produce sound variants by interpolating between base and replacement segments. For each base segment, for instance, two additional periods are stored, corresponding to the beginning and end of the segment and taken from a replacement segment. This enables the synthesis of more natural-sounding voices.
The time shift between two successive weighted signals obtained by weighting the original waveforms is equal to the fundamental period requested for the synthetic signal, whose value is imposed. This value may be lower or greater than that of the original waveforms.
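The synthesis step characterized above can be sketched as follows. The Hann window, the helper names `windowed_periodic` and `synthesize`, and the sine-shaped stored period are assumptions made for the example; the text only requires a weighting window of about two periods and a shift equal to the requested fundamental period.

```python
import math

def windowed_periodic(period_samples, offset=0):
    """Repeat one stored period and weight two periods of it with a Hann window."""
    n = len(period_samples)
    length = 2 * n
    return [
        period_samples[(i + offset) % n]
        * (0.5 - 0.5 * math.cos(2 * math.pi * i / length))
        for i in range(length)
    ]

def synthesize(stored_periods, target_period):
    """Overlap-add one weighted waveform per stored period, shifted by target_period."""
    n = len(stored_periods[0])
    out = [0.0] * (target_period * (len(stored_periods) - 1) + 2 * n)
    for k, period_samples in enumerate(stored_periods):
        for i, s in enumerate(windowed_periodic(period_samples)):
            out[k * target_period + i] += s
    return out

# Hypothetical dictionary entry: six identical stored periods of 8 samples.
one_period = [math.sin(2 * math.pi * n / 8) for n in range(8)]
stored = [one_period] * 6

same_pitch = synthesize(stored, target_period=8)    # original period kept
higher_pitch = synthesize(stored, target_period=6)  # shorter period, higher pitch
```

With a shift equal to the stored period length, the interior of the output reproduces the periodic waveform; with a shorter or longer shift, the output repeats at the imposed period instead.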
The method according to the present invention basically differs from any other `analytic` method in that the elementary waveforms used are not full impulse responses of the vocal tract, but infinite periodic signals, multiplied by a weighting window to keep their length finite, and carrying the same spectral envelope as the original audio signals. A spectral model (hybrid harmonic/stochastic model, for instance, although the invention is not exclusively related to any particular spectral model) is used for resynthesis in order to get periodic waveforms (instead of the symmetric impulse responses of Messrs. Yazu and Yamada) carrying instantaneous spectral envelope information. Because of the periodicity of the elementary waveforms produced, only the first period needs to be stored. The sound quality obtained by this method is incomparably superior to that of the method of Messrs. Yazu and Yamada, since the computation of the periodic waveforms does not impose phase constraints on the spectral envelopes, thereby avoiding the related quality degradation.
The periods that need to be stored are obtained by spectral analysis of a dictionary of audio segments (e.g. diphones in the case of speech synthesis). Spectral analysis produces spectral envelope estimates throughout each segment. Harmonic phases and amplitudes are then computed from the spectral envelope and the target period (i.e. the spectral envelope is sampled with the targeted fundamental frequency).
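A minimal sketch of this resynthesis step is given below. The envelope function, the 16 kHz sampling rate, the number of harmonics, and the zero phases are purely illustrative assumptions; in practice the amplitudes and phases would come from whichever spectral model is used.

```python
import math

def sample_envelope(envelope, f0, n_harmonics):
    """Sample the spectral envelope at the harmonics k * f0, k = 1..n_harmonics."""
    return [envelope(k * f0) for k in range(1, n_harmonics + 1)]

def resynthesize_period(amplitudes, phases, period_length):
    """One period of sum_k A_k cos(2*pi*k*n / N + phi_k), k = 1..len(amplitudes)."""
    assert len(amplitudes) == len(phases)
    assert 2 * len(amplitudes) < period_length  # harmonics stay below Nyquist
    return [
        sum(a * math.cos(2 * math.pi * (k + 1) * n / period_length + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases)))
        for n in range(period_length)
    ]

# Hypothetical values: 16 kHz sampling, target period of 160 samples (100 Hz).
sample_rate = 16000
period_length = 160
f0 = sample_rate / period_length

def envelope(f):
    """Hypothetical smooth spectral envelope, for illustration only."""
    return 1.0 / (1.0 + (f / 1000.0) ** 2)

amplitudes = sample_envelope(envelope, f0, n_harmonics=40)
phases = [0.0] * len(amplitudes)  # placeholder phases, not a recommendation
period = resynthesize_period(amplitudes, phases, period_length)
```

The resulting `period` is exactly periodic with the chosen length, so repeating it indefinitely gives the infinite waveform of which only one period is stored.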
The length of each resynthesis period can advantageously be chosen equal for all the periods of all the segments. In this particular case, classical techniques for waveform compression (e.g. ADPCM) allow very high compression ratios (about 8) with very limited computational cost for decoding. The remarkable efficiency of such techniques on the waveforms obtained mainly originates from the fact that:
To further enhance the efficiency of coding techniques, the phases of the lower-order (i.e., lower frequency) harmonics of each stored period may be fixed (one phase value fixed for each harmonic of the database) for the resynthesis step. The frequency band where this setting is acceptable ranges from 0 to approximately 3 kHz. In this case, the resynthesis operation results in a sequence of periods with constant length, in which the time-domain differences between two successive periods are mainly due to spectral envelope differences. Since the spectral envelope of audio signals generally changes slowly with time, given the inertia of the physical mechanisms that produce them, the shape of the periods obtained in this way also evolves slowly. This, in turn, is particularly efficient when it comes to coding signals on the basis of period to period differences.
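The benefit of equal-length, slowly evolving periods for differential coding can be seen in a simplified sketch. It uses plain sample-by-sample differences between successive periods rather than an actual ADPCM coder, and the integer sample values are invented for the example.

```python
def encode_differences(periods):
    """Keep the first period and the sample-wise differences between successive periods."""
    first = list(periods[0])
    deltas = [
        [cur - prev for prev, cur in zip(p_prev, p_cur)]
        for p_prev, p_cur in zip(periods, periods[1:])
    ]
    return first, deltas

def decode_differences(first, deltas):
    """Rebuild all periods by accumulating the differences."""
    periods = [list(first)]
    for delta in deltas:
        periods.append([s + d for s, d in zip(periods[-1], delta)])
    return periods

# Hypothetical equal-length periods whose shape changes slowly.
periods = [
    [0, 100, 0, -100],
    [0, 102, 1, -99],
    [1, 105, 2, -97],
]
first, deltas = encode_differences(periods)
restored = decode_differences(first, deltas)
```

Here the differences span a much smaller range than the samples themselves, which is what allows them to be coded with fewer bits. This only works directly because all periods have the same length.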
Independently of its use for segment coding, the idea of imposing a set of fixed values for the phases of the lower frequency harmonics leads to the implementation of a temporal smoothing technique between successive segments, to attenuate spectral mismatch between periods. The temporal difference between the last period of the first segment and the first period of the second segment is computed, and smoothly propagated on both sides of the concatenation point with a weighting coefficient continuously varying from -0.5 to 0.5 (depending on which side of the concatenation point is processed).
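One possible reading of this smoothing step is sketched below. It assumes that the weight is +0.5 on the last period of the first segment and -0.5 on the first period of the second segment, decaying linearly to 0 away from the concatenation point; the `span` parameter (number of periods modified on each side) and the function name are hypothetical.

```python
def smooth_junction(left, right, span):
    """Spread the difference between left[-1] and right[0] over `span` periods per side."""
    diff = [r - l for l, r in zip(left[-1], right[0])]
    left_out = [list(p) for p in left]
    right_out = [list(p) for p in right]
    n_left = min(span, len(left))
    n_right = min(span, len(right))
    for j in range(n_left):
        # j = 0 is the period next to the junction: weight +0.5, decaying toward 0.
        w = 0.5 * (n_left - j) / n_left
        idx = len(left) - 1 - j
        left_out[idx] = [s + w * d for s, d in zip(left[idx], diff)]
    for j in range(n_right):
        # On the other side the weight starts at -0.5 and rises toward 0.
        w = -0.5 * (n_right - j) / n_right
        right_out[j] = [s + w * d for s, d in zip(right[j], diff)]
    return left_out, right_out

# Hypothetical segments of four 2-sample periods each.
left = [[0.0, 0.0] for _ in range(4)]
right = [[1.0, 2.0] for _ in range(4)]
left_s, right_s = smooth_junction(left, right, span=2)
```

Under these assumptions the two periods on either side of the concatenation point become identical (their average), and the correction fades out over the neighbouring periods. The operation is a plain sample-wise addition, which is meaningful because the periods have equal length and fixed low-frequency phases.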
It should be noted that, although the efficient coding properties and smoothing capabilities mentioned above were already available in the MBR-PSOLA technique as described in the state of the art, their effect is drastically reinforced in the present invention because, as opposed to the waveforms used by MBR-PSOLA, the periods used here do not share any of their samples, allowing a perfect separation between harmonically purified waveforms and waveforms that are mainly stochastic.
Finally, the present invention still makes it possible to increase the quality of the synthesized audio signal by associating, with each resynthesized segment (or `base segment`), a set of replacement segments similar but not identical to the base segment. Each replacement segment is processed in the same way as the corresponding base segment, and a sequence of periods is resynthesized. For each replacement segment, for instance, one can keep two periods corresponding respectively to the beginning and the end of the replacement segment at synthesis time. When two segments are about to be concatenated, it is then possible to modify the periods of the first base segment so as to propagate, on the last periods of this segment, the difference between the last period of the base segment and the last period of one of its replacement segments. Similarly, it is possible to modify the periods of the second base segment so as to propagate, on the first periods of this segment, the difference between the first period of the base segment and the first period of one of its replacement segments. The propagation of these differences is simply performed by multiplying the differences by a weighting coefficient continuously varying from 1 to 0 (from period to period) and adding the weighted differences to the periods of the base segments.
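The propagation toward a replacement segment can be sketched in the same spirit. The linear decay of the weight from 1 to 0, the `span` parameter, and the function name `morph_toward_variant` are assumptions made for the example.

```python
def morph_toward_variant(base_periods, variant_period, span, at_end):
    """Pull one end of a base segment toward a stored period of a replacement segment.

    at_end=True modifies the last periods, at_end=False the first ones.
    """
    anchor = base_periods[-1] if at_end else base_periods[0]
    diff = [v - b for b, v in zip(anchor, variant_period)]
    out = [list(p) for p in base_periods]
    n = min(span, len(out))
    for j in range(n):
        w = 1.0 - j / n  # 1 on the end period, decreasing toward 0 inside the segment
        idx = len(out) - 1 - j if at_end else j
        out[idx] = [s + w * d for s, d in zip(out[idx], diff)]
    return out

# Hypothetical base segment of four 2-sample periods, and the stored
# last period of one of its replacement segments.
base = [[0.0, 0.0] for _ in range(4)]
replacement_last = [2.0, 4.0]
variant = morph_toward_variant(base, replacement_last, span=4, at_end=True)
```

In this sketch the end period of the base segment takes exactly the shape of the replacement period, and the influence of the replacement fades progressively over the preceding periods.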
Such a modification of the time-domain periods of a base segment, so as to make it sound like one of its replacement segments, can be advantageously used to produce free variants of a base sound, thereby avoiding the monotony resulting from the repeated use of a base sound. It can also be put to use for the production of linguistically motivated sound variants (e.g., stressed/unstressed vowels, tense/soft voice, etc.).
The fundamental difference between the method described in the state of the art, which according to our classification is a `PSOLA` method, and the method of the present invention originates in the particular way of deriving the periods used. As opposed to the waveforms extracted from a continuous signal as proposed in the state of the art, the waveforms used in the present invention do not share any of their samples (hence, they do not overlap). It therefore benefits from the typical advantages of other analytic methods:
periods can be harmonically purified by completely eliminating their stochastic component;
when resynthesizing periods, the phase of low-frequency harmonics can be set constant (i.e., one fixed value for each harmonic throughout the segment database).