In the description below, detailed references are given in the list of documents at the end of the description for documents cited with the reference in abbreviated form in square brackets ([ . . . ]).
Digitized speech modification techniques prove very useful in numerous speech processing applications. In speech synthesis, they provide prosody modifications (modification of pitch and rhythm) that are often necessary to confer an acceptable intonation on a synthesized speech signal. In the field of voice conversion, the objective is to modify the speech signal from a source speaker so that it appears to have been spoken by a required target speaker. This requires adapting both timbre and pitch. There are also voice transformation applications seeking to modify perceived speech only on the basis of a set of target descriptors (low/high voice, masculine/feminine/child-like voice, robot voice, etc.).
Most known speech modification techniques essentially aim to modify three types of parameters:
Perceived pitch, measured by the fundamental frequency of the speech signal concerned, i.e. the frequency of vibration of the vocal cords.
Speed, directly related to the time taken to pronounce the various phonemes of the speech signal concerned. This time could be the total duration of an ordinary sentence, for example.
Timbre, which can be defined as the perceptual attribute that characterizes the difference between two sounds otherwise similar in terms of pitch, intensity, and duration. The timbre comprises both an information component (linked to the phonemes spoken) and an identity component (linked to the speaker: for example, a voice that is hoarse, clear, gentle, etc.). The timbre is often described by the spectral envelope of the speech signal. The spectral envelope is the envelope curve of the amplitudes of the spectrum peaks seen in the speech signal.
The above three parameter types are not independent of one another, in the sense that a modification applied to one of these parameters necessarily affects the others. This implies modifying these parameters consistently. In particular, combined modification of pitch and timbre is necessary to preserve the natural sound of the resulting speech. For example, it is demonstrated in the document [Syr85] (see list of reference documents at the end of the description) that the first formant and the fundamental frequency are closely linked, so that any change to one of these parameters must be accompanied by an appropriate modification to the other. A formant corresponds to a resonance of the vocal tract, and is characterized by its center frequency and its bandwidth. That center frequency is reflected by a peak in the spectral envelope.
Speech signal modification techniques that modify the perceived pitch without at the same time modifying the timbre are known. They include the TD-PSOLA and HNM techniques, for example.
The TD-PSOLA (Time Domain Pitch Synchronous Overlap and Add) technique described in European Patent EP0363233, for example, or in the document [Mou95], is based on decomposing a speech signal into short-term, pitch-synchronous analysis signals that are then repositioned on the time axis and recombined by overlap and add. The TD-PSOLA technique makes prosody modifications to the speech signal such as duration expansion/contraction (known as time-stretching) or changing the fundamental frequency (pitch), while at the same time preserving good sound quality. Here “good sound quality” means the absence of breaks, noise, or other artifacts that make a signal uncomfortable for a listener. It therefore does not cover whether the voice timbre sounds natural.
However, with the TD-PSOLA technique, although the time-stretching factors used can be as high as 2 without significant distortion of the signal, the possibilities for modifying the fundamental frequency remain relatively limited if the resulting speech signal is to sound natural. In the TD-PSOLA technique, modification of pitch is not accompanied by modification of timbre. As mentioned above, combined modification of pitch and timbre is necessary to preserve the natural sound of the resulting speech.
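The decomposition and repositioning described above for TD-PSOLA can be sketched as follows. This is a minimal illustration, not the method of EP0363233 or [Mou95]: it assumes the pitch marks (one sample index per pitch period) are already available, uses a Hann-shaped window spanning the two neighboring periods, and applies no amplitude normalization. The function name `td_psola` and its parameters are hypothetical.

```python
import math

def td_psola(signal, pitch_marks, pitch_factor):
    """Change the pitch of `signal` by `pitch_factor`, keeping its duration.

    pitch_marks: increasing sample indices, one per pitch period
                 (assumed given; at least three marks are needed).
    """
    n = len(signal)
    out = [0.0] * n
    t = float(pitch_marks[1])      # first synthesis instant
    last = pitch_marks[-2]         # last mark that has two neighbors
    while t <= last:
        # Choose the analysis mark closest to the current synthesis instant.
        i = min(range(1, len(pitch_marks) - 1),
                key=lambda j: abs(pitch_marks[j] - t))
        left = pitch_marks[i] - pitch_marks[i - 1]
        right = pitch_marks[i + 1] - pitch_marks[i]
        center = int(round(t))
        # Copy the windowed short-term signal to its new position (overlap-add).
        for d in range(-left, right + 1):
            half = left if d < 0 else right
            w = 0.5 * (1.0 + math.cos(math.pi * d / half))
            src, dst = pitch_marks[i] + d, center + d
            if 0 <= src < n and 0 <= dst < n:
                out[dst] += w * signal[src]
        # Synthesis marks are spaced by the local period divided by the factor.
        t += right / pitch_factor
    return out
```

With `pitch_factor = 1` and evenly spaced marks, the overlapping windows sum to one and the interior of the signal is reproduced unchanged. With other factors the short-term signals are only re-spaced, so the spectral envelope, and hence the timbre, is left as it was, which is the limitation the text points out.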
The voice modification technique based on the HNM model is described in the document [Sty96], for example. The harmonic plus noise model (HNM) has also been used for prosody modification and even for spectral modification. It assumes that a voiced segment (also known as a frame) of the speech signal S(n) can be decomposed into two portions: a harmonic portion, representing the quasi-periodic component of the signal and consisting of a sum of L harmonic sinusoids each of amplitude A_l and phase Φ_l (l = 1, …, L), and a noise portion, representing friction noise and glottal excitation variation from one period to another, modeled by Gaussian white noise exciting an AR (auto-regressive) filter obtained by linear predictive coding (LPC) analysis. For a non-voiced frame, the harmonic portion is absent and the signal is simply modeled by white noise shaped by AR filtering. For synthesis, the amplitude and the phase of the harmonic portion are re-estimated as a function of the required pitch instructions to preserve the timbre of the original signal (i.e. the spectral envelope) as much as possible. This re-estimation is valid for the amplitude information, provided that a sufficiently smooth spectral envelope is available. However, re-estimating phase is much more complex and must allow for the phase spectra of both the glottal source and the filter characterizing the vocal tract, this information being difficult to extract in both cases. As a result, the harmonic plus noise model fails to preserve the coherence of the modified signals and therefore degrades the quality of the resulting speech.
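The frame model described above (a sum of harmonic sinusoids plus white noise shaped by an AR filter) can be sketched as follows. This is a simplified illustration, not the full model of [Sty96]: the function names, the `noise_gain` parameter, and the AR sign convention y[n] = e[n] − Σ a_k·y[n−k] are assumptions of the sketch, and the AR coefficients are taken as given rather than estimated by LPC analysis.

```python
import math
import random

def harmonic_part(f0, fs, amps, phases, num_samples):
    """Sum of L = len(amps) sinusoids at integer multiples of f0 (in Hz)."""
    return [
        sum(a * math.cos(2.0 * math.pi * (l + 1) * f0 * n / fs + p)
            for l, (a, p) in enumerate(zip(amps, phases)))
        for n in range(num_samples)
    ]

def noise_part(ar_coeffs, num_samples, gain=1.0, seed=0):
    """Gaussian white noise passed through an all-pole (AR) filter."""
    rng = random.Random(seed)
    y = []
    for n in range(num_samples):
        acc = gain * rng.gauss(0.0, 1.0)
        # Assumed convention: y[n] = e[n] - sum_k a_k * y[n - k]
        for k, a in enumerate(ar_coeffs, start=1):
            if n - k >= 0:
                acc -= a * y[n - k]
        y.append(acc)
    return y

def hnm_frame(f0, fs, amps, phases, ar_coeffs, num_samples,
              voiced=True, noise_gain=0.05, seed=0):
    """Voiced frame: harmonic + noise portions. Unvoiced frame: noise only."""
    noise = noise_part(ar_coeffs, num_samples, gain=noise_gain, seed=seed)
    if not voiced:
        return noise
    harm = harmonic_part(f0, fs, amps, phases, num_samples)
    return [h + v for h, v in zip(harm, noise)]
```

Changing `f0` here while keeping `amps` and `phases` fixed would move the harmonics without re-sampling the spectral envelope, which is why the text says amplitudes and phases have to be re-estimated for each new pitch target; the phase values are the part the text identifies as hard to obtain.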
Unlike the above techniques, other known voice modification techniques operate on perceived pitch and on timbre.
The resampling technique adapts a signal (not necessarily a speech signal) to a modification of its sampling frequency. Applied to a speech signal, this technique modifies pitch, timbre, and speed conjointly, preserving excellent sound quality. The resampling technique is described in the document [Mou95]. According to that document, to accelerate a signal by an integer factor P, low-pass filtering is applied first, after which the signal is decimated by discarding P-1 samples out of every P. To slow an audio or speech signal by an integer factor Q, Q-1 zeros are inserted between each pair of consecutive signal samples, after which low-pass filtering with an appropriate cut-off frequency is applied.
As a general rule, the resampling factor γ is not an integer, but can be approximated by a rational number P/Q. When γ=P/Q, it suffices to combine the two kinds of processing: oversampling by a factor Q followed by undersampling by a factor P.
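The combination described above (oversampling by Q, then undersampling by P) can be sketched as follows. This is a minimal, unoptimized illustration rather than the procedure of [Mou95]: the windowed-sinc filter design, the single shared low-pass filter, and the names `windowed_sinc_lowpass`, `resample_rational`, and `taps_per_phase` are assumptions of the sketch.

```python
import math

def windowed_sinc_lowpass(cutoff, num_taps):
    """Hamming-windowed sinc FIR low-pass; cutoff in cycles/sample (<= 0.5)."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]   # normalize to unit gain at DC

def resample_rational(signal, P, Q, taps_per_phase=16):
    """Resample by gamma = P/Q: insert zeros (x Q), low-pass, keep 1 in P."""
    # Oversampling: Q-1 zeros after every input sample.
    up = []
    for s in signal:
        up.append(s)
        up.extend([0.0] * (Q - 1))
    # One low-pass serves both steps: it removes the images created by
    # zero insertion and prevents aliasing in the decimation.
    cutoff = 0.5 / max(P, Q)
    num_taps = 2 * taps_per_phase * max(P, Q) + 1
    h = windowed_sinc_lowpass(cutoff, num_taps)
    delay = (num_taps - 1) // 2        # compensate the filter delay
    out = []
    # Undersampling: evaluate the filter only at every P-th position.
    for k in range(0, len(up), P):
        acc = 0.0
        for j, c in enumerate(h):
            idx = k + delay - j
            if 0 <= idx < len(up):
                acc += c * up[idx]
        out.append(Q * acc)            # gain Q restores the amplitude
    return out
```

For example, `resample_rational(x, 3, 2)` returns about two thirds as many samples as `x`; played back at the original sampling frequency, the signal is 1.5 times faster and every frequency component is multiplied by 1.5.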
Generally speaking, if the resampling factor γ applied is greater than (or less than) 1, the amplitude spectrum of the speech signal is expanded (or contracted), i.e. the positions of the harmonics and formants of the signal on the frequency axis are multiplied by γ. This kind of spectral transformation therefore affects timbre. It is also accompanied by multiplication of the fundamental frequency by the same coefficient γ, and therefore acts conjointly on pitch. Resampling is consequently an effective and relatively simple technique for modifying a speech signal: it modifies timbre and pitch conjointly, and it introduces no audible artifacts, because it preserves the time coherence of the signal and therefore does not distort the information conveyed.
However, resampling alone cannot effect relevant transformations of fundamental frequency and timbre. Resampling the speech signal causes formants to be shifted pro rata in the same direction as the fundamental frequency. Observation of natural speech signals shows that the range of fundamental frequency variation is much wider than the range of variation of formant frequencies. Applying a resampling factor equal to the required fundamental frequency modification factor is therefore reflected in excessive expansion/contraction of the spectral envelope and therefore significantly degrades the natural sound of the voice, for example causing “pipe voice” or “Donald Duck voice” effects.
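The limitation described above can be written out as a short worked relation. The notation is an assumption of this sketch, not the source's: F_0 is the fundamental frequency, F_k the center frequency of the k-th formant, and primes denote values after resampling by the factor γ.

```latex
F_0' = \gamma F_0, \qquad F_k' = \gamma F_k
\quad\Longrightarrow\quad
\frac{F_k'}{F_0'} = \frac{F_k}{F_0}
```

Resampling therefore locks every formant to the fundamental frequency. With purely illustrative values F_0 = 100 Hz and F_1 = 500 Hz, a factor γ = 2 chosen to double the pitch also moves the first formant to 1000 Hz, whereas, as noted above, formant frequencies in natural speech vary over a much narrower range than the fundamental frequency does.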
Another known technique operates conjointly on perceived pitch and timbre. This technique is described in the document [Kai00] and relies on a spectrum adjustment operation based on the use of a Gaussian mixture model to model pitch and spectral envelope conjointly. Accordingly, the spectral envelope is corrected as a function of the required fundamental frequency instruction, which preserves the natural sound of the transformed speech better, especially if large fundamental frequency modifications are made. This type of technique effects amplitude spectrum transformations that are relatively accurate and well-controlled. However, the phase information of the transformed signals is not well-controlled, which significantly degrades the quality of the resulting signal.
It emerges from the prior art as briefly described above that there is a real need for a speech signal modification technique that conjointly modifies at least the perceived pitch and the timbre associated with the speech signal, so that the resulting speech signal is of high quality in the sense that the perceived voice sounds natural.