Speech analysis/synthesis techniques are concerned with analyzing a speech signal to obtain an intermediate representation and resynthesizing the speech signal from that representation. Speech characteristics such as pitch, duration and voice quality can be modified by modifying the intermediate representation obtained from the analysis.
A speech analysis/synthesis system is an important component of speech synthesis and audio processing applications, where a high-quality parametric speech analysis/synthesis method is often required to achieve flexible manipulation of speech parameters.
Common approaches to speech analysis/synthesis are based on the source-filter model, in which the human speech production system is modeled as a pulse train signal passed through a set of cascaded filters: a glottal flow filter, a vocal tract filter and a lip radiation filter. The pulse train signal is a periodic repetition of a unit impulse at intervals of one fundamental period.
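As a rough illustration of the source-filter chain described above, the following Python sketch passes a pulse train through three cascaded filters. The filter structures and coefficients (two one-pole low-pass stages for the glottal flow, a single two-pole resonator for the vocal tract, a first difference for lip radiation) are placeholder assumptions chosen for brevity and are not taken from any of the methods discussed here.

```python
import math

def pulse_train(num_samples, period):
    """Unit impulses repeated every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

def one_pole(signal, a):
    """First-order recursive low-pass: y[n] = x[n] + a * y[n-1]."""
    out, prev = [], 0.0
    for x in signal:
        prev = x + a * prev
        out.append(prev)
    return out

def resonator(signal, freq, radius):
    """Two-pole resonator; `freq` is in radians per sample."""
    b1 = 2.0 * radius * math.cos(freq)
    b2 = -radius * radius
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + b1 * y1 + b2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def differentiator(signal):
    """First difference y[n] = x[n] - x[n-1], a crude lip radiation model."""
    out, prev = [], 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out

def synthesize(num_samples, period):
    """Pulse train -> glottal flow filter -> vocal tract filter -> lip radiation."""
    source = pulse_train(num_samples, period)
    glottal = one_pole(one_pole(source, 0.9), 0.9)       # assumed glottal filter
    tract = resonator(glottal, 2.0 * math.pi * 0.05, 0.97)  # assumed single formant
    return differentiator(tract)
```

Calling `synthesize(400, 100)` produces four pitch periods of a voiced-like waveform; changing `period` changes the pitch while every filter stays fixed, which is the behavior the later paragraphs question for the glottal flow filter.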
A simplified version of the source-filter model has been widely adopted in speech analysis/synthesis techniques. This simplification merges the glottal flow filter and the lip radiation filter into the vocal tract filter. Speech analysis/synthesis methods based on the simplified model include PSOLA (Pitch-Synchronous OverLap Add), STRAIGHT and the MLSA (Mel Log Spectrum Approximation) filter.
When the fundamental frequency of a speech signal is modified, the simplified source-filter model reveals certain defects. The glottal flow signal is proportional to the volume velocity of the air flow through the glottis and represents the degree of glottal contraction. Since the fundamental frequency determines the frequency of glottal oscillation, the impulse response of the glottal flow filter should match the duration of a fundamental period, and the shape of the glottal flow should remain approximately invariant across fundamental frequencies, even though the length of a glottal flow period changes with the fundamental frequency. In the simplified source-filter model, however, the glottal flow filter is merged into the vocal tract filter under the assumption that the glottal flow filter response is independent of the fundamental frequency. This assumption contradicts the physics of speech production. As a result, after the fundamental frequency parameters are modified, speech analysis/synthesis methods based on the simplified source-filter model often fail to generate natural-sounding speech.
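The shape-invariance property described above can be sketched in a few lines of Python. The pulse shape below is a toy function with hypothetical opening and closing fractions (0.6 and 0.8 of the period); it stands in for a real glottal flow and is only meant to show that the same normalized shape is stretched or compressed to fit the fundamental period.

```python
import math

def glottal_shape(t):
    """Toy glottal flow over one normalized period, with t in [0, 1)."""
    open_end, close_end = 0.6, 0.8  # hypothetical phase boundaries
    if t < open_end:
        # Opening phase: smooth rise from 0 to 1.
        return 0.5 * (1.0 - math.cos(math.pi * t / open_end))
    if t < close_end:
        # Closing phase: faster fall from 1 to 0.
        return math.cos(0.5 * math.pi * (t - open_end) / (close_end - open_end))
    # Closed phase: no flow.
    return 0.0

def glottal_period(period):
    """Sample one glottal cycle whose length is `period` samples."""
    return [glottal_shape(n / period) for n in range(period)]

low_pitch = glottal_period(100)   # longer period, lower fundamental frequency
high_pitch = glottal_period(50)   # shorter period, higher fundamental frequency
```

Here `high_pitch` is a time-compressed copy of `low_pitch`, so its spectrum differs. A fixed filter merged into the vocal tract cannot reproduce this dependence on the fundamental period.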
Recently a number of methods have been proposed to overcome the above defects. Examples include SVLN (G. Degottex, et al. “Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis,” Speech Communication, vol. 55, no. 2, pp. 278-294, 2013.) and GSS (J. P. Cabral, K. Richmond, J. Yamagishi, and S. Renals, “Glottal Spectral Separation for Speech Synthesis,” IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 195-208, 2014.). In these methods the glottal flow and the vocal tract are modeled separately. Since the lip radiation filter behaves similarly to a differentiator, it is merged into the glottal flow filter, resulting in a glottal flow derivative filter. The glottal flow derivative is parameterized by an LF (Liljencrants-Fant) model. During the analysis stage, the parameters of the glottal source model are estimated first; next, the magnitude spectrum of the speech is divided by the magnitude response derived from the glottal source model, after which spectral envelope estimation is performed, yielding the vocal tract magnitude response. Under the minimum-phase assumption, the vocal tract frequency response can then be computed from the vocal tract magnitude response. The synthesis stage is equivalent to the reverse of the analysis procedure and is not described here.
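The last two analysis steps above, dividing out the glottal magnitude response and deriving a minimum-phase vocal tract response, can be sketched as follows. This is a minimal illustration using the standard cepstral folding construction of a minimum-phase response; it is an assumption that SVLN or GSS use this particular construction, the spectral envelope estimation step is omitted, and the function names are hypothetical. Inputs are full-length DFT magnitude arrays of even length.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for a short sketch)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out

def minimum_phase_response(magnitude):
    """Complex minimum-phase response whose modulus matches `magnitude`."""
    n = len(magnitude)  # assumed even, covering bins 0 .. n-1
    log_mag = [math.log(max(m, 1e-12)) for m in magnitude]
    # Real cepstrum of the log-magnitude spectrum.
    cep = [c.real for c in dft(log_mag, inverse=True)]
    # Fold the anti-causal half onto the causal half.
    folded = [0.0] * n
    folded[0] = cep[0]
    folded[n // 2] = cep[n // 2]
    for i in range(1, n // 2):
        folded[i] = 2.0 * cep[i]
    # exp(DFT(folded cepstrum)) gives magnitude and minimum phase together.
    return [cmath.exp(c) for c in dft(folded)]

def vocal_tract_response(speech_mag, glottal_mag):
    """Divide out the glottal magnitude, then impose minimum phase."""
    tract_mag = [s / g for s, g in zip(speech_mag, glottal_mag)]
    return minimum_phase_response(tract_mag)
```

Any error in the estimated glottal magnitude passes directly into `tract_mag`, which is one way that glottal parameter estimation errors can reach the synthesized speech.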
To a certain extent the SVLN and GSS methods improve the quality of pitch-shifted speech, but several issues still cause quality degradation. First, the quality of the synthesized speech is affected by the accuracy of parameter estimation for the glottal model. When the estimated glottal parameters deviate from the true values or are subject to spurious fluctuations over time, the resynthesized speech may contain glitches or sound different from the original speech signal. Second, methods based on a parametric glottal model are limited by the expressivity of that model: certain types of glottal flow patterns may not be covered by the parameter space. In such a situation an approximate glottal flow pattern is used instead, which eventually leads to poorly reconstructed speech.
A recently proposed speech analysis/synthesis method, HMPD (G. Degottex and D. Erro, “A uniform phase representation for the harmonic model in speech synthesis applications,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, no. 1, 2014.), does not require glottal source model parameter estimation and is thus more robust to a certain extent. Based on a harmonic model, the analysis stage of HMPD first estimates the vocal tract phase response; next, the vocal tract component is subtracted from the vector of harmonic phases to obtain the glottal source phase response at each harmonic. Finally, the phase distortion of the glottal source, a feature similar to the group delay function, is computed. When performing pitch modification, the phase distortion is first unwrapped and then interpolated according to the new fundamental frequency. A problem with this approach is that the phase unwrapping operation is prone to errors, especially on high-pitched speech, where it is likely to generate speech parameter sequences that are discontinuous across frames. In addition, this approach assumes that the glottal source has a uniform magnitude response; as a result, the method does not model the influence of the fundamental frequency on the magnitude response of the glottal flow filter.
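The sensitivity of phase unwrapping to pitch can be illustrated with a toy example that is not part of HMPD itself. Assuming a component with a constant delay (the 2 ms value and the function names are hypothetical), the phase step between adjacent harmonics grows with the fundamental frequency. Once that step exceeds π in magnitude, a standard unwrapping rule picks the wrong branch.

```python
import math

TWO_PI = 2.0 * math.pi

def wrap(phase):
    """Map a phase to the principal interval [-pi, pi)."""
    return (phase + math.pi) % TWO_PI - math.pi

def unwrap(phases):
    """Standard unwrapping: assume adjacent values differ by less than pi."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= TWO_PI * round(delta / TWO_PI)
        out.append(out[-1] + delta)
    return out

def harmonic_phases(f0, delay, count):
    """True (unwrapped) phases of a pure delay sampled at harmonics of f0."""
    return [-TWO_PI * k * f0 * delay for k in range(count)]

def unwrap_error(f0, delay=0.002, count=10):
    """Largest gap between the true phases and the unwrapped measurement."""
    true = harmonic_phases(f0, delay, count)
    measured = [wrap(p) for p in true]
    recovered = unwrap(measured)
    return max(abs(t - r) for t, r in zip(true, recovered))

low_error = unwrap_error(100.0)   # step of about 1.26 rad per harmonic
high_error = unwrap_error(400.0)  # step of about 5.03 rad per harmonic
```

With these assumed values the 100 Hz case is recovered to within rounding error, while the 400 Hz case is off by multiples of 2π. A small change in the measured phases can then flip the chosen branch from one frame to the next.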
Based on a harmonic model, the present invention decomposes the harmonic model parameters into glottal source and vocal tract components. Utilizing the shape-invariant property of glottal flow signals, and by preserving the difference between the phases of the glottal source harmonics and the phases generated from a glottal flow model, the present invention effectively reduces the impact of glottal flow parameter estimation accuracy on the quality of the synthesized speech. A simplified variant of the present method implicitly models the glottal source characteristics without depending on any specific parametric glottal flow model and thus simplifies the speech analysis/synthesis procedures. The method and its variant disclosed in the present invention do not involve a phase unwrapping operation and therefore avoid the problem of discontinuous speech parameters. When the speech parameters are unmodified, the method and its variant do not introduce harmonic amplitude or phase distortion, guaranteeing perfect reconstruction of the harmonic model parameters.
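The idea of preserving a phase difference without unwrapping can be sketched as follows. This is an illustrative reading of the summary above, not the disclosed procedure. The function names are hypothetical, only phases are handled (amplitudes are omitted), and the vocal tract and glottal model phases are assumed to be supplied by earlier analysis steps.

```python
import math

TWO_PI = 2.0 * math.pi

def wrap(phase):
    """Map a phase to the principal interval [-pi, pi)."""
    return (phase + math.pi) % TWO_PI - math.pi

def analyze_phases(harmonic_phases, tract_phases, model_phases):
    """Residual between the glottal source phases and the model phases.

    The source phase at each harmonic is the harmonic phase minus the vocal
    tract phase; the residual is what the glottal flow model fails to explain.
    Only wrapped arithmetic is used, so no unwrapping is needed.
    """
    return [wrap(h - v - g)
            for h, v, g in zip(harmonic_phases, tract_phases, model_phases)]

def resynthesize_phases(residuals, tract_phases, model_phases):
    """Rebuild harmonic phases from the residual and (possibly new) components."""
    return [wrap(d + v + g)
            for d, v, g in zip(residuals, tract_phases, model_phases)]

# Hypothetical per-harmonic values for a single frame.
harmonics = [0.3, -2.9, 1.7, 3.0, -0.4]
tract = [0.1, -0.8, 2.5, -1.2, 0.6]
model = [0.0, -1.5, -2.8, 3.1, 0.9]

residual = analyze_phases(harmonics, tract, model)
rebuilt = resynthesize_phases(residual, tract, model)
```

When `tract` and `model` are passed back unchanged, `rebuilt` equals `harmonics` modulo 2π regardless of how accurate `model` is, which mirrors the perfect-reconstruction property stated above. For pitch modification, the model phases would be regenerated at the new fundamental frequency while the residual is kept.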