Frequency warping, of which Vocal Tract Length Normalization (VTLN) is a special case, is a well-studied method for compensating for the differences between the acoustic spectra of different speakers, and it is widely used in speech recognition and voice conversion. Given a spectral cross section of one sound, the method creates a new spectral cross section by applying a frequency warping function. For speech recognition, the new cross section may serve directly as input to the recognition algorithms. In other applications, a new, modified sound may be needed. For example, in applications such as on-line game chatting, call centers, and multimedia message services, frequency warping may be used to perform speaker identity conversion, that is, to make the voice of one speaker sound like that of another speaker. To this end, the original sound can be modified, for example by means of a linear filter, or a new sound can be synthesized, for example as a sum of sinusoids, to conform to the new spectral cross section.
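The core operation described above, producing a new spectral cross section by applying a frequency warping function, can be sketched as follows. This is a minimal illustration, not any particular prior-art system: the function name `warp_spectrum`, the sample rate, and the use of linear interpolation between bins are all assumptions made for the example.

```python
import numpy as np

def warp_spectrum(spectrum, warp_fn, sample_rate=16000):
    """Apply a frequency warping function to one magnitude-spectrum
    cross section by resampling it along the warped frequency axis.

    spectrum : 1-D array of magnitudes for bins covering 0 Hz..Nyquist
    warp_fn  : maps a frequency (Hz) in the new cross section to the
               frequency (Hz) of the original spectrum it is read from
    """
    n_bins = len(spectrum)
    nyquist = sample_rate / 2.0
    freqs = np.linspace(0.0, nyquist, n_bins)   # frequency of each bin
    src_freqs = np.clip([warp_fn(f) for f in freqs], 0.0, nyquist)
    # Linearly interpolate the original spectrum at the warped frequencies.
    return np.interp(src_freqs, freqs, spectrum)

# Example: a simple linear VTLN-style warp with factor alpha (assumed value).
alpha = 1.1
new_cross_section = warp_spectrum(np.random.rand(257), lambda f: alpha * f)
```

The resulting cross section could then be fed to a recognizer, or used to drive a filter or sinusoidal synthesizer, as the passage above describes.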
Many automatic training methods for finding a good frequency warping function have been proposed in the prior art. One is the Maximum Likelihood Linear Regression method. A description of this method can be found in an article by L. F. Uebel and P. C. Woodland, entitled “An investigation into vocal tract length normalization,” EUROSPEECH'99, Budapest, Hungary, 1999, pp. 2527-2530. However, this method requires a large amount of training data, which limits its usefulness in many applications. Another method is to use linear or piecewise-linear warping functions, and to train the warping function by dynamic programming, minimizing the distance between the converted source spectrum and the target spectrum. A description of this method can be found in an article by David Sundermann and Hermann Ney, “VTLN-Based Voice Conversion”, ICSLP 2004, Jeju, Korea, 2004. However, few published frequency warping systems are actually based on this method, because its results can be seriously degraded by noise in the input spectra.
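The idea of training a warping function by minimizing a spectral distance can be illustrated with a deliberately simplified sketch. For clarity, a single linear warp factor and a grid search stand in for the piecewise-linear functions and the dynamic-programming search of the cited work; the function names and the search range are assumptions made for this example.

```python
import numpy as np

def linear_warp(spectrum, alpha):
    """Warp a magnitude spectrum by a single linear factor alpha
    (a one-segment special case of a piecewise-linear warp)."""
    n = len(spectrum)
    bins = np.arange(n)
    # Each output bin is read from source position alpha * bin.
    return np.interp(np.clip(alpha * bins, 0, n - 1), bins, spectrum)

def fit_warp_factor(source, target, alphas=np.linspace(0.8, 1.25, 91)):
    """Choose the warp factor minimizing the Euclidean distance between
    the warped source spectrum and the target spectrum (grid search here,
    in place of the dynamic programming used in the cited method)."""
    dists = [np.sum((linear_warp(source, a) - target) ** 2) for a in alphas]
    return alphas[int(np.argmin(dists))]
```

The sensitivity to noise mentioned above is visible even in this sketch: the fitted factor is determined entirely by the distance surface, so noisy peaks in either spectrum can shift its minimum.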
In view of the shortcomings of the above methods, another kind of frequency warping method has been proposed that utilizes the acoustic features of the speakers' voices. Specifically, a frequency warping function is obtained from the formant relations between the source speaker and the target speaker. Formants are frequency regions of higher sound intensity that form in the speech spectrum due to the resonance of the vocal tract itself. Because formants are related to the shape of the vocal tract, each person's formants are different, and the matching formants between different speakers can characterize the difference between those speakers.
The prior art methods for obtaining a frequency warping function from formants typically use statistical methods to extract averages of certain formant frequencies from the training speech data of the source speaker and the target speaker respectively, and derive the frequency warping function from the relationship between these statistical values for the two speakers. Such methods can be seen in E. B. Gouvea and R. M. Stern, “Speaker Normalization Through Formant-Based Warping of the Frequency Scale”, 5th EUROSPEECH, Volume 3, September 1997, pages 1139-1142, and E. Eide and H. Gish, “A parametric approach to vocal tract length normalization”, Proceedings of ICASSP'96, Atlanta, USA, 1996, 312. Considering that different phonemes uttered by the same speaker have different formants, an improved method has been proposed that derives the frequency warping function from the formants of the same phonemes, producing matching formants that better reflect the difference between the speakers.
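A formant-based warping function of the kind described above can be sketched as a piecewise-linear map whose interior knots are the matched formant averages of the two speakers. The formant values below are illustrative numbers, not measurements, and the function name and Nyquist anchor are assumptions made for the example.

```python
import numpy as np

def formant_warp_fn(src_formants, tgt_formants, nyquist=8000.0):
    """Build a piecewise-linear frequency warping function from matched
    formant averages (in real systems these averages are estimated
    statistically from each speaker's training speech)."""
    # Anchor the map at 0 Hz and Nyquist, with the matched formant
    # averages as interior knots; both knot lists must be increasing.
    tgt_knots = np.concatenate(([0.0], tgt_formants, [nyquist]))
    src_knots = np.concatenate(([0.0], src_formants, [nyquist]))

    def warp(f):
        # For a frequency f in the converted spectrum, return the source
        # frequency it should be read from, so that each source formant
        # average lands on the corresponding target formant average.
        return np.interp(f, tgt_knots, src_knots)

    return warp

# Illustrative average F1-F3 values in Hz for two speakers (assumed numbers).
warp = formant_warp_fn([730.0, 1090.0, 2440.0], [850.0, 1220.0, 2810.0])
```

Because the whole function is determined by a handful of averaged formant frequencies, this approach needs far less data than likelihood-based training, which is precisely why its sensitivity to context, discussed next, matters.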
However, the formants and their relation to the vocal tract length (VTL) depend not only on the speaker's vocal tract shape and on the particular phonemes uttered, but also, to a high degree, on the context, so the formants of the same speaker may vary significantly across contexts. Therefore, extracting formant parameters by pooling phonemes from different contexts, even with a large amount of training data, cannot reflect the difference between the speakers' actual speech organs, and its results are naturally unsatisfactory.
There exists a need for a new method for generating a good frequency warping function which uses a small amount of training data and which overcomes the shortcomings in the prior art.