When people watch an audio/video file (such as a foreign movie), the language barrier usually presents a significant obstacle to understanding. Film distributors can currently translate foreign-language subtitles (such as English) into local-language subtitles (such as Chinese) in a relatively short period, and release a movie simultaneously with local-language subtitles for audiences to enjoy. However, the viewing experience of most audiences is still affected by reading subtitles, because the audience must switch rapidly between the subtitles and the scene. The negative effect of reading subtitles is particularly notable for children, elderly people, people with visual disabilities, and people with reading disabilities. To serve audience markets in other regions, audio/video file distributors may hire dubbing actors to provide the audio/video file with Chinese (or other language) dubbing. Such procedures, however, often take a long time to complete and consume a great deal of manpower.
Text to Speech (TTS) technology converts text information into voice information. U.S. Pat. No. 5,970,459 provides a method for converting movie subtitles into local-language voices with TTS technology. The method analyzes the original voice data and the lip shapes of the original speaker, converts text information into voice information with TTS technology, and then synchronizes the voice information with the lip movements, thereby producing a dubbed effect in the movie. Such a scheme, however, does not use voice morphing technology to make the synthesized voices similar to the original voices of the actors, so the resulting dubbed voice differs greatly in its acoustic features from the original voice.
Voice morphing technology can convert the voice of an original speaker into that of a target speaker. In the prior art, the frequency warping method is often used to convert the sound frequency spectrum of an original speaker into that of a target speaker, so that the corresponding voice data can be produced according to the acoustic features of the target speaker, including speaking speed and tone. Frequency warping is a method for compensating for the difference between the sound frequency spectra of different speakers, and is widely applied in the fields of speech recognition and voice conversion. Given a frequency spectrum section of a voice, the method generates a new frequency spectrum section by applying a frequency warping function, making the voice of one speaker sound like that of another speaker.
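The idea of generating a new spectrum section from a warping function can be illustrated with a minimal Python sketch. It is not taken from any cited method. The names `interp` and `warp_spectrum`, the 250 Hz grid, and the scaling factor 1.25 are assumptions chosen for illustration, and the warping function is assumed to map each output frequency to the source frequency whose magnitude is read.

```python
from bisect import bisect_right


def interp(x, xs, ys):
    """Linearly interpolate the samples (xs, ys) at x, clamping at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def warp_spectrum(freqs, magnitudes, warp):
    """Return a new spectrum section whose value at f is the source value at warp(f).

    `warp` is a hypothetical warping function mapping an output frequency
    to the source frequency it is read from (an assumed convention).
    """
    return [interp(warp(f), freqs, magnitudes) for f in freqs]


# Toy spectrum section: a single peak at 1000 Hz on a 250 Hz grid.
freqs = [250.0 * k for k in range(17)]            # 0 Hz .. 4000 Hz
source = [1.0 if f == 1000.0 else 0.0 for f in freqs]

# Reading the source at f / 1.25 stretches the spectrum upward,
# so the peak moves from 1000 Hz to 1250 Hz.
warped = warp_spectrum(freqs, source, lambda f: f / 1.25)
peak_frequency = freqs[warped.index(max(warped))]
```

In this sketch a single linear scaling stands in for the warping function. A practical system would estimate the function from speaker data and apply it frame by frame to the short-time spectrum.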
A number of automatic training methods for finding a well-performing frequency warping function have been proposed in the prior art. One method is maximum likelihood linear regression. For a description of the method, see L. F. Uebel and P. C. Woodland, “An investigation into vocal tract length normalization”, EUROSPEECH '99, Budapest, Hungary, 1999, pp. 2527-2530. However, this method needs a large amount of training data, which restricts its use in many applications.
Another method is to perform voice conversion with formant mapping technology. For a description of the method, see Zhiwei Shuang, Raimo Bakis, Yong Qin, “Voice Conversion Based on Mapping Formants”, in Workshop on Speech to Speech Translation, Barcelona, June 2006. In particular, the method obtains a frequency warping function from the relationship between the formants of a source speaker and those of a target speaker. A formant is a frequency region of higher sound intensity formed in the sound frequency spectrum by the resonance of the vocal tract itself during pronunciation. Because formants are related to the shape of the vocal tract, the formants of each person are usually different, and the formants of different speakers may be used to represent the acoustic differences between them. The method also makes use of fundamental frequency adjustment, so that only a small amount of training data is enough to perform frequency warping of a voice. However, this method does not solve the following problem: if the voice of the original speaker differs greatly from that of the target speaker, the sound quality impairment resulting from the frequency warping increases rapidly, thereby degrading the quality of the output voice.
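One way to picture how a warping function might be obtained from formant relationships is a piecewise-linear map anchored at matched formant pairs. The Python sketch below is an illustrative assumption, not the procedure of the cited paper. The function name `make_formant_warp`, the fixed end points at 0 Hz and the Nyquist frequency, and the formant values are all hypothetical. Here the map runs from source frequency to target frequency.

```python
from bisect import bisect_right


def make_formant_warp(source_formants, target_formants, nyquist):
    """Build a piecewise-linear map from source frequency to target frequency.

    Anchor points are (0, 0), each (source formant, target formant) pair,
    and (nyquist, nyquist). This anchoring scheme is an assumption.
    """
    if len(source_formants) != len(target_formants):
        raise ValueError("formant lists must have equal length")
    xs = [0.0, *source_formants, float(nyquist)]
    ys = [0.0, *target_formants, float(nyquist)]
    for seq in (xs, ys):
        if any(b <= a for a, b in zip(seq, seq[1:])):
            raise ValueError("anchor frequencies must be strictly increasing")

    def warp(f):
        # Clamp to the covered range, then interpolate within one segment.
        f = min(max(f, xs[0]), xs[-1])
        i = min(bisect_right(xs, f) - 1, len(xs) - 2)
        t = (f - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    return warp


# Hypothetical formant frequencies in Hz for one vowel of each speaker.
warp = make_formant_warp([500.0, 1500.0, 2500.0],
                         [600.0, 1800.0, 2700.0],
                         nyquist=8000.0)
mapped_first_formant = warp(500.0)   # lands on the target's first formant
```

The further the target formants lie from the source formants, the steeper or flatter the segments of this map become, which is one intuition for why heavy warping tends to degrade sound quality.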
In fact, there are two indices for measuring the merits of voice morphing: one is the quality of the converted voice, and the other is the degree of similarity between the converted voice and the voice of the target speaker. In the prior art, these two indices often constrain each other, and it is difficult to satisfy both at the same time. That is to say, even if current voice morphing technology were applied to the dubbing method of U.S. Pat. No. 5,970,459, it would still be difficult to produce a good dubbed effect.