Voice conversion is a process that converts a source speaker's speech to sound like a target speaker's speech. There are currently many applications for voice conversion. An important application is building customized text-to-speech (TTS) systems, in which a TTS system with a company's preferred voice can be created quickly and inexpensively by modifying the speech corpus of an original speaker. Voice conversion can also be used to generate special character voices and to preserve a speaker's identity in speech-to-speech translation; such converted speech can be used in a variety of applications, such as movie making, online games, voice chatting, and multimedia messaging services. The performance of a voice conversion system is usually evaluated by two criteria: the quality of the converted speech and its similarity to the target speaker. With state-of-the-art voice conversion technologies, there is typically a tradeoff between quality and similarity, and different applications place different emphasis on each. Generally speaking, high speech quality is an important requirement for the practical application of voice conversion technologies.
Spectral conversion is a key component of voice conversion systems. The two most popular spectral conversion methods are codebook mapping (cf. Abe, M., S. Nakamura, K. Shikano, and H. Kuwabara, “Voice Conversion through Vector Quantization,” Proc. ICASSP, New York, N.Y., U.S.A., 1988, pp. 655-658) and the Gaussian mixture model (GMM) conversion algorithm (cf. Stylianou, Y. et al., “Continuous Probabilistic Transform for Voice Conversion,” IEEE Transactions on Speech and Audio Processing, V. 6, No. 2, March 1998, pp. 131-142; and Kain, A. B., “High Resolution Voice Transformation,” Ph.D. thesis, Oregon Health and Science University, October 2001). However, although both kinds of methods have been improved recently, the quality degradation they introduce is still severe (cf. Shuang, Z. W., Z. X. Wang, Z. H. Ling, and R. H. Wang, “A Novel Voice Conversion System Based on Codebook Mapping with Phoneme-Tied Weighting,” Proc. ICSLP, Jeju, Korea, 2004). In comparison, another spectral conversion method, frequency warping, introduces less quality degradation (cf. Eichner, M., M. Wolff, and R. Hoffmann, “Voice Characteristic Conversion for TTS Using Reverse VTLN,” Proc. ICASSP, Montreal, PQ, Canada, 2004). Much work has been devoted to finding good frequency warping functions. For example, one approach was proposed by Eide, E. and H. Gish in “A Parametric Approach to Vocal Tract Length Normalization,” Proc. ICASSP, Atlanta, Ga., U.S.A., 1996, in which the warping function is based on the median of the third formant for each speaker. Some researchers extended this approach by generating warping functions based on formants belonging to the same phoneme. However, a formant frequency and its relationship with vocal tract length (VTL) depend strongly not only on the speaker's vocal shape and on the phoneme but also on the context, and can vary greatly with context even for the same speaker.
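The median-third-formant approach mentioned above can be sketched as follows. This is only an illustration, assuming magnitude spectra sampled on a linear frequency axis; the function names, the bin count, and the Nyquist frequency are hypothetical and not taken from any cited work:

```python
import numpy as np

def linear_warp_from_f3(source_f3_values, target_f3_values,
                        n_bins=257, nyquist_hz=8000.0):
    """Build a linear warping function f -> alpha * f, where alpha is
    the ratio of the two speakers' median third-formant frequencies
    (in the spirit of the median-F3 approach described above)."""
    alpha = np.median(target_f3_values) / np.median(source_f3_values)
    freqs = np.linspace(0.0, nyquist_hz, n_bins)      # analysis-bin frequencies
    warped = np.clip(alpha * freqs, 0.0, nyquist_hz)  # warped frequency per bin
    return freqs, warped

def warp_spectrum(spectrum, freqs, warped):
    """Resample a magnitude spectrum at the warped frequencies.

    Note: the direction of warping (source-to-target vs. its inverse)
    depends on the conversion setup; this sketch simply samples the
    input spectrum at the warped frequency of each output bin."""
    return np.interp(warped, freqs, spectrum)
```

Because the warp only remaps the frequency axis and does not regenerate spectral detail, it tends to preserve speech quality better than codebook or GMM mapping, which is the advantage noted above.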
The Chinese patent application with publication number CN101004911A, filed by the same applicant, discloses a novel solution that generates a frequency warping function by mapping formant parameters of the source speaker and the target speaker, in which alignment and selection processes are added to ensure that the selected formant mappings represent the difference between the speakers' voices well. This solution requires only a very small amount of training data to generate the warping function, which greatly facilitates its application. It can also achieve high quality of the converted speech while successfully making the converted speech similar to the target speaker. Nevertheless, with the above solution listeners can still clearly perceive a difference between the converted speech and the target speaker. This difference is caused by detailed spectral differences and cannot be removed by frequency warping alone.
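The general idea of a warping function built from mapped formant pairs can be illustrated with a piecewise-linear function anchored at aligned source-target formants. This is a minimal sketch that assumes suitable pairs have already been selected; it is not the alignment and selection procedure disclosed in CN101004911A, and all names are hypothetical:

```python
import numpy as np

def piecewise_warp(source_formants, target_formants, nyquist_hz=8000.0):
    """Return a piecewise-linear warping function whose anchor points are
    (0, 0), the aligned source->target formant pairs, and (Nyquist, Nyquist)."""
    src = np.concatenate(([0.0], np.sort(source_formants), [nyquist_hz]))
    tgt = np.concatenate(([0.0], np.sort(target_formants), [nyquist_hz]))

    def warp(f):
        # linear interpolation between consecutive anchor points
        return np.interp(f, src, tgt)

    return warp
```

For example, with source formants (500, 1500, 2500) Hz mapped to target formants (600, 1700, 2800) Hz, the function maps 500 Hz to 600 Hz and interpolates linearly in between, while the endpoints 0 Hz and the Nyquist frequency map to themselves.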
Among voice processing technologies there is another speech technology, namely text-to-speech (TTS). The most popular TTS technology is concatenative TTS, in which a speech database of a corpus speaker is recorded first and segments of the speaker's speech data are then concatenated by unit selection to synthesize new speech. In many commercial TTS systems, the speech database contains hours of recording. The smallest concatenation segments, or units, can be syllables, phonemes, or even 10-ms frames of speech data.
In a typical concatenative TTS system, the sequence of candidate segments, listed together with the prosodic targets generated by an estimation model, drives a Viterbi beam search for the sequence of units that minimizes the cost function. The search aims at selecting from the candidate units the unit sequence with the least total cost. The target cost can comprise a set of cost components, e.g., the f0 cost, which measures how far the f0 contour of the unit is from that of the target; the duration cost, which measures how far the duration of the unit is from that of the target; and the energy cost, which measures how far the energy of the unit is from that of the target (this component is not employed during search). The transition cost can comprise two components, one of which captures spectral smoothness across unit joins and the other of which captures pitch smoothness across unit joins. The spectral smoothness component of the transition cost can be based on the Euclidean distance between perceptually-modified Mel cepstral coefficients. The target cost components and the transition cost components are added together using weights that can be tuned by hand. Usually, the synthesized speech is perceived as spoken by the corpus speaker because it is in fact concatenated from the corpus speaker's speech units. However, since it is very difficult to simulate the speech generation procedure of a real human, the synthesized speech is usually perceived as unnatural and dull. Therefore, although traditional TTS systems preserve the speaker's identity, they lose naturalness because of imperfect target estimation.
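The unit-selection search described above can be sketched in simplified form. The cost components below are illustrative placeholders (a full system would compare f0 contours and perceptually-modified Mel cepstra and use hand-tuned weights), and the function names and data layout are hypothetical:

```python
import math

def target_cost(unit, target):
    # simplified f0 cost + duration cost; the energy cost is omitted
    # during search, as noted in the description above
    return abs(unit["f0"] - target["f0"]) + abs(unit["dur"] - target["dur"])

def transition_cost(prev, unit):
    # Euclidean distance between cepstral vectors as a stand-in for the
    # spectral smoothness component across unit joins
    return math.dist(prev["cep"], unit["cep"])

def viterbi_select(candidates, targets, w_t=1.0, w_c=1.0):
    """Dynamic-programming search for the unit sequence with least
    weighted target + transition cost; returns candidate indices."""
    n = len(candidates)
    cost = [[w_t * target_cost(u, targets[0]) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, brow = [], []
        for u in candidates[i]:
            # best predecessor for this candidate
            best_j = min(range(len(candidates[i - 1])),
                         key=lambda j: cost[-1][j]
                         + w_c * transition_cost(candidates[i - 1][j], u))
            row.append(cost[-1][best_j]
                       + w_c * transition_cost(candidates[i - 1][best_j], u)
                       + w_t * target_cost(u, targets[i]))
            brow.append(best_j)
        cost.append(row)
        back.append(brow)
    # backtrack from the best final candidate
    j = min(range(len(cost[-1])), key=lambda j: cost[-1][j])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path
```

A beam search would additionally prune all but the k best cumulative costs at each position; the sketch above performs the full dynamic-programming search for clarity.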
It is seen that the speech technologies in the prior art all have inherent limitations. There is a need for a voice conversion system that provides both high fidelity to the target speech and the naturalness of human speech.