Cross-lingual voice transformation is the process of transforming the characteristics of speech uttered by a source speaker in one language (L1, or first language) into speech that sounds as if it were uttered by a target speaker, using speech data of the target speaker in another language (L2, or second language). In this way, cross-lingual voice transformation may be used to render the target speaker's speech in a language that the target speaker does not actually speak.
Conventional cross-lingual voice transformation techniques may rely on phonetic mapping between a source language and a target language according to the International Phonetic Alphabet (IPA), or on acoustic mapping using a statistical measure such as the Kullback-Leibler Divergence (KLD). However, phonetic mapping or acoustic mapping between certain language pairs, such as English and Mandarin Chinese, may be difficult due to the phonetic and prosodic differences between the two languages. As a result, cross-lingual voice transformation based on phonetic mapping or acoustic mapping may yield synthesized speech that sounds unnatural and/or is unintelligible for certain language pairs.
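By way of illustration only, KLD-based acoustic mapping of the kind mentioned above may be sketched as follows. The sketch assumes that each phonetic unit is modeled by a single diagonal-covariance Gaussian over acoustic features and that each source-language unit is mapped to the target-language unit with the smallest symmetric KLD. The unit names, feature values, and function names are hypothetical and are not drawn from any particular conventional system.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def symmetric_kld(p, q):
    """Symmetric KLD: KL(p || q) + KL(q || p). Each model is (mean, var)."""
    return kl_diag_gauss(*p, *q) + kl_diag_gauss(*q, *p)


def map_units(source_units, target_units):
    """Map each source-language unit to its nearest target-language unit."""
    return {
        s_name: min(
            target_units,
            key=lambda t_name: symmetric_kld(s_model, target_units[t_name]),
        )
        for s_name, s_model in source_units.items()
    }


# Hypothetical two-dimensional toy models: name -> (means, variances).
source_units = {
    "en_a": ([0.0, 0.0], [1.0, 1.0]),
    "en_i": ([3.0, 3.0], [1.0, 1.0]),
}
target_units = {
    "zh_a": ([0.2, -0.1], [1.0, 1.0]),
    "zh_i": ([2.8, 3.1], [1.2, 0.9]),
}

mapping = map_units(source_units, target_units)
print(mapping)
```

A nearest-neighbor mapping of this kind always returns some target unit, even when no acoustically close counterpart exists in the target language, which is one way the mapping difficulties described above may arise.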