While the quality of text-to-speech (TTS) synthesis has improved greatly in recent years, various telecommunication applications (e.g., information inquiry, reservation and ordering, and email reading) demand higher synthesis quality than current TTS systems can provide. In particular, with globalization and its accompanying mixing of languages, such applications can benefit from a multilingual TTS system in which one engine can synthesize multiple languages or even mixed languages. Most conventional TTS systems can deal with only a single language, where the sentences in the voice database are pronounced by a single native speaker. Although multilingual text can be read correctly by switching voices or engines at each language change, this is not practically feasible for code-switched text, in which the language changes occur within a sentence as words or phrases. Furthermore, with the widespread use of mobile phones and embedded devices, the footprint of a speech synthesizer becomes a factor for applications based on such devices.
Studies of multilingual TTS systems indicate that phonetic coverage can be achieved by collecting multilingual speech data, but language-specific information (e.g., specialized text analysis) is also required. A global phone set, which uses the smallest phone inventory that covers all phones of the languages involved, has been tried in multilingual or language-independent speech recognition and synthesis. Such an approach adopts phone sharing, with phonetic similarity measured by data-driven clustering methods or by phonetic-articulatory features defined by the International Phonetic Alphabet (IPA). Interest in the small-footprint aspects of TTS systems is intense, and Hidden Markov Model (HMM)-based speech synthesis tends to be the more promising approach in this respect. Some HMM synthesizers can have a relatively small footprint (e.g., ≤2 MB), which lends itself to embedded systems. In particular, such HMM synthesizers have been successfully applied to monolingual speech synthesis in many languages, e.g., English, Japanese and Mandarin. Such an HMM approach has also been applied for multilingual purposes, where an average voice is first trained using mixed speech from several speakers in different languages and the average voice is then adapted to a specific speaker. Consequently, the specific speaker is able to speak all the languages contained in the training data.
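The global phone set idea described above can be sketched as follows. This is a minimal illustration only: the two inventories are heavily truncated and hypothetical, the mapping from phone labels to IPA symbols is an assumption made for the example, and sharing is decided by exact IPA symbol match rather than by data-driven clustering.

```python
# Hypothetical, truncated inventories: phone label -> IPA symbol (illustrative only).
MANDARIN = {"m": "m", "f": "f", "s": "s", "l": "l", "sh": "ʂ", "x": "ɕ"}
ENGLISH = {"m": "m", "f": "f", "s": "s", "l": "l", "sh": "ʃ", "th": "θ", "v": "v"}


def build_global_phone_set(inventories):
    """Merge phones that share an IPA symbol; other phones stay language-specific.

    Returns a dict mapping each IPA symbol to the (language, label) pairs it covers.
    """
    global_set = {}
    for lang, inventory in inventories.items():
        for label, ipa in inventory.items():
            global_set.setdefault(ipa, []).append((lang, label))
    return global_set


def shared_symbols(global_set):
    """Return the IPA symbols used by more than one language, sorted."""
    return sorted(
        ipa
        for ipa, members in global_set.items()
        if len({lang for lang, _ in members}) > 1
    )


if __name__ == "__main__":
    inventories = {"zh": MANDARIN, "en": ENGLISH}
    global_set = build_global_phone_set(inventories)
    separate_total = sum(len(inv) for inv in inventories.values())
    print("separate inventories:", separate_total)
    print("global phone set:", len(global_set))
    print("shared symbols:", shared_symbols(global_set))
```

In this toy example the two inventories hold 13 phones in total, while the merged set holds 9, because four symbols are shared. The saving depends entirely on how many phones the languages actually have in common.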
Through globalization, English words or phrases embedded in Mandarin utterances are becoming increasingly common among students and educated people in China. However, Mandarin and English belong to different language families; the two languages are largely unrelated, in that few phones can be shared between them based on an examination of their IPA symbols.
A bilingual (Mandarin-English) TTS system is conventionally built from pre-recorded Mandarin and English sentences uttered by a bilingual speaker, where the unit selection module of the system is shared across the two languages while phones from the two languages are not shared with each other. Such an approach has certain shortcomings. The footprint of such a system is large, i.e., about twice the size of a single-language system. In practice, it is also not easy to find a sufficient number of professional bilingual speakers to build multiple bilingual voice fonts for various applications.
Various exemplary techniques discussed herein pertain to multilingual TTS systems. Such techniques can reduce a TTS system's footprint compared to existing techniques that require a separate TTS system for each language.