The use of multiple languages in an article or a sentence is not uncommon, for example, the use of both English and Mandarin in text. When people need to transform the multi-lingual text into speech via synthesis, taking the contextual scenario into account is important when deciding how to process the text of non-native language. For example, in some scenario, the use of the non-native language with a slight hint of native language accent would sound more natural, such as, the multi-lingual sentences in e-books or e-mails to friends. The current multi-lingual text-to-speech (TTS) systems often use a plurality of synthesizers to switch for different languages; hence, the synthesized speech often includes speeches spoken by different people when multi-lingual text appears, and suffers the problem of interrupted prosody of speech.
Several documents have been disclosed on the subject of multi-lingual TTS. For example, U.S. Pat. No. 6,141,642 disclosed a TTS apparatus and method for processing multiple languages, by switching between multiple synthesizers for multi-lingual text.
Some patents disclosed techniques of mapping non-native language phonetics directly to native language phonetics without considering the difference of the acoustic-prosodic models between different languages. Some patents disclosed techniques of merging similar parts of acoustic-prosodic models of different languages and keeping the different parts without considering the weight of accents. Some papers disclosed techniques of, such as, HMM-based mixed-language, e.g., Mandarin-English, speech synthesizer also without considering accents.
A paper titled “Foreign Accents in Synthetic speech: Development and Evaluation” uses different phonetic mapping to handle the accent issue. Two other papers, “Polyglot speech prosody control” and “Prosody modification on mixed-language speech synthesis” handles the prosody issue, but not the acoustic-prosodic model issue. The paper, “New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer” uses acoustic-prosodic model adaption to construct non-native language acoustic-prosodic model, but not discloses the manner to control the weight of accent.