The general framework of a modern commercial TTS system is shown in FIG. 1. An input text, for example “Hello World”, is transformed into a linguistic description using linguistic resources in the form of lexica, rules and n-grams. The text normalization step converts special characters, numbers, abbreviations, etc. into full words. For example, the text “123” is converted into “one hundred and twenty three” or “one two three”, depending on the application. Next, linguistic analysis is performed to convert the orthographic form of the words into a phoneme sequence. For example, “hello” is converted to “h@-loU” using a phonetic alphabet. Further linguistic rules enable the TTS program to assign intonation markers and rhythmic structure to the sequence of words or phonemes in a sentence. The end product of the linguistic analysis is a linguistic description of the text to be spoken, which is the input to the speech generation module of the TTS system.
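The two readings of a digit string described above can be illustrated with a minimal sketch. It is not taken from any particular TTS product: the function names, the `digit_mode` parameter and the restriction to whitespace-separated tokens of up to three digits are assumptions made for illustration only.

```python
# Illustrative text normalization for digit strings (hypothetical helper names).
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def spell_digits(token: str) -> str:
    """Read a digit string one digit at a time, e.g. for codes."""
    return " ".join(ONES[int(d)] for d in token)


def spell_cardinal(n: int) -> str:
    """Read an integer in the range 0..999 as a cardinal number."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} and {spell_cardinal(rest)}"


def normalize(text: str, digit_mode: str = "cardinal") -> str:
    """Replace digit tokens by words; all other tokens pass through unchanged."""
    words = []
    for token in text.split():
        if token.isascii() and token.isdigit():
            # Assumption: tokens longer than three digits are read digit by digit.
            if digit_mode == "digits" or len(token) > 3:
                words.append(spell_digits(token))
            else:
                words.append(spell_cardinal(int(token)))
        else:
            words.append(token)
    return " ".join(words)
```

With these assumptions, `normalize("123")` gives the cardinal reading and `normalize("123", digit_mode="digits")` gives the digit-by-digit reading. A real normalizer would also handle punctuation, abbreviations, dates and larger numbers.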
State-of-the-art TTS systems use one of two methods to generate a speech signal. The unit selection method is based on the selection of speech segments (or “units”) from a large database of speech segments. The segments typically stem from a single speaker or voice talent. The speaker typically records several hundred sentences, corresponding to several hours of speech. The HMM-based speech synthesis method is based on generating speech parameters using a statistical model. The statistical model is trained on a database of recorded speech. The speech can stem from multiple speakers. Speaker adaptation techniques can convert the voice identity of speech generated by a statistical model.
EP 1 835 488 B1 discloses a method for converting an input linguistic description into a speech waveform comprising the steps of deriving at least one target unit sequence corresponding to the linguistic description, selecting from a waveform unit database a plurality of alternative unit sequences approximating the at least one target unit sequence, concatenating the alternative unit sequences to alternative speech waveforms, and having an operating person choose one of the alternative speech waveforms. Finding the best speech waveform thus depends on an operating person.
There is an increasing demand for TTS systems that can render texts with foreign language inclusions. For example, in German texts English terminology is commonly used, or reference is made to songs or movie titles in a language like French or Italian. In the navigation domain, drivers crossing a language border expect their navigation system to pronounce foreign location names. The pronunciation should be intelligible and match the driver's proficiency in the foreign language, while using the voice the driver has selected.
State-of-the-art TTS systems are typically designed for mono-lingual text input. Several TTS systems support multiple languages, but during operation only one language is activated at a time. Foreign language inclusions are often pronounced using letter-to-sound rules that are valid for the activated language but are inappropriate for the foreign language. For example, the French location name “Bois de Boulogne” may be pronounced by a mono-lingual English TTS as “Boys day Bow-log-gen”.
A known approach to support mixed lingual input is to include a list of foreign words in the pronunciation lexicon of a mono-lingual TTS. The foreign words are transcribed using the phoneme set of the native language. Then, “Bois de Boulogne” may be pronounced as “Bwa duh Boo-lon-je”. This approach still has several disadvantages. The added word list requires memory space, which is costly for embedded systems. The list has to be derived for each combination of a foreign and native language, where the foreign language determines which words need to be added, and the native language determines the phoneme set to be used for the transcription. Finally, the foreign word list is finite, leaving words or names in the foreign language without a proper transcription.
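The lexicon-based approach and its finite-coverage limitation can be sketched as follows. The dictionary contents reuse the example above; the names `FOREIGN_WORDS`, `native_letter_to_sound` and `transcribe` are hypothetical, and the fallback is only a placeholder for real native-language letter-to-sound rules.

```python
from typing import Callable, Dict

# Hypothetical exception list: foreign entries transcribed with native-language sounds.
FOREIGN_WORDS: Dict[str, str] = {
    "bois de boulogne": "Bwa duh Boo-lon-je",
}


def native_letter_to_sound(text: str) -> str:
    """Placeholder for native-language letter-to-sound rules (not a real rule set)."""
    return "<native-rules:" + text + ">"


def transcribe(
    text: str,
    lexicon: Dict[str, str] = FOREIGN_WORDS,
    fallback: Callable[[str], str] = native_letter_to_sound,
) -> str:
    """Look the text up in the foreign word list; otherwise use the native rules."""
    key = " ".join(text.lower().split())  # case-fold and collapse whitespace
    if key in lexicon:
        return lexicon[key]
    # Any foreign word missing from the finite list ends up here.
    return fallback(text)
```

In this sketch every foreign name that is not listed falls through to the native rules, and a separate `FOREIGN_WORDS` table would be needed for each pair of foreign and native language.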
Another shortcoming of state-of-the-art TTS systems is that the output speech signal does not contain phonemes that occur in a foreign language but not in the native language. For example, the German vowel “ü” or the Spanish “rr” cannot be produced by an English TTS system. Human speakers, on the other hand, often become quite proficient at pronouncing foreign phonemes, even if they did not learn to pronounce these sounds during childhood. For example, many German speakers have no problem pronouncing the rhoticised English “r” in “great” or “road”. Human listeners who are proficient in the foreign language find a TTS system that does not mimic their phonetic proficiency level simplistic or primitive.
Phoneme mapping mechanisms are described in US 2007/0118377 A1 (Badino et al), in CAMPBELL Nick: “Talking Foreign Concatenative Speech Synthesis and the Language Barrier” EUROSPEECH 2001, vol. 1, 2001, page 337, XP007005007 AALBORG, DK, and in CAMPBELL Nick: “Foreign-Language Speech Synthesis” PROCEEDINGS OF ESCA/COCOSDA WORKSHOP ON SPEECH SYNTHESIS, 26 Nov. 1998-29 Nov. 1998, pages 177-180, XP002285739 JENOLA CAVES HOUSE, Australia. The phoneme mapping in Badino and in Campbell is based on phonetic feature vectors, where the vector components are acoustic characteristics. For each phoneme in each language a set of phonetic articulatory features is defined, such as voicedness, place of articulation, and manner of articulation (vowel, consonant, diphthong, unstressed/stressed, long, nasalized, rounded, front, central, back, plosive, nasal, trill, tap/flap, fricative, lateral, affricate, bilabial, labiodental, dental, alveolar, palatal, uvular, glottal, aspirated, semiconsonant . . . ). A distance measure is defined as a weighted combination of feature differences. The disadvantages of this approach are that the weights are difficult to tune and that the phoneme mapping with the lowest distance is often not perceptually optimal. The approach also requires the distance calculations to be repeated for each speech unit.
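A weighted feature distance of this general kind can be illustrated with a minimal sketch. It is not the scoring scheme of Badino or of Campbell: the tiny feature inventories, the weight values and the function names are assumptions chosen only to show why the result depends on how the weights are tuned.

```python
from typing import Dict, FrozenSet

Features = FrozenSet[str]

# Toy inventories (illustrative only): one foreign phoneme, three native candidates.
FOREIGN: Dict[str, Features] = {
    "y": frozenset({"vowel", "front", "rounded"}),  # German "ü"
}
NATIVE: Dict[str, Features] = {
    "i": frozenset({"vowel", "front"}),
    "u": frozenset({"vowel", "back", "rounded"}),
    "a": frozenset({"vowel", "open"}),
}

# Hypothetical weights; in practice these would have to be tuned.
WEIGHTS: Dict[str, float] = {
    "vowel": 3.0, "front": 1.0, "back": 1.0, "rounded": 1.0, "open": 1.0,
}


def distance(a: Features, b: Features, weights: Dict[str, float]) -> float:
    """Sum the weights of all features present in exactly one of the two phonemes."""
    return sum(weights.get(feature, 1.0) for feature in a ^ b)


def map_phoneme(target: Features, native: Dict[str, Features],
                weights: Dict[str, float]) -> str:
    """Return the native phoneme with the lowest weighted feature distance."""
    # Sorting the candidates makes ties resolve deterministically.
    return min(sorted(native), key=lambda p: distance(target, native[p], weights))
```

With the weights above, “y” maps to “i” (distance 1.0, against 2.0 for “u”). Raising the weight of “rounded” to 3.0 flips the result to “u”, which shows how sensitive the mapping is to the weight settings; neither outcome is guaranteed to be the perceptually best substitute.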
According to Badino, the comparison between phonemes is carried out for each phoneme pair by comparing the corresponding vectors and allotting respective scores to the vector-to-vector comparisons. Using vectors of eight IPA standard components makes this comparison time-consuming.
Campbell discloses in “Talking Foreign” the use of mapping vectors based not on direct phone-to-phone pairs, but on vectors of articulatory features. In “Foreign-Language Speech Synthesis” it is disclosed that the waveform data (or its cepstral transform) is taken as a model to specify the acoustic characteristics of the desired speech. The acoustic similarity is compared by scoring every candidate phone.
Acoustic characteristics (vectors) have to be known for all phonemes to be compared. The resulting pronunciations of the foreign words sound simplistic to listeners with a more proficient knowledge of the foreign language. The known method is limited to expressing only a limited number of known words of a foreign language with waveform data of the native language.
Romsdorfer et al, Text Analysis and Language Identification for Polyglot Text-to-Speech Synthesis, Speech Communication, Vol 49/9, pp. 697-724, September 2007, describe a method to predict an accurate foreign transcription for words of a foreign language. The method is based on a modular approach that integrates foreign language morphology and syntax components in a joint framework. The method produces a linguistic description that can be sent to a speech generation module. However, the approach does not teach a method to synthesize foreign phonemes that are not in the phoneme set of the database speaker, so foreign phonemes still have to be mapped to phonemes spoken by the database speaker. Romsdorfer et al have used multi-lingual speakers to cover the phonemes of the languages in their mixed-lingual system. This approach does not generalize well, because it requires the voice talent for each native language to be proficient in each foreign language that is to be supported by the TTS system.
To overcome the phoneme set limitation of a given TTS voice, a known solution is to switch voices for foreign language text inclusions. This solution introduces an undesirable break between parts of the sentence in the native language and parts of the sentence in a foreign language. Often a long pause is introduced and the intonation flow is interrupted. Moreover, the voice identity changes, which is unnatural, for example, in the navigation domain (“Turn right onto the Bahnhofstrasse”) or for entertainment announcements (“You are listening to «Les Champs Elysées» by Joe Dassin”).
There have been some proposals in the literature to enrich voice databases with foreign sounds. For example, Conkie and Syrdal, “Expanding Phonetic Coverage in Unit Selection Synthesis through Unit Substitution from a Donor Voice”, Proceedings Interspeech, 2006, excise a “th” sound from an English voice database and add it to a South American Spanish database. Unfortunately this approach only produces satisfactory results for unvoiced sounds. For voiced and especially sonorant sounds, the speaker identity of the donated units interferes with the identity of the native voice. Donated sonorant units also can introduce large concatenation errors because their phonetic quality does not match that of the native units.
Latorre et al., “New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer”, Speech Communication, vol. 48, no. 10, pp. 1227-1242, October 2006, describe an approach where the acoustic models of an HMM synthesizer are trained from recordings of multiple speakers speaking multiple languages. The acoustic models are then matched to a target speaker using statistical speaker adaptation. However, the quality of HMM-based speech synthesis is lower than the quality of unit selection-based synthesis. This is due to the fact that the statistical modeling approach cannot preserve detailed information necessary for high fidelity speech.