In general, various types of speech applications can be implemented using ASR (automatic speech recognition) systems and TTS (text-to-speech) synthesis systems. As is known in the art, ASR systems are typically implemented in speech-based systems to enable machine recognition of speech input by a user, and thereby enable command and control of, and conversational interaction with, the system. Moreover, TTS systems operate by converting textual data (e.g., a sequence of one or more words) into an acoustic waveform which can be output as a spoken utterance. TTS systems can be used in interactive voice response (IVR) systems, for example, to provide spoken output to a user.
In general, ASR systems are implemented using an acoustic vocabulary and a language vocabulary. In a language vocabulary (or word vocabulary), words are represented with an ordinary textual alphabet. In an acoustic vocabulary, the spoken sounds of words are represented by an alphabet consisting of a set of phonemes. The words that comprise the acoustic vocabulary are referred to as base forms. These base forms can be generated either manually or automatically by utilizing spelling-to-sound mapping techniques. For a given language, there can be several base forms for one word. By way of example, in the English language, the word “A” can have two different pronunciations and, therefore, two different base forms. A phonetic lexicon includes a word-to-base-form mapping table that stores the list of vocabulary words for a given language together with their corresponding base forms.
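The word-to-base-form mapping table described above can be sketched as a simple lookup structure. The sketch below is illustrative only: the phoneme symbols follow a generic ARPAbet-like alphabet, and the variable and function names are assumptions, not part of any particular ASR system.

```python
# Minimal sketch of a phonetic lexicon: each vocabulary word maps to a
# list of base forms, where each base form is a space-separated phoneme
# string. Phoneme symbols are illustrative (ARPAbet-like).
phonetic_lexicon = {
    "a": ["EY", "AX"],              # the word "A" has two base forms
    "route": ["R UW T", "R AW T"],  # multiple valid pronunciations
}

def base_forms(word):
    """Return all base forms for a word, or an empty list if unknown."""
    return phonetic_lexicon.get(word.lower(), [])
```

A recognizer consulting such a table would consider every listed base form as a valid pronunciation of the word during decoding.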
In real-world applications, there are instances in which speech applications that are trained for processing a native language are faced with the task of processing non-native speech or textual data (foreign words). In an ASR system trained on a native language, decoding accuracy can be significantly degraded when native speakers utter foreign words or non-native pronunciations. For example, in a speech-based navigation application having a front-end ASR system trained on native English language, a user may utter a spoken query such as “What is the quickest route to the Champs Elysees”, where “Champs Elysees” are foreign (non-native) words relative to English. Similarly, the TTS system of the navigation application may need to recognize that “Champs Elysees” comprises foreign words relative to English when producing a synthesized speech output such as “Turn Right onto the Champs Elysees”.
A conventional method for generating pronunciations for non-native words is to use a phonetiser adapted for the base native language. In general, a phonetiser system operates to convert text to a corresponding phonetic representation of such text (phonetic spellings). However, when directly converting non-native text to phonetic representations in a native language, non-native pronunciations may not be adequately captured, thereby resulting in degraded system performance. While this approach may be sufficient for a speaker with no knowledge of the foreign language, such an approach will certainly be sub-optimal if the speaker has any knowledge of the foreign language, or even just knows how to pronounce the foreign words. For example, in the example navigation phrase above, the English spelling-to-phoneme system may produce the following for “Champs Elysees”:
champs-eh-lie-zeez ~ CH AE M P S EH L AY Z IY Z
On the other hand, a person with some knowledge of French, or the proper French pronunciation of the place name, would utter, for example:
shanz-eh-lee-zay ~ SH OH NG Z AX L IY Z EY
In view of the disparity in the above phoneme strings, it is unlikely that the latter utterance would be matched to the phoneme string: CH AE M P S EH L AY Z IY Z
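The disparity between the two phoneme strings above can be quantified with a token-level edit distance. The sketch below is a generic Levenshtein computation over phoneme tokens, not part of any particular ASR system; the function name is illustrative.

```python
def phoneme_edit_distance(s1, s2):
    """Levenshtein distance computed over phoneme tokens, not characters."""
    a, b = s1.split(), s2.split()
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution/match
        prev = cur
    return prev[-1]

native = "CH AE M P S EH L AY Z IY Z"   # English spelling-to-phoneme output
french = "SH OH NG Z AX L IY Z EY"      # French-influenced pronunciation
```

The two strings share only a handful of phonemes in order, so the distance between them is large relative to their lengths, which is why a decoder matching against only the first string is unlikely to accept the second utterance.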
Similarly, numbers are pronounced quite differently in different languages. For example, the number 69 is pronounced differently in the following languages:
English—“sixty-nine”
French—“soixante-neuf”
German—“neun-und-sechzig”
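The structural difference in the list above (English and French place tens before units, whereas German reverses them) can be illustrated with a toy rendering function. The tables below cover only the digits needed for 69 and are purely illustrative; they are not drawn from any real number-spelling library.

```python
# Toy number-to-words sketch for a two-digit number, illustrating that
# languages differ not only in vocabulary but also in word order:
# English says tens-then-units, German says units-then-tens.
EN_TENS, EN_UNITS = {6: "sixty"}, {9: "nine"}
DE_TENS, DE_UNITS = {6: "sechzig"}, {9: "neun"}

def say_en(n):
    return f"{EN_TENS[n // 10]}-{EN_UNITS[n % 10]}"

def say_de(n):
    # German reverses the order: "nine-and-sixty", written as one word.
    return f"{DE_UNITS[n % 10]}und{DE_TENS[n // 10]}"
```

For 69, the English rendering is “sixty-nine” while the German rendering is “neunundsechzig”, so a TTS or ASR system that assumes the native ordering will model the wrong phoneme sequence for the other language.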
The above examples illustrate that there can be a significant amount of mismatch if the wrong pronunciation is modeled. Conventional solutions to address this problem are not desirable. For instance, running parallel speech recognizers, each capable of performing the ASR task for a particular language, has been suggested, but this approach incurs significant CPU and memory resource overhead and is less capable of handling mixed-language utterances such as those shown above.