Speech-enabled applications using speaker-independent speech recognition technology are characterized by a vocabulary that relies on a language-dependent phonetic description. Typically, the vocabulary uses language-specific acoustic models for the phonetic symbols. The applications therefore utilize a native language phonetic inventory or a foreign language phonetic inventory to recognize and transcribe the vocabulary to be recognized.
Current speech recognition systems support only individual languages. If words of another language need to be recognized, acoustic models associated with that language must be used. An acoustic model is a set of acoustic parameters, generated during training, that represents the pronunciation(s) of a word or sub-word unit and is used for speech recognition. For most speech recognition systems, these models are built, or trained, by extracting statistical information from a large body of recorded speech. To provide speech recognition in a given language, one typically defines a set of symbols, known as phonemes, which represent all sounds of that language (phonetic symbols and phonemes will be used interchangeably hereinafter). For multi-language, speaker-independent speech recognition, a common set of multilingual phonemes and acoustic models are needed for each respective language under consideration. When supporting a wide range of languages, a large database covering all languages must be provided.
A large quantity of spoken samples in each language is typically recorded to permit extraction of an acoustic model for each of the phonemes for each particular language. Usually, a number of native speakers—i.e. people having the language as their mother tongue—are asked to record a number of utterances. A set of the recordings of the utterances is referred to as a speech database. The recording of such a speech database for every language one wants to support is very costly and time consuming.
Name dialing in a mobile phone environment may serve as an example for such a scenario. In a mobile phone, words to be added to the active vocabulary may stem from different languages as entries in the personal phonebook, such as names from different countries. To address such phonebook entries using a voice-enabled process, the names need to be processed properly to derive acoustic models for them. For recognition, acoustic models are utilized to represent the pronunciation of words, phonemes, etc. Typically, such acoustic models are language specific, and a lexicon lookup or a language-specific grapheme-to-phoneme (G2P) algorithm yields a phonetic description, which is used by a language-specific phoneme-based recognizer. The main problem for a multi-language approach in obtaining good recognition results for foreign words is their phonetic realization, i.e., the pronunciation of the foreign words as uttered by a non-native speaker.
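The lexicon lookup with a G2P fallback described above may be illustrated by the following minimal Python sketch. The lexicon contents, the letter-to-phoneme rules, and the names `LEXICON`, `G2P_RULES`, `g2p`, and `transcribe` are hypothetical and chosen only for illustration; a practical system would rely on a full pronunciation lexicon and a trained G2P model for each language.

```python
# Hypothetical per-language pronunciation lexica (illustrative entries only).
LEXICON = {
    "en": {"email": ["ee", "m", "ey", "l"]},
    "de": {"haus": ["h", "au", "s"]},
}

# Toy language-specific letter-to-phoneme rules (assumed, far from complete).
G2P_RULES = {
    "en": {"sh": ["sh"], "ee": ["ee"]},
    "de": {"sch": ["sh"], "ch": ["x"], "ei": ["ai"], "au": ["au"]},
}


def g2p(word, lang):
    """Rule-based G2P: longest matching letter group first, else the letter itself."""
    rules = G2P_RULES.get(lang, {})
    phones = []
    i = 0
    while i < len(word):
        for length in (3, 2):
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in rules:
                phones.extend(rules[chunk])
                i += length
                break
        else:
            # No rule matched: fall back to using the letter as the phonetic symbol.
            phones.append(word[i])
            i += 1
    return phones


def transcribe(word, lang):
    """Return a phonetic description: lexicon lookup first, G2P as fallback."""
    word = word.lower()
    entry = LEXICON.get(lang, {}).get(word)
    return list(entry) if entry is not None else g2p(word, lang)


print(transcribe("email", "en"))   # found in the English lexicon
print(transcribe("Schein", "de"))  # not in the lexicon, produced by the toy G2P rules
```

Because both the lexicon and the rules in this sketch are selected by language, a phonebook entry transcribed under the wrong language receives a description that does not match how the name is actually pronounced.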
Foreign words typically cannot be described accurately in a phonetic inventory of the speaker's native language (NL), since sounds for specific phonetic units generally do not correspond exactly in the NL. Nevertheless, the articulation capability of the speaker is reflected in the sounds of the NL represented by the NL phonetic inventory. A phonetic inventory is a (language-specific) set of phonetic symbols that does not include sound recordings. In some cases, the foreign words cannot be completely described in the NL due to missing sounds (e.g., the English word email [ee m ey l] is not describable by the German phoneme set due to the missing sound [ey] in the German language). Moreover, NL acoustic models for foreign language (FL) words are inaccurate and result in low recognition performance even if uttered by FL speakers. Additionally, even when the foreign word can be transcribed within the NL phoneme inventory, problems may arise from the “phono-tactics” of FL words, which do not correspond to the phono-tactics of NL words. Phono-tactics is the set of allowed arrangements or sequences of phonemes, and thus of speech sounds, in a given language. A word beginning with the consonant cluster (zv), for example, violates the phono-tactics of English, but not of Russian. In particular, when context-dependent acoustic models (like triphones) are utilized for the recognition, different phono-tactics may result in missing triphones and thus in a less accurate modeling of words and a reduced recognition performance.
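The two failure modes described above, missing sounds and missing triphones, may be illustrated by the following minimal Python sketch. The inventories, the word transcriptions, the boundary symbol `sil`, and all function names are assumptions made for illustration only and do not represent actual language resources.

```python
def missing_phonemes(transcription, inventory):
    """Phonemes of a word that do not exist in the given phoneme inventory."""
    return [p for p in transcription if p not in inventory]


def triphones(transcription):
    """Context-dependent units (left, center, right), padded with a boundary symbol."""
    padded = ["sil"] + list(transcription) + ["sil"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def unseen_triphones(transcription, trained):
    """Triphones of a word for which no trained model is available."""
    return [t for t in triphones(transcription) if t not in trained]


# Case 1: a sound is missing from the NL inventory (toy German inventory without "ey").
TOY_GERMAN_INVENTORY = {"ee", "m", "l", "a", "n", "sh", "ai"}
EMAIL = ["ee", "m", "ey", "l"]
print(missing_phonemes(EMAIL, TOY_GERMAN_INVENTORY))

# Case 2: every phoneme exists, but the phoneme sequence violates NL phono-tactics.
# Toy English training words (assumed transcriptions) and the triphones they cover.
TOY_ENGLISH_WORDS = [["z", "uw"], ["v", "ae", "n"], ["o", "n"]]
TOY_ENGLISH_INVENTORY = {p for w in TOY_ENGLISH_WORDS for p in w}
TRAINED = {t for w in TOY_ENGLISH_WORDS for t in triphones(w)}

# Made-up foreign word starting with the cluster z v.
FOREIGN_WORD = ["z", "v", "o", "n"]
print(missing_phonemes(FOREIGN_WORD, TOY_ENGLISH_INVENTORY))
print(unseen_triphones(FOREIGN_WORD, TRAINED))
```

In the second case the word is fully transcribable with the toy inventory, yet every triphone that spans the z v cluster lacks a trained model, which corresponds to the less accurate modeling noted above.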
Basically, even applying a multi-language recognition engine that supports several (native) languages to recognize application words spoken by non-native speakers will not yield the best results. The non-native speaker colors the foreign words with the speaker's mother tongue, so a description of the foreign words with the FL phoneme inventory and FL acoustic models is usually not accurate enough and will not necessarily give the best recognition results. The best solution, training on these FL words with non-native speech of many speakers, is usually not feasible due to the very limited availability of large, appropriate training databases, i.e., speech recordings of non-native pronunciations of FL words from the specific target language(s), collected from many NL speakers. Typically, the speaker is not at all familiar with the FL inventory, and describing newly added foreign words with NL phonemes is a serious undertaking.
What is needed is a speech recognition system that accepts input from speakers speaking words that are foreign to the speakers' native language and that is capable of utilizing the speaker's native phoneme inventory to describe the pronunciation of foreign language words, such that utterances of these words, from native and non-native speakers, are recognized with high accuracy.