1. Field of the Invention
The invention relates to a novel approach for generating multilingual text-to-phoneme mappings for use in multilingual speech recognition systems.
2. Description of the Prior Art
Speaker independent command word recognition and name dialing on portable devices such as mobile phones and personal digital assistants have attracted significant interest recently. A speech recognition system provides an alternative to keypad input for limited-size portable products. The speaker independence makes the system particularly attractive from a user point of view compared to speaker dependent systems. For large vocabularies, user training of a speaker dependent recognizer is likely to become too tedious to be useful.
How to build acoustic models that integrate multiple languages in automatic speech recognition applications is described by F. Palou, P. Bravetti, O. Emem, V. Fischer, and E. Janke, in the publication “Towards a Common Phone Alphabet for Multilingual Speech Recognition”, In Proceedings of ICSLP, pages 1—1, 2000.
An architecture for embedded multilingual speech recognition systems is proposed by O. Viiki, I. Kiss, and J. Tian, in the publication “Speaker- and Language-independent Speech Recognition in Mobile Communication Systems”, in Proceedings of ICASSP, 2001.
Use of neural networks for text-to-phoneme (TTP) mapping, giving estimates of the posterior probabilities of the different phonemes for each letter input, is taught by K. Jensen and S. Riis, in the publication “Self-Organizing Letter Code-Book for Text-To-Phoneme Neural Network Model”, published in Proceedings of ICSLP, 2000.
A phoneme based speaker independent system is ready to use “out-of-the-box” and does not require any training session of the speaker. Furthermore, if the phoneme based recognizer is combined with a text-to-phoneme (TTP) mapping module for generating phoneme pronunciations online from written text, the user may define specific vocabularies as required in, e.g., a name dialing application. Naturally, speaker and vocabulary independence come at a cost, namely increased complexity for real-time decoding, increased requirements for model storage, and usually also a slight drop in recognition performance compared to speaker dependent systems. Furthermore, speaker independent systems typically contain a number of language dependent modules, e.g. language dependent acoustic phoneme models, TTP modules etc. For portable devices, the support of several languages may be prohibited by the limited memory available in such devices, as separate modules need to be stored for each language.
Recently, systems based on multilingual acoustic phoneme models have emerged (see the publications by Palou et al. and Viiki et al. mentioned above). These systems are designed to handle several different languages simultaneously and are based on the observation that many phonemes are shared among different languages. The basic idea in multilingual acoustic modeling is to estimate the parameters of a particular phoneme model using speech data from all supported languages that include this phoneme. Multilingual speech recognition is very attractive as it makes a particular speech recognition application usable by a much wider audience. In addition, the logistic needs are reduced when making worldwide products. Furthermore, sharing of phoneme models across languages can significantly reduce memory requirements compared to using separate models for each language. Multilingual recognizers are thus very attractive for portable platforms with limited resources.
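The pooling idea described above can be illustrated with a minimal Python sketch. It is not taken from the cited publications; the language codes, phoneme labels and frame identifiers are hypothetical toy data, chosen only to show how sharing phoneme models across languages reduces the number of models to be stored.

```python
from collections import defaultdict

# Hypothetical toy training data: (language, phoneme, frame id) tuples.
SAMPLES = [
    ("en", "s", 1), ("fi", "s", 2), ("de", "s", 3),
    ("en", "th", 4), ("fi", "y", 5), ("de", "y", 6),
]

def pool_by_phoneme(samples):
    """Group training frames by phoneme, ignoring the source language."""
    pooled = defaultdict(list)
    for lang, phoneme, frame in samples:
        pooled[phoneme].append((lang, frame))
    return dict(pooled)

def model_counts(samples):
    """Return (models with one set per language, models when shared)."""
    separate = len({(lang, phoneme) for lang, phoneme, _ in samples})
    shared = len({phoneme for _, phoneme, _ in samples})
    return separate, shared

pooled = pool_by_phoneme(SAMPLES)
separate, shared = model_counts(SAMPLES)
print(f"language dependent models: {separate}, shared models: {shared}")
```

In this toy data the phoneme "s" is trained on frames from all three languages, whereas "th" is trained on English frames only.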
Even though multilingual acoustic modeling has proven efficient, user definable vocabularies typically still require language dependent TTP modules for each supported language. Prior to running the language dependent TTP module it is furthermore necessary to first identify the language ID of each vocabulary entry.
Language Dependent Text-To-Phoneme (TTP) Mapping
For applications like speaker independent name dialing on mobile phones, the vocabulary entries are typically names in the phonebook database 21 that may be changed at any time. Thus, for a multilingual speaker independent name dialer 22 to work with language dependent TTP, a language identification module (LID) 30 is needed. An example of a multilingual speech recognition system according to prior art using a LID module 30 is shown in FIG. 3. FIG. 3 shows how the LID module 30 selects a language dependent TTP module 31.1–31.n that is used for generating the pronunciation by means of a pronunciation lexicon module 32 for a multilingual recognizer 33 based on multilingual acoustic phoneme models.
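The data flow of FIG. 3 may be sketched as follows. This is an illustrative Python outline only: the function names, the letter-to-phoneme tables and the toy language rule are all hypothetical stand-ins, and a real LID module or TTP module would be considerably more elaborate.

```python
from typing import Callable, Dict, List

def make_table_ttp(table: Dict[str, str]) -> Callable[[str], List[str]]:
    """Build a toy TTP module from a letter-to-phoneme table (hypothetical)."""
    def ttp(word: str) -> List[str]:
        # Letters without a table entry are passed through unchanged.
        return [table.get(ch, ch) for ch in word.lower()]
    return ttp

# One toy TTP module per language, standing in for modules 31.1-31.n.
TTP_MODULES = {
    "fi": make_table_ttp({"j": "j", "y": "y"}),
    "en": make_table_ttp({"j": "dZ", "y": "j"}),
}

def toy_lid(word: str) -> str:
    """Stand-in for LID module 30: a made-up rule, not a real classifier."""
    return "fi" if word.lower().endswith("nen") else "en"

def build_lexicon(phonebook, lid, ttp_modules):
    """For each entry, let the LID pick the TTP module that transcribes it."""
    lexicon = {}
    for name in phonebook:
        language = lid(name)
        lexicon[name] = ttp_modules[language](name)
    return lexicon

lexicon = build_lexicon(["Jarvinen", "Jay"], toy_lid, TTP_MODULES)
for name, phonemes in lexicon.items():
    print(name, phonemes)
```

The sketch makes the dependency explicit: a wrong decision by the LID stage selects the wrong TTP module, and the resulting pronunciation is wrong regardless of the quality of that module.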
The LID module 30 may be a statistical model predicting the language ID of each entry in the vocabulary, a deterministic module that sets the language ID of each entry based on application specific knowledge, a module that simply requires the user to set the language ID manually or a combination of these. In the most general case, a priori knowledge about the language ID is not available and manual language identification by the user may not be desirable. In that case, language identification must be based on a statistical LID module that predicts the language ID from the written text.
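One way such a combination could be arranged is as an ordered fallback chain, sketched below in Python. The ordering (manual setting first, then application knowledge, then a statistical guess), the use of a phone number prefix as application specific knowledge, and all names in the code are assumptions made for illustration; the statistical stage is a trivial placeholder rather than a trained model.

```python
from typing import Callable, Dict, List, Optional

Entry = Dict[str, str]
LidStage = Callable[[Entry], Optional[str]]

def manual_stage(entry: Entry) -> Optional[str]:
    """Use the language the user set for this entry, if any."""
    return entry.get("user_language")

# Hypothetical application knowledge: country calling code -> language.
PREFIX_TO_LANGUAGE = {"+358": "fi", "+81": "ja", "+44": "en"}

def deterministic_stage(entry: Entry) -> Optional[str]:
    """Derive the language from the stored phone number prefix, if known."""
    number = entry.get("number", "")
    for prefix, language in PREFIX_TO_LANGUAGE.items():
        if number.startswith(prefix):
            return language
    return None

def statistical_stage(entry: Entry) -> Optional[str]:
    """Placeholder for a statistical model predicting from the written text."""
    return "fi" if entry["name"].lower().endswith("nen") else "en"

def identify_language(entry: Entry, stages: List[LidStage]) -> str:
    """Return the answer of the first stage that commits to a language."""
    for stage in stages:
        language = stage(entry)
        if language is not None:
            return language
    raise ValueError("no stage could identify the language")

STAGES = [manual_stage, deterministic_stage, statistical_stage]
print(identify_language({"name": "Sato", "number": "+81312345678"}, STAGES))
```

When neither a manual setting nor application knowledge is available, the chain falls through to the statistical stage, which is the general case discussed above.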
Depending on the application, the TTP module 31.1–31.n may be a statistical model, a rule based model, a model based on a lookup table that contains all possible words, or any combination of these. The lookup table approach will typically not be possible for name dialing applications on portable devices with limited memory resources, due to the large number of possible names.
In most applications based on user defined vocabularies, a statistical LID module 30 has very limited text data for deciding the language ID of an entry. For a short name like “Peter”, for example, only five letters are available for language identification. Furthermore, many names are not unique to a single language but rather are used in a large number of languages with different pronunciations. In addition to this, a speaker may pronounce a foreign/non-native name with a significant accent, i.e., the pronunciation of the name is actually a mixture of the pronunciation corresponding to the language from which the name originates and the native language of the speaker.
This implies that the combination of language dependent TTP modules 31.1–31.n, a statistical LID module 30 and multilingual acoustic phoneme models is likely to give poor overall performance. Furthermore, if several languages are to be supported in a portable device, the size of the LID and TTP modules may have to be severely limited in order to fit into the low memory resources of the device. For “irregular” languages, like English, high accuracy TTP modules may take up as much as 40–300 kb of memory, whereas TTP modules for rule based “regular” languages like Japanese and Finnish typically require less than 1 kb.
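The effect of these figures on total storage can be illustrated with a short Python calculation. It assumes, purely for illustration, that every irregular language needs a TTP module within the quoted 40–300 kb range and every regular language needs at most 1 kb; the language counts in the example are hypothetical, and the LID module is not included.

```python
# Size ranges quoted in the text, in kb: (lower bound, upper bound).
IRREGULAR_KB = (40, 300)
REGULAR_KB = (0, 1)  # "less than 1 kb", so 1 is used as the upper bound

def ttp_storage_range(n_irregular: int, n_regular: int):
    """Return (low, high) total TTP storage in kb for the given language mix."""
    low = n_irregular * IRREGULAR_KB[0] + n_regular * REGULAR_KB[0]
    high = n_irregular * IRREGULAR_KB[1] + n_regular * REGULAR_KB[1]
    return low, high

# Hypothetical device supporting five irregular and two regular languages.
low, high = ttp_storage_range(5, 2)
print(f"TTP storage: between {low} and {high} kb")
```

Under these assumptions the irregular languages dominate the total, which is why the number of supported languages of that kind is the limiting factor on a memory constrained device.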