1. Field of the Invention
The present invention relates generally to speech recognition methods and systems. More specifically, the present invention relates to a method, device, and computer program product for multi-lingual speech recognition.
2. Description of the Related Art
This section is intended to provide a background or context. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the claims in this application and is not admitted to be prior art by inclusion in this section.
Automatic Speech Recognition (ASR) technologies have been adopted in mobile phones and other hand-held communication devices. A speaker-trained name dialer is probably one of the most widely distributed ASR applications. In a speaker-trained name dialer, the user has to train the models used for recognition; such an application is known as a speaker-dependent name dialing (SDND) application. Applications that rely on more advanced technology do not require the user to train any models for recognition. Instead, the recognition models are automatically generated based on the orthography of the multi-lingual words. Pronunciation modeling based on the orthography of the multi-lingual words, also called text-to-phoneme mapping (TTP), is used, for example, in the Multilingual Speaker-Independent Name Dialing (ML-SIND) system, as disclosed in Viikki et al. (“Speaker- and Language-Independent Speech Recognition in Mobile Communication Systems”, in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, Utah, USA 2002). Due to globalization, the international nature of the markets, and future applications in mobile phones, the demand for multilingual speech recognition systems is growing rapidly.
Automatic language identification (LID) is an integral part of multi-lingual systems that use dynamic vocabularies. The LID module detects the language of each vocabulary item. Once the language has been determined, the language-dependent pronunciation model is applied to obtain the phoneme sequence associated with the written form of the vocabulary item. Finally, the recognition model for each vocabulary item is constructed by concatenating the multilingual acoustic models. Using these basic modules, the recognizer can, in principle, automatically cope with multilingual vocabulary items without any assistance from the user. Automatic speech recognition and text-to-speech systems can usually find the pronunciation of a given text in its own language. However, conventional systems are generally unable to find the pronunciation of that text in any other language supported by the system. Such other languages may be considered mismatched languages. Mismatched languages commonly arise for several reasons, e.g., LID errors, non-native vocabulary items, and N-best or multiple-pronunciation schemes. It is not trivial to find the pronunciation of a given text in a mismatched language, because different languages have different alphabet sets and different pronunciation rules. For example, one cannot directly find the English pronunciation of the Russian text “” because English and Russian use different alphabet sets.
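By way of illustration only, the processing chain described above (language identification, language-dependent text-to-phoneme mapping, and concatenation of acoustic models) may be sketched as follows. The sketch is not taken from any cited system: the alphabet sets, the letter-to-phoneme tables, the phoneme labels, and all function names are hypothetical and deliberately minimal. It also shows why a mismatched language is problematic: the letters of a Russian vocabulary item are simply absent from the English pronunciation table.

```python
from typing import Dict, List, Set

# Hypothetical alphabet sets used by a toy language identifier (assumption).
ALPHABETS: Dict[str, Set[str]] = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

# Hypothetical letter-to-phoneme tables, one phoneme label per letter.
# Real TTP models are far richer; only two letters per language are covered.
TTP_TABLES: Dict[str, Dict[str, str]] = {
    "en": {"a": "ae", "n": "n"},
    "ru": {"а": "a", "н": "n"},
}


def identify_language(word: str) -> str:
    """Toy LID: choose the language whose alphabet covers the most letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    return max(ALPHABETS, key=lambda lang: sum(c in ALPHABETS[lang] for c in letters))


def text_to_phonemes(word: str, lang: str) -> List[str]:
    """Toy TTP: map each letter to a phoneme using the table for `lang`."""
    table = TTP_TABLES[lang]
    letters = [c for c in word.lower() if c.isalpha()]
    missing = [c for c in letters if c not in table]
    if missing:
        # Mismatched language: the letters are outside this language's table.
        raise ValueError(f"no {lang} pronunciation for letters {missing}")
    return [table[c] for c in letters]


def build_recognition_model(phonemes: List[str]) -> List[str]:
    """Concatenate placeholder acoustic-model identifiers, one per phoneme."""
    return ["hmm:" + p for p in phonemes]


def model_for_item(word: str) -> List[str]:
    """Full chain: LID, then language-dependent TTP, then concatenation."""
    lang = identify_language(word)
    return build_recognition_model(text_to_phonemes(word, lang))
```

In this sketch, calling `text_to_phonemes` with a Cyrillic item and the language code `"en"` raises an error, since none of the Cyrillic letters appear in the hypothetical English table.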
There is a need to handle multi-lingual textual input in multi-lingual automatic speech recognition systems. Further, there is a need for multi-lingual automatic speech recognition processing in which TTP can be applied to find the pronunciation of textual input in any supported language.