Conventional speech recognition systems are mostly based on a single language. For multiple- or mixed-language speech recognition, it is common either to establish a separate speech recognition model for a second language (e.g., English) in addition to the first language (e.g., Chinese), or to establish a correspondence between the speech-units of the first language and the speech-units of the second language. The speech recognition model for a single language is then adopted to perform the multiple- or mixed-language speech recognition. However, such speech recognition still has challenges to overcome.
Take the case of Chinese and English bilingual mixed speech recognition, where Chinese is the native language (i.e., the first language) and English is the second language. Training a speech recognition model normally requires a significant amount of training material, and such material is most easily collected from recordings of native speakers, i.e., English material recorded from American speakers and Chinese material recorded from Chinese speakers. However, when the respective speech recognition models are trained individually and the two models are then put together for bilingual mixed speech recognition, the recognition rate is unsatisfactory because of the accent mismatch, whether for the bilingual speech of Chinese speakers or for that of American speakers, so the bilingual mixed speech recognition system cannot be put into practice. Besides, it is difficult to collect and check English material recorded by native Chinese speakers, because the pronunciation of the same English word may differ significantly among different native Chinese speakers. Consequently, when such poor-quality material is used for English speech modeling, it is difficult to obtain a Chinese-accented-English speech recognition system that performs as well as the Chinese speech recognition system built from native Chinese speech material. Therefore, the resources and effort required for multiple- or mixed-language speech recognition are much higher than those required for a single language.
It should be noted that the native language is the main language of local communication, and other languages are normally used less frequently, except for the so-called non-native words or foreign words (also referred to as “loan blends”). Moreover, the non-native words or foreign words are updated frequently. It is also noted that local users normally speak with a local accent. According to the conventional methods for multiple- or mixed-language modeling, a large amount of training material for other languages in the local accent is required. For example, Chinese-accented English material is required for the modeling of Chinese-accented English speech recognition. However, such material is not easy to collect.
Therefore, how to establish a speech recognition system capable of recognizing non-native words, or even of recognizing mixed-lingual speech in native and non-native languages, without excessively consuming resources, thereby enabling broad application of such a speech recognition system, is an important issue for researchers in this field.