End-to-end automatic speech recognition (ASR) has recently proven its effectiveness by reaching the state-of-the-art performance obtained by conventional hybrid ASR systems while surpassing them in terms of ease of development. Conventional ASR systems require language-dependent resources such as pronunciation dictionaries and word segmentation, which are incorporated into models with phonemes as an intermediate representation. These resources are developed by hand and so they carry two disadvantages: first, they may be error-prone or otherwise sub-optimal, and second, they greatly increase the effort required to develop ASR systems, especially for new languages. The use of language-dependent resources thus particularly complicates the development of multi-lingual recognition systems. End-to-end ASR systems, in contrast, directly convert input speech feature sequences to output label sequences (mainly sequences of characters or tokens composed of n-gram characters in embodiments of the present invention) without any explicit intermediate representation of phonetic/linguistic constructs such as phonemes or words. Their main advantage is that they avoid the need for hand-made language-dependent resources.
There have been several prior studies on multilingual/language-independent ASR. In deep neural network (DNN)-based multilingual systems, the DNN is used to compute language-independent bottleneck features. Because the DNN supplies only these features, such systems still require language-dependent back-end components, such as pronunciation dictionaries and language models. In addition, the uttered language must be predicted in order to cascade the language-independent and language-dependent modules.