Automatic speech recognition is currently a mature set of technologies that have been widely deployed, resulting in great success in interface applications such as voice search. However, it is not easy to build a speech recognition system that achieves high recognition accuracy. One problem is that it requires deep linguistic knowledge on the target language that the system accepts. For example, a set of phonemes, a vocabulary, and a pronunciation lexicon are indispensable for building such a system. The phoneme set needs to be carefully defined by linguists of the language. The pronunciation lexicon needs to be created manually by assigning one or more phoneme sequences to each word in the vocabulary including over 100 thousand words. Moreover, some languages do not explicitly have a word boundary and therefore we may need tokenization to create the vocabulary from a text corpus. Consequently, it is quite difficult for non-experts to develop speech recognition systems especially for minor languages. The other problem is that a speech recognition system is factorized into several modules including acoustic, lexicon, and language models, which are optimized separately. This architecture may result in local optima, although each model is trained to match the other models.
End-to-end speech recognition has the goal of simplifying the conventional architecture into a single neural network architecture within a deep learning framework. To address or solve these problems, various techniques have been discussed in some literature. State-of-the-art end-to-end speech recognition systems are designed to predict a character sequence for a given speech input, because predicting word sequences directly from speech without pronunciation lexicon is much more difficult than predicting character sequences. However, the character-based prediction generally under-performs relative to word-based prediction, because of the difficulty of modeling linguistic constraints across long sequences of characters. If we have much more speech data with the corresponding transcripts, we could train a better neural network that predicts word sequences. However, it is very costly to collect such transcribed speech data and train the network with the large data set. Therefore, it is not easy to incorporate word-level prediction in the end-to-end speech recognition to improve recognition accuracy.