Speech recognition processing is used to enable recognition and translation of spoken language into text. Applications and services may utilize a speech recognition result to enhance processing and productivity. A goal of speech recognition processing (e.g., automatic speech recognition (ASR)) is to generate a word sequence from speech acoustics. For this purpose, the word is the most natural output unit for network modeling. However, the accuracy of determining words from speech acoustics is poor unless the amount of training data is very large, partly because of a high out-of-vocabulary (OOV) rate. Working solely with word-based modeling therefore presents several challenges for speech recognition processing. A first technical issue relates to the detection and processing of OOV tokens. Only the frequent words in a training set are used as output targets; the remaining words are simply tagged as OOV tokens. Words tagged as OOV cannot be modeled and therefore cannot be recognized during decoding. The result is inaccurate and incomplete output when a speech signal is decoded, which is apparent when the recognizer fails to identify (or misses) words spoken by a user. Another technical issue of word-based modeling is that such models are not equipped to handle hot-words, which emerge and become popular after the network has been built. For instance, specific words or phrases that a trained word model does not initially recognize may enter common usage. Satisfactory performance cannot be obtained simply by adding output nodes for the specified hot-words to the network without retraining it.
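
The OOV tagging described above can be illustrated with a minimal sketch. It assumes a simple frequency cutoff (keep the `top_k` most frequent training words) and a placeholder token `<OOV>`. The function names, the cutoff rule, and the toy transcripts are hypothetical and are used only for illustration. They are not taken from any particular ASR system.

```python
from collections import Counter

OOV = "<OOV>"  # hypothetical placeholder token for out-of-vocabulary words


def build_vocab(transcripts, top_k):
    """Keep only the top_k most frequent words as output targets (assumed cutoff rule)."""
    counts = Counter(word for line in transcripts for word in line.split())
    return {word for word, _ in counts.most_common(top_k)}


def tag_oov(transcript, vocab):
    """Replace every word outside the vocabulary with the OOV token."""
    return [word if word in vocab else OOV for word in transcript.split()]


def oov_rate(transcripts, vocab):
    """Fraction of word occurrences that fall outside the vocabulary."""
    words = [word for line in transcripts for word in line.split()]
    if not words:
        return 0.0
    return sum(word not in vocab for word in words) / len(words)


# Toy training transcripts (illustrative data only).
train = ["play the song", "play the next song", "stop the song"]
vocab = build_vocab(train, top_k=3)

# Infrequent training words lose their identity as targets.
print(tag_oov("play the next song", vocab))

# A hot-word that appears after training has no output node at all.
print(tag_oov("play the podcast", vocab))

print(oov_rate(train, vocab))
```

In this sketch, both an infrequent training word ("next") and a word that was never seen in training ("podcast") collapse to the same OOV token, so a word-based model built on this vocabulary has no output unit with which to recognize either of them.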