Automated speech recognition (ASR) systems assign probabilities to sequences of speech or text known as n-grams. An ASR system transcribes an utterance into a series of computer-readable sounds, which are then compared against a dictionary of words in a given language. The n-gram probabilities can be used to help select the most likely transcription of the utterance.
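As an illustration of how n-gram probabilities can help select among candidate transcriptions, the following is a minimal sketch rather than a description of any particular ASR system. It assumes a bigram (n = 2) model with add-one smoothing trained on a toy corpus. The corpus, the candidate strings, and the function names `train_bigram_counts`, `log_prob`, and `pick_transcription` are all hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count bigrams and their left-hand contexts in a list of sentences."""
    context_counts = Counter()  # how often each token appears as a bigram's first word
    bigram_counts = Counter()   # how often each (previous, current) pair appears
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocab)


def log_prob(sentence, context_counts, bigram_counts, vocab_size):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


def pick_transcription(candidates, context_counts, bigram_counts, vocab_size):
    """Return the candidate transcription with the highest model score."""
    return max(
        candidates,
        key=lambda c: log_prob(c, context_counts, bigram_counts, vocab_size),
    )


# Toy training text (an assumption for illustration, not real ASR data).
corpus = [
    "recognize speech with a computer",
    "it is easy to recognize speech",
    "the beach was nice",
]
context_counts, bigram_counts, vocab_size = train_bigram_counts(corpus)

# Two acoustically similar candidate transcriptions of one utterance.
candidates = ["wreck a nice beach", "recognize speech"]
best = pick_transcription(candidates, context_counts, bigram_counts, vocab_size)
print(best)
```

With this toy corpus, the sketch selects "recognize speech" because its bigrams were observed in the training text, while those of the other candidate were not. A production language model would typically use longer n-grams, a far larger corpus, and a more careful smoothing scheme, and would combine the language model score with an acoustic model score.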
Current ASR systems are complex and include multiple components such as acoustic models, language models, lexicons, and knowledge sources from search infrastructures such as knowledge graphs, and natural language processing (NLP) annotations with semantic and morphological information. As a result, improving the performance of ASR systems and correcting common transcription errors can involve time-consuming training cycles that are difficult to implement.