Automatic speech recognition (ASR) has evolved over the years and is expected to become a primary form of input for computing and entertainment devices. Because speech recognition demands a large amount of computing power and draws heavily on the battery of a mobile device, most current speech-processing solutions run in a cloud environment, which also allows a higher accuracy rate for speech-to-text conversion.
ASR involves several steps and components. The most important ASR components are the language model and the acoustic model. The language model describes the language, or the grammar of the language, that is being converted to text; it includes a text file containing the words that the ASR can recognize. The acoustic model describes how each word is pronounced. Typically, both models are large because they must cover the full range of the language across different speakers (and their voice acoustics). A larger model usually covers more scenarios and reduces the error rate.
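As a rough illustration of how the two models work together, the following minimal Python sketch picks the word whose combined acoustic and language score is highest. Everything in it is an assumption made for illustration: the three-word lexicon, the bigram probabilities, the phoneme symbols, and the function names (acoustic_log_score, language_log_score, recognize) are hypothetical, and the phoneme-matching score is a toy stand-in for a real acoustic model rather than the method of any particular ASR system.

```python
import math

# Hypothetical pronunciation lexicon (toy stand-in for the acoustic model):
# each recognizable word maps to the phoneme sequence it is pronounced with.
LEXICON = {
    "write": ["R", "AY", "T"],
    "right": ["R", "AY", "T"],
    "red": ["R", "EH", "D"],
}

# Hypothetical bigram language model: P(word | previous word).
BIGRAM = {
    ("turn", "right"): 0.60,
    ("turn", "red"): 0.35,
    ("turn", "write"): 0.05,
}


def acoustic_log_score(observed, word):
    """Toy acoustic score: smoothed log fraction of matching phonemes."""
    expected = LEXICON[word]
    length = max(len(observed), len(expected))
    matches = sum(1 for o, e in zip(observed, expected) if o == e)
    return math.log((matches + 0.1) / (length + 0.1))


def language_log_score(previous, word):
    """Log probability of `word` following `previous`; unseen pairs get a small floor."""
    return math.log(BIGRAM.get((previous, word), 1e-6))


def recognize(previous, observed):
    """Return the lexicon word with the highest combined log score."""
    return max(
        LEXICON,
        key=lambda w: acoustic_log_score(observed, w) + language_log_score(previous, w),
    )


if __name__ == "__main__":
    # "write" and "right" sound identical, so only the language model can separate them.
    print(recognize("turn", ["R", "AY", "T"]))
    print(recognize("turn", ["R", "EH", "D"]))
```

In this sketch, "write" and "right" share one pronunciation, so the acoustic score alone cannot tell them apart and the language model decides. Covering more words, more pronunciations, and more word sequences means larger tables, which mirrors the point above that larger models cover more scenarios.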
Current ASR systems are cloud-based ASR (Cloud-ASR) implementations that target Large Vocabulary Continuous Speech Recognition (LVCSR) and use a Deep Neural Network (DNN) acoustic model.