Recognizing speech is easy for most humans but has proven to be a difficult challenge for computers. Automatic Speech Recognition (ASR) refers to the completely automated transcription of audio to text by computing systems.
Known ASR systems are composed of discrete components, each of which performs a portion of the speech recognition task. For example, an ASR system may include an acoustic model, a decoder, a language model, and a pronunciation model. One type of acoustic model may classify a sequence of audio features as a sequence of phonemes, or units of sound. Typically, the set of phonemes is determined a priori, and the acoustic model selects which phoneme in the set corresponds to the input acoustic features. Some acoustic models rely on hand-tuned descriptions of audio features to detect phonemes.
A pronunciation model then maps the sequence of phonemes to a word by way of a dictionary. The word-to-phoneme dictionary may also be created manually or significantly edited by human experts. An independent language model may further aid in determining the final transcription by assigning probabilities to word sequences independently of the acoustic input. These types of ASR systems require substantial human labor to hand-tune each component and to integrate the components into a cohesive whole.
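
For illustration only, the component pipeline described above may be sketched as a toy Python program. Everything in the sketch is a hypothetical assumption made for the example rather than part of any known system: the four-phoneme set, the two-dimensional prototype feature vectors standing in for a hand-tuned acoustic model, the three-word dictionary, and the word priors. The language model is reduced to a single-word prior, whereas a practical language model would score sequences of words.

```python
import math

# Hypothetical phoneme set, fixed a priori (illustrative only).
PHONEMES = ["k", "ae", "t", "b"]

# Hypothetical hand-tuned "acoustic model": one prototype feature vector per phoneme.
PROTOTYPES = {
    "k": (1.0, 0.0),
    "ae": (0.0, 1.0),
    "t": (1.0, 1.0),
    "b": (0.0, 0.0),
}

# Hypothetical manually created word-to-phoneme dictionary.
WORD_TO_PHONEMES = {
    "cat": ("k", "ae", "t"),
    "kat": ("k", "ae", "t"),
    "bat": ("b", "ae", "t"),
}

# Hypothetical language model, simplified to a prior over single words.
# These scores do not depend on the acoustic input.
WORD_PRIOR = {"cat": 0.60, "bat": 0.39, "kat": 0.01}


def classify_frame(features):
    """Acoustic model: pick the phoneme whose prototype is nearest to the features."""
    return min(PHONEMES, key=lambda p: math.dist(features, PROTOTYPES[p]))


def transcribe(frames):
    """Run acoustic model, pronunciation lookup, then language-model selection."""
    phonemes = tuple(classify_frame(f) for f in frames)
    # Pronunciation model: find every word whose dictionary entry matches.
    candidates = [w for w, p in WORD_TO_PHONEMES.items() if p == phonemes]
    if not candidates:
        return None  # phoneme sequence is not in the dictionary
    # Language model: choose the candidate with the highest prior probability.
    return max(candidates, key=lambda w: WORD_PRIOR.get(w, 0.0))


if __name__ == "__main__":
    frames = [(0.9, 0.1), (0.1, 0.8), (0.9, 0.9)]
    print(transcribe(frames))
```

Even in this toy form, each table (the prototypes, the dictionary, and the priors) is authored by hand and must be kept consistent with the others, which reflects the labor described above.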