Spoken language processing systems can process audio data of spoken user input to generate one or more possible transcriptions of what the user said. Such systems can then identify the meaning of what the user said in order to take some action in response to the spoken input. Some spoken language processing systems contain an automatic speech recognition (“ASR”) module that may generate one or more likely transcriptions of the utterance. The ASR module may produce these transcriptions as sequences of words (e.g., tokens) that satisfy certain constraints. Other modules, such as a natural language understanding (“NLU”) module, may then interpret the user's words based on output from the ASR module to determine some actionable intent from the user's utterance.
An ASR module may utilize various models to recognize speech, such as an acoustic model and a language model. The acoustic model is applied to features extracted from the audio data to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine the most likely transcription of the utterance based on the hypotheses generated using the acoustic model and lexical features of the language in which the utterance is spoken. In a common implementation, the ASR module may employ a decoding graph when converting a given utterance into a sequence of word tokens allowed by the underlying language model.
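By way of illustration only, the interaction between acoustic hypotheses and a language-model-constrained decoding graph can be sketched as a small best-path search. The sketch below is a toy under stated assumptions, not a description of any particular ASR implementation: the names GRAPH, ACOUSTIC, and decode are hypothetical, all probabilities are invented, and the audio is assumed to be already divided into one segment per word, which a real decoder would not require.

```python
import math

# Hypothetical decoding graph: state -> list of (word, next_state, LM log-probability).
# Only word sequences that follow arcs in this graph are allowed by the language model.
GRAPH = {
    "<s>": [("play", "play", math.log(0.5)), ("pay", "pay", math.log(0.5))],
    "play": [("music", "music", math.log(0.8)), ("muse", "muse", math.log(0.2))],
    "pay": [("music", "music", math.log(0.2)), ("muse", "muse", math.log(0.1))],
}

# Hypothetical acoustic hypotheses: one dict per audio segment, word -> log-likelihood.
# Assumes the utterance has already been split into one segment per word.
ACOUSTIC = [
    {"play": math.log(0.4), "pay": math.log(0.6)},
    {"music": math.log(0.7), "muse": math.log(0.3)},
]


def decode(graph, acoustic_scores, start="<s>"):
    """Return the best-scoring word sequence through the graph (Viterbi-style search)."""
    # best[state] = (total log score, words so far) for the best path ending in that state
    best = {start: (0.0, [])}
    for segment in acoustic_scores:
        next_best = {}
        for state, (score, words) in best.items():
            for word, next_state, lm_logp in graph.get(state, []):
                if word not in segment:
                    continue  # no acoustic hypothesis supports this arc
                total = score + segment[word] + lm_logp
                if next_state not in next_best or total > next_best[next_state][0]:
                    next_best[next_state] = (total, words + [word])
        best = next_best
    if not best:
        return []  # no path through the graph matches the acoustic hypotheses
    return max(best.values(), key=lambda entry: entry[0])[1]


# Acoustic evidence alone slightly prefers "pay" for the first segment,
# but the language model scores in the graph favor "play music" overall.
acoustic_only_first_word = max(ACOUSTIC[0], key=ACOUSTIC[0].get)
transcription = decode(GRAPH, ACOUSTIC)
```

In this toy example the acoustic scores alone favor "pay" for the first segment, while the combined acoustic and language model scores select the sequence "play music", which illustrates why the two models are used together.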