Speech processing systems include various modules and components for receiving spoken input from a user and determining what the user meant. In some implementations, a speech processing system includes an automatic speech recognition (“ASR”) module that receives audio input of a user utterance and generates one or more likely transcriptions of the utterance. ASR modules typically use an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance based on lexical features of the language in which the utterance is spoken. Some language models are implemented as restrictive grammars, while others are implemented as less restrictive statistical language models. Utterances recognized using a grammar may be processed by downstream processes that take some action in response to the specific utterance recognized using the grammar, while utterances that deviate from those in the grammar are usually misrecognized or rejected.
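By way of illustration only, the interaction between the acoustic model and the language model described above can be sketched as a rescoring step. The hypothesis list, the bigram probabilities, the `UNSEEN_PROB` floor, and the function names below are assumptions made up for this example; they do not come from any particular ASR module.

```python
import math

# Hypothetical acoustic-model output: candidate transcriptions paired with
# log-likelihoods. The numbers are invented for illustration.
ACOUSTIC_HYPOTHESES = [
    ("wreck a nice beach", math.log(0.45)),
    ("recognize speech", math.log(0.40)),
]

# Toy bigram language model (assumed values). "<s>" marks the sentence start.
BIGRAM_PROBS = {
    ("<s>", "recognize"): 0.2,
    ("recognize", "speech"): 0.5,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN_PROB = 1e-4  # floor probability for bigrams missing from the table


def language_model_log_prob(text):
    """Sum the log-probabilities of each adjacent word pair in the text."""
    words = ["<s>"] + text.split()
    return sum(
        math.log(BIGRAM_PROBS.get(pair, UNSEEN_PROB))
        for pair in zip(words, words[1:])
    )


def best_transcription(hypotheses, lm_weight=1.0):
    """Pick the hypothesis with the highest combined acoustic + LM score."""
    def combined_score(item):
        text, acoustic_log_prob = item
        return acoustic_log_prob + lm_weight * language_model_log_prob(text)

    return max(hypotheses, key=combined_score)[0]


if __name__ == "__main__":
    print(best_transcription(ACOUSTIC_HYPOTHESES))
    print(best_transcription(ACOUSTIC_HYPOTHESES, lm_weight=0.0))
```

In this sketch, the acoustic scores alone slightly favor the first hypothesis, while the language model score shifts the choice to the second. Setting `lm_weight` to zero disables the language model contribution.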
Speech processing systems may also include a natural language understanding (“NLU”) module that receives textual input, such as a transcription of a user utterance, and determines the meaning of the text in a way that can be acted upon, such as by a computer application. For example, an NLU module may be used to determine the meaning of text generated by an ASR module using a statistical language model. The NLU module can then determine the user's intent from the ASR output and provide the intent to some downstream process that performs some task responsive to the determined intent of the user (e.g., generate a command to initiate a phone call, initiate playback of requested music, provide requested information, etc.).
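As a minimal sketch of the flow described above, the following example maps a transcription to an intent and hands that intent to a downstream handler. The regular-expression patterns, the intent names (`call_contact`, `play_music`), and the command strings are hypothetical and chosen only for illustration; a practical NLU module would typically rely on statistical models rather than hand-written patterns.

```python
import re

# Hypothetical intent patterns; each named group captures one slot value.
INTENT_PATTERNS = [
    ("call_contact", re.compile(r"^(?:call|dial|phone)\s+(?P<contact>.+)$")),
    ("play_music", re.compile(r"^play\s+(?P<title>.+)$")),
]


def determine_intent(transcription):
    """Return the first matching intent and its slots for a transcription."""
    text = transcription.strip().lower()
    for name, pattern in INTENT_PATTERNS:
        match = pattern.match(text)
        if match:
            return {"intent": name, "slots": match.groupdict()}
    return {"intent": "unknown", "slots": {}}


def dispatch(intent):
    """Stand-in for a downstream process: turn an intent into a command."""
    handlers = {
        "call_contact": lambda slots: f"CALL {slots['contact']}",
        "play_music": lambda slots: f"PLAY {slots['title']}",
    }
    handler = handlers.get(intent["intent"])
    return handler(intent["slots"]) if handler else None


if __name__ == "__main__":
    for utterance in ("Call Mom", "play some jazz", "what time is it"):
        print(utterance, "->", dispatch(determine_intent(utterance)))
```

Keeping intent determination separate from dispatch mirrors the division between the NLU module and the downstream process: the handlers can change without altering how the intent is derived from the text.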