Most real-time automatic speech recognition systems include a speech recognizer and a natural language processor. The speech recognizer converts an incoming stream of sounds into a likely sequence of words. In particular, the speech recognizer receives digitized speech samples from an acoustic input device (e.g., a microphone), and converts the digitized speech samples into sequences of recognized words based upon finite state machine templates. The finite state machine templates are defined by a set of vocabulary word patterns, which are stored in a dictionary, and, possibly, a set of grammar rules. The natural language processor attempts to make sense of the word sequences that are output by the speech recognizer. That is, the natural language processor attempts to extract from each word sequence a meaning that is relevant to a specific user application. In a typical implementation, the natural language processor compares a word sequence with internal semantic patterns that are defined by a grammar compiler. The natural language processor identifies permissible sentences based upon the internal semantic patterns, and outputs summary structures representing permissible sentences. In task-oriented automatic speech recognition systems, an application command translator may match the summary structures with a known set of user application commands, and may return appropriate commands to a user application for processing.
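The three stages described above (speech recognizer, natural language processor, application command translator) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pattern table, the command names, the `Summary` structure, and all function names are hypothetical, regular expressions stand in for the compiled semantic patterns, and the recognizer is a stub that receives words rather than digitized speech samples.

```python
import re
from dataclasses import dataclass, field

# Hypothetical semantic patterns. In the system described above these would be
# defined by a grammar compiler; plain regular expressions stand in for them here.
SEMANTIC_PATTERNS = {
    "open_file": re.compile(r"^(?:please )?open (?:the )?file (?P<name>\w+)$"),
    "set_volume": re.compile(r"^set (?:the )?volume to (?P<level>\w+)$"),
}

# Hypothetical set of known user application commands, keyed by intent.
APPLICATION_COMMANDS = {
    "open_file": "FILE_OPEN",
    "set_volume": "AUDIO_SET_VOLUME",
}


@dataclass
class Summary:
    """Illustrative summary structure for one permissible sentence."""
    intent: str
    slots: dict = field(default_factory=dict)


def recognize(samples):
    """Stub speech recognizer: the 'samples' are assumed to be words already."""
    return [word.lower() for word in samples]


def interpret(words):
    """Natural language processor: match the word sequence against each pattern."""
    sentence = " ".join(words)
    for intent, pattern in SEMANTIC_PATTERNS.items():
        match = pattern.match(sentence)
        if match:
            return Summary(intent, match.groupdict())
    return None  # not a permissible sentence


def translate(summary):
    """Command translator: map a summary structure to an application command."""
    if summary is None:
        return None
    command = APPLICATION_COMMANDS.get(summary.intent)
    if command is None:
        return None
    return (command, summary.slots)


def run_pipeline(samples):
    """Chain the three stages in the order the text describes."""
    return translate(interpret(recognize(samples)))
```

Calling `run_pipeline(["Open", "the", "file", "report"])` walks the input through all three stages, while an utterance that matches no pattern, such as `["play", "music"]`, is rejected by the natural language stage and yields `None`.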
Speech-enabled applications that understand normal spoken sentences are often difficult to implement and usually exhibit frequent recognition errors. In general, three primary obstacles make automatic speech recognition difficult. First, most words are short. With only a few sound features to work from, a speech recognizer usually has difficulty clearly distinguishing among similar-sounding candidates. As a result, numerous incorrect hypotheses from the dictionary are likely to be passed on to the grammar process. Second, grammar processes that attempt to make fine distinctions among possible utterances are complicated, inefficient, and hard to write. Simpler grammars, on the other hand, admit many incorrect utterances that cannot then be parsed (or interpreted) for meaning. Finally, even if a correct sentence is transcribed and output, natural language itself is typically ambiguous. Thus, the parsing process implemented by the user application often must test numerous hypotheses using complicated, non-deterministic methods to determine the correct meaning of a natural language input.