A speech recognition system analyzes a user's speech to determine what the user said. Most speech recognition systems are frame-based. In a frame-based system, a processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech.
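The division of a signal into frames can be illustrated with a minimal sketch. The function name `split_into_frames`, the 10 ms frame duration, and the choice to drop a trailing partial frame are assumptions made for illustration; the text above says only that each frame corresponds to a small time increment of the speech.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float],
    sample_rate_hz: int,
    frame_ms: float = 10.0,  # assumed frame duration, not taken from the text
) -> List[List[float]]:
    """Divide a sampled signal into consecutive, non-overlapping frames."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000))
    if frame_len <= 0:
        raise ValueError("frame duration is too short for this sample rate")
    # Step through the signal one frame at a time; a trailing partial
    # frame is dropped (an assumption made to keep the sketch simple).
    return [
        list(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

With a 1000 Hz sample rate and the default 10 ms duration, each frame holds 10 samples, so a 35-sample signal yields three complete frames.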
A continuous speech recognition system can recognize spoken words or phrases regardless of whether the user pauses between them. By contrast, a discrete speech recognition system recognizes discrete words or phrases and requires the user to pause briefly after each one. Continuous speech recognition systems typically have a higher incidence of recognition errors than discrete recognition systems because continuous speech is more complex to recognize.
In general, the processor of a continuous speech recognition system analyzes “utterances” of speech. An utterance includes a variable number of frames and may correspond to a period of speech followed by a pause of at least a predetermined duration.
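One way to picture how frames are grouped into utterances is the sketch below. It assumes that a frame counts as a pause when its mean squared amplitude falls below a threshold; the helper names `frame_energy` and `segment_utterances`, the energy test, and the parameter `min_pause_frames` are all hypothetical and stand in for whatever pause detection a given system uses.

```python
from typing import List, Sequence

Frame = Sequence[float]


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of a frame (assumed speech/pause measure)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def segment_utterances(
    frames: Sequence[Frame],
    energy_threshold: float,
    min_pause_frames: int,
) -> List[List[Frame]]:
    """Group frames into utterances, each ended by a sufficiently long pause."""
    if min_pause_frames < 1:
        raise ValueError("min_pause_frames must be at least 1")
    utterances: List[List[Frame]] = []
    current: List[Frame] = []
    silent_run = 0  # consecutive quiet frames seen since the last speech frame
    for frame in frames:
        if frame_energy(frame) >= energy_threshold:
            current.append(frame)
            silent_run = 0
        elif current:
            # A quiet frame inside an utterance: keep it for now, since a
            # pause shorter than the minimum does not end the utterance.
            current.append(frame)
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the utterance without
                # the trailing quiet frames.
                utterances.append(current[:-silent_run])
                current = []
                silent_run = 0
        # Quiet frames before any speech are ignored.
    if current:
        utterances.append(current[:len(current) - silent_run])
    return utterances
```

Because the number of frames between qualifying pauses varies, each utterance returned here contains a variable number of frames, as described above.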
The processor determines what the user said by finding acoustic models that best match the digital frames of an utterance, and identifying text that corresponds to those acoustic models. An acoustic model may correspond to a word, phrase or command from a vocabulary. An acoustic model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. Acoustic models also may represent silence and various types of environmental noise.
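The idea of finding the best-matching acoustic model and then identifying its text can be sketched as follows. This is a toy stand-in, not the scoring method of any particular recognizer: each hypothetical `AcousticModel` is reduced to a single mean feature vector, each frame is assumed to be a feature vector of the same length, and the match score is a summed squared distance in which lower is better.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class AcousticModel:
    text: str                 # corresponding text; empty for silence or noise
    mean: Tuple[float, ...]   # assumed one-vector stand-in for the model


def frame_distance(frame: Sequence[float], mean: Sequence[float]) -> float:
    """Squared Euclidean distance between a frame and a model mean."""
    return sum((f - m) ** 2 for f, m in zip(frame, mean))


def score_utterance(frames: Sequence[Sequence[float]], model: AcousticModel) -> float:
    """Total distance of all frames in an utterance from the model."""
    return sum(frame_distance(frame, model.mean) for frame in frames)


def best_match(
    frames: Sequence[Sequence[float]],
    models: Sequence[AcousticModel],
) -> AcousticModel:
    """Return the model with the lowest total distance to the frames."""
    return min(models, key=lambda model: score_utterance(frames, model))


# Hypothetical vocabulary: two words plus a silence model.
models = [
    AcousticModel(text="yes", mean=(1.0, 0.0)),
    AcousticModel(text="no", mean=(0.0, 1.0)),
    AcousticModel(text="", mean=(0.0, 0.0)),  # silence
]
utterance = [(0.9, 0.1), (0.8, 0.2)]
recognized_text = best_match(utterance, models).text
```

In this example the frames lie closest to the first model, so `recognized_text` is `"yes"`. The silence model competes on the same footing as the word models, which mirrors the point that acoustic models may also represent silence and noise.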
In a typical speech recognition system, a user dictates into a microphone connected to a computer. The computer then performs speech recognition to find acoustic models that best match the user's speech. The words or phrases corresponding to the best matching acoustic models are referred to as recognition candidates. The computer may produce a single recognition candidate (i.e., a single sequence of words or phrases) for an utterance, or may produce a list of recognition candidates. Typically, the best recognition candidate is immediately displayed to the user, or an action corresponding to the best recognition candidate is performed. Other recognition candidates may also be displayed. The user generally is permitted to correct errors in the recognition.
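A list of recognition candidates can be pictured as a ranked list, as in the minimal sketch below. The names `Candidate` and `rank_candidates`, the convention that a lower score is a better match, and the cap on the number of candidates are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    text: str     # a sequence of words or phrases
    score: float  # assumed convention: lower means a better match


def rank_candidates(scores: Dict[str, float], max_candidates: int = 3) -> List[Candidate]:
    """Return up to max_candidates recognition candidates, best first."""
    ranked = sorted(
        (Candidate(text, score) for text, score in scores.items()),
        key=lambda candidate: candidate.score,
    )
    return ranked[:max_candidates]


# Hypothetical scores for one utterance.
candidates = rank_candidates(
    {
        "wreck a nice beach": 14.0,
        "recognize speech": 12.5,
        "recognized peach": 20.0,
        "wreck on ice": 30.0,
    }
)
best = candidates[0]            # displayed or acted on immediately
alternatives = candidates[1:]   # may be offered when the user corrects an error
```

Keeping the alternatives alongside the best candidate gives the user something to choose from when the first choice is wrong.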