A speech recognition system analyzes a user's speech to determine what the user said. Most speech recognition systems are frame-based. In a frame-based system, a processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech.
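The frame division described above can be illustrated with a minimal sketch. The function name `split_into_frames`, the 10 millisecond default frame length, and the choice to drop a trailing partial frame are assumptions made for illustration only; the text does not specify them.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sequence of digitized samples into consecutive fixed-length frames.

    frame_ms is an assumed frame duration; the source only says each frame
    covers a small time increment of the speech.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # A trailing partial frame is dropped to keep the sketch simple.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

For example, 35 samples at an assumed rate of 1000 samples per second would yield three frames of 10 samples each, with the last 5 samples discarded.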
A continuous speech recognition system can recognize spoken words or phrases regardless of whether the user pauses between them. By contrast, a discrete speech recognition system recognizes discrete words or phrases and requires the user to pause briefly after each discrete word or phrase. Continuous speech recognition systems typically have a higher incidence of recognition errors than discrete recognition systems because of the complexities of recognizing continuous speech.
In general, the processor of a continuous speech recognition system analyzes “utterances” of speech. An utterance includes a variable number of frames and may correspond to a period of speech followed by a pause of at least a predetermined duration.
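One way to picture the grouping of frames into utterances is the sketch below. It assumes each frame has already been labeled as speech or silence by some other means, and the names `segment_utterances` and `min_pause_frames` are hypothetical; the text states only that an utterance may end at a pause of at least a predetermined duration.

```python
def segment_utterances(is_speech, min_pause_frames):
    """Group frames into utterances separated by sufficiently long pauses.

    is_speech: per-frame flags (truthy means speech), assumed to be given.
    Returns (start, end) frame index pairs, with end exclusive.
    """
    utterances = []
    start = None        # first speech frame of the utterance in progress
    last_speech = None  # most recent speech frame seen
    silence_run = 0     # consecutive silent frames since last_speech
    for i, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = i
            last_speech = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_pause_frames:
                # The pause is long enough to close the utterance.
                utterances.append((start, last_speech + 1))
                start = None
    if start is not None:
        # Input ended before a closing pause was observed.
        utterances.append((start, last_speech + 1))
    return utterances
```

In this sketch a pause shorter than `min_pause_frames` stays inside the utterance, which is why the number of frames per utterance varies.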
The processor determines what the user said by finding acoustic models that best match the digital frames of an utterance, and identifying text that corresponds to those acoustic models. An acoustic model may correspond to a word, phrase or command from a vocabulary. An acoustic model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. Acoustic models also may represent silence and various types of environmental noise.
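The matching step can be sketched in a deliberately simplified form. The example below is not the method of any particular recognizer: it stands in a single mean feature vector for each acoustic model and scores a model by the negative mean squared distance to the frames, whereas practical systems use far richer statistical models. The names `score_model` and `best_match` and the two-element feature vectors are assumptions for illustration.

```python
def score_model(frames, model_mean):
    """Score a toy acoustic model against an utterance (higher is better).

    Each frame and the model are assumed to be equal-length feature vectors.
    """
    if not frames:
        raise ValueError("an utterance must contain at least one frame")
    total = 0.0
    for frame in frames:
        total += sum((x - m) ** 2 for x, m in zip(frame, model_mean))
    return -total / len(frames)


def best_match(frames, models):
    """Return the text whose toy model best matches the frames.

    models: dict mapping text (a word, phrase, or command) to a mean vector.
    """
    return max(models, key=lambda text: score_model(frames, models[text]))
```

With hypothetical models `{"yes": [1.0, 0.0], "no": [0.0, 1.0]}`, an utterance whose frames lie near `[1.0, 0.0]` would be matched to the text "yes".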
The words or phrases corresponding to the best matching acoustic models are referred to as recognition candidates. The processor may produce a single recognition candidate (that is, a single sequence of words or phrases) for an utterance, or may produce a list of recognition candidates.
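The difference between producing a single recognition candidate and producing a list can be sketched as follows. The function name `n_best` and the convention that a higher score means a better match are assumptions; the scores themselves are taken as given.

```python
def n_best(scored_hypotheses, n=1):
    """Return the n best-scoring recognition candidates, best first.

    scored_hypotheses: iterable of (text, score) pairs, where a higher
    score is assumed to indicate a better acoustic match.
    """
    ranked = sorted(scored_hypotheses, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:n]]
```

Calling `n_best(hypotheses)` yields a single candidate, while a larger `n` yields a ranked list of alternatives.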
Correction mechanisms for some discrete speech recognition systems displayed a list of choices for each recognized word and permitted a user to correct a misrecognition by selecting a word from the list or typing the correct word. For example, DRAGONDICTATE™ for MICROSOFT WINDOWS™, by Dragon Systems, Inc. of Newton, Mass., displayed a list of numbered recognition candidates (“a choice list”) for each word spoken by the user, and inserted the best-scoring recognition candidate into the text being dictated by the user. If the best-scoring recognition candidate was incorrect, the user could select a recognition candidate from the choice list by saying “choose-N”, where “N” was the number associated with the correct candidate. If the correct word was not on the choice list, the user could refine the list, either by typing in the first few letters of the correct word, or by speaking words (for example, “alpha”, “bravo”) associated with the first few letters. The user also could discard the incorrect recognition result by saying “scratch that”.
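The choice-list interaction described above can be sketched roughly as follows. This is an illustration only and does not reproduce the actual product: the function names, the numbering of candidates from 1, the limit of ten choices, and the representation of the vocabulary as a dictionary of scores are all assumptions.

```python
def refine_choices(vocabulary_scores, prefix, max_choices=10):
    """Rebuild a choice list from words that begin with the given letters.

    vocabulary_scores: dict mapping each word to a score (higher is better).
    """
    matches = [(w, s) for w, s in vocabulary_scores.items() if w.startswith(prefix)]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in matches[:max_choices]]


def apply_choose(choice_list, command):
    """Handle a "choose-N" command; return the selected word, or None.

    Candidates are assumed to be numbered starting at 1.
    """
    if not command.startswith("choose-"):
        return None
    try:
        n = int(command[len("choose-"):])
    except ValueError:
        return None
    if 1 <= n <= len(choice_list):
        return choice_list[n - 1]
    return None
```

Under these assumptions, saying “choose-2” against the list `["there", "their", "they're"]` would select "their", and typing "the" would narrow the vocabulary to words that start with those letters.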
Dictating a new word implied acceptance of the previous recognition. If the user noticed a recognition error after dictating additional words, the user could say “Oops”, which would bring up a numbered list of previously recognized words. The user could then choose a previously recognized word by saying “word-N”, where “N” was a number associated with the word. The system would respond by displaying a choice list associated with the selected word and permitting the user to correct the word as described above.
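The “Oops” and “word-N” interaction can be sketched by keeping, for each dictated word, the choice list it came from. The class name `DictationHistory`, the default of five recent words, and numbering from 1 are assumptions for illustration; the text does not say how many words were listed or how they were stored.

```python
class DictationHistory:
    """Toy record of dictated words and the choice list behind each one."""

    def __init__(self):
        self.entries = []  # each entry: {"text": str, "choices": list of str}

    def dictate(self, choices):
        """Accept a new word; choices is ordered best first."""
        self.entries.append({"text": choices[0], "choices": list(choices)})

    def oops(self, count=5):
        """Return a numbered list of the most recent words (count is assumed)."""
        recent = self.entries[-count:]
        return [(i + 1, entry["text"]) for i, entry in enumerate(recent)]

    def choices_for(self, command, count=5):
        """Handle "word-N": return the choice list of the Nth listed word."""
        if not command.startswith("word-"):
            return None
        try:
            n = int(command[len("word-"):])
        except ValueError:
            return None
        recent = self.entries[-count:]
        if 1 <= n <= len(recent):
            return list(recent[n - 1]["choices"])
        return None
```

After the words "their" and "cat" have been dictated, `oops()` would list them as numbers 1 and 2, and `choices_for("word-1")` would return the choice list originally produced for "their", from which a correction could then be chosen.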