Speech recognition systems may be broadly classified into three categories based on the task being accomplished: (a) Speech-to-Text systems (sometimes referred to as Dictation systems), wherein the task is to recognize continuously spoken words and produce text output; (b) Large Vocabulary Telephony systems; and (c) Embedded Command-and-Control systems, wherein the task is to recognize spoken words forming set phrases that in turn represent a command or control to the system.
Commercial Speech-to-Text systems include Dragon-NaturallySpeaking, IBM-ViaVoice, Microsoft-Speech, and others. These systems are generally deployed on a personal computer and are useful for dictating letters, documents, medical/legal reports, etc. They typically rely on stochastic language modeling techniques (referred to as N-Gram); however, limited-vocabulary speech-to-text may also be achieved using context-free or other finite state grammars. In a Speech-to-Text system, the user is generally allowed to speak in a free-form dictation mode, as in “Please meet me tonight at 10 p.m. in front of the Seattle Train Station; John and I will wait for you in front of the Barnes & Noble book store.”
Speech recognition of free-form, dictation-style speech is a fairly onerous task. It is complicated by what is referred to as the “language model perplexity” of the task. The major problem stems from the fact that users could say any word followed by any other word(s) from a vocabulary that could range into hundreds of thousands of words. To improve accuracy, many systems resort to techniques like domain-specific language modeling, interpolated language modeling, etc. Unfortunately, the problem may be viewed as far from solved, and hence these systems have had limited commercial success.
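To make the notion of perplexity concrete: it is commonly defined as the inverse probability that a language model assigns to a word sequence, normalized by the sequence length, so a lower value means the model finds the sequence more predictable. The following is a minimal sketch, assuming a toy bigram model with add-one smoothing trained on two of the example phrases used in this section; the function names and the tiny training set are illustrative assumptions, not part of any system described here.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count history words and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1          # how often prev occurs as a history
            bigrams[(prev, cur)] += 1    # how often cur follows prev
    return unigrams, bigrams, vocab


def perplexity(sentence, unigrams, bigrams, vocab):
    """Perplexity of one sentence under the add-one smoothed bigram model.

    Toy simplification: out-of-vocabulary words are scored with the same
    add-one formula, so the model is not strictly normalized for them.
    """
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    v = len(vocab)
    log_prob = 0.0
    n = 0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)


# Hypothetical training data: two fixed command phrases.
model = train_bigram(["tune radio to 98.3", "go to email box"])
constrained = perplexity("go to email box", *model)
free_form = perplexity("please meet me tonight", *model)
```

Under these assumptions the fixed command phrase scores a lower perplexity than the free-form sentence, which mirrors the difficulty described above: the less constrained the user's speech, the more alternatives the recognizer must weigh at each word.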
Commercial telephony systems include Large Vocabulary systems developed by companies like Nuance, SpeechWorks, etc. These systems typically address telephony applications like banking, stock quotes, call center automation, and directory assistance. These Large Vocabulary systems generally use statistical, context-free, and/or finite state grammar based language models. In applications deployed on these systems, users are restricted to a phrase, as in “Stock Quote for Charles Schwab.” Using techniques like word spotting and natural language processing, some systems relax this constraint, allowing users to speak freely, as in “Please find me a quote for Charles Schwab stock if you don't mind.”
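The relaxation described above can be illustrated at the text level. The following is a minimal sketch, assuming the recognizer has already produced a transcript; real word spotting typically operates on the acoustic signal, and the keyword table, company list, and function name here are hypothetical.

```python
# Hypothetical lookup tables for a stock-quote application.
INTENT_KEYWORDS = {"quote": "stock_quote"}
COMPANIES = ["charles schwab"]


def spot(transcript):
    """Return (intent, company) found anywhere in a free-form transcript."""
    cleaned = transcript.lower().replace(".", " ").replace(",", " ")
    words = cleaned.split()
    # First keyword that maps to a known intent, if any.
    intent = next((INTENT_KEYWORDS[w] for w in words if w in INTENT_KEYWORDS), None)
    # Pad with spaces so a company name only matches on whole-word boundaries.
    padded = " " + " ".join(words) + " "
    company = next((c for c in COMPANIES if f" {c} " in padded), None)
    return intent, company


fixed = spot("Stock Quote for Charles Schwab.")
free = spot("Please find me a quote for Charles Schwab stock if you don't mind.")
```

Both the fixed phrase and the free-form request yield the same intent and company in this sketch, because the surrounding filler words are simply ignored.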
Medium/Small vocabulary Command-and-Control systems are offered by many embedded speech recognition companies, including VoiceSignal, Conversay, Fonix, Sensory, ART, and VoCollect. These typically address applications like name-digit dialing for cellular phones, Personal Information Management for personal digital assistants, data entry for industrial environments, etc. Command-and-Control systems usually rely on finite state grammars. In a Command-and-Control system, the user is generally restricted to saying a phrase in a fixed way, as in “Tune Radio to 98.3” or “Go To Email-Box.”
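A finite state grammar of the kind mentioned above can be pictured as a set of states connected by word-labeled arcs, where only complete paths from the start state to the end state are accepted. The following is a minimal sketch covering the two example commands; the state names, the `<number>` placeholder arc, and the `accepts` function are assumptions made for illustration.

```python
import re

# Each state maps an allowed next word to the state it leads to.
GRAMMAR = {
    "START": {"tune": "TUNE", "go": "GO"},
    "TUNE": {"radio": "RADIO"},
    "RADIO": {"to": "FREQ"},
    "FREQ": {"<number>": "END"},   # placeholder arc matching any numeric token
    "GO": {"to": "DEST"},
    "DEST": {"email-box": "END"},
}

NUMBER = re.compile(r"\d+(\.\d+)?")


def accepts(phrase, grammar=GRAMMAR):
    """Return True only if the phrase follows a complete path through the grammar."""
    state = "START"
    for token in phrase.lower().split():
        arcs = grammar.get(state, {})
        if token in arcs:
            state = arcs[token]
        elif "<number>" in arcs and NUMBER.fullmatch(token):
            state = arcs["<number>"]
        else:
            return False  # word not allowed at this point in the grammar
    return state == "END"
```

In this sketch, `accepts("Tune Radio to 98.3")` and `accepts("Go To Email-Box")` succeed, while a rephrasing such as "Please tune the radio" is rejected, which illustrates why the user must say the phrase in a fixed way.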
Telephony and Command-and-Control systems have at times resorted to a speak-and-spell mode in several scenarios, including: (a) entering a new word in a lexicon; (b) generating pronunciations for words; (c) improving accuracy for tasks like directory assistance or name dialing; and (d) correcting errors made by the recognition system. It is well known that a speech recognition system's accuracy may be improved by asking users to speak and then spell the words, as in “Call JOHN SMITH spell that J-O-H-N-S-M-I-T-H,” as opposed to “Call JOHN SMITH.” For example, the MIT Laboratory for Computer Science has published a research paper (refer to “Automatic Acquisition of Names Using Speak and Spell Mode in Spoken Dialogue Systems,” Seneff and Wang, March 2003, herein incorporated by reference) wherein the authors believe that the most natural way to enter data (for their application) would be through the speak-and-spell scenario.
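One way to see why spelling helps is to treat the spelled letters as a second source of evidence for choosing among similar-sounding name hypotheses. The following is a minimal sketch, assuming the recognizer returns a transcript containing the marker phrase "spell that" along with a list of candidate names; the function names and the candidate list are hypothetical, and string similarity from Python's standard `difflib` module stands in for whatever scoring a real system would use.

```python
from difflib import SequenceMatcher


def parse_speak_and_spell(utterance):
    """Split an utterance into the spoken name and the spelled letter string."""
    spoken, _, spelled = utterance.partition(" spell that ")
    # Drop the leading command word (e.g. "Call") to keep only the name.
    name = spoken.split(" ", 1)[1] if " " in spoken else ""
    letters = spelled.replace("-", "").replace(" ", "").upper()
    return name.upper(), letters


def rescore(candidates, letters):
    """Pick the candidate name whose spelling best matches the spelled letters."""
    def score(name):
        return SequenceMatcher(None, name.replace(" ", "").upper(), letters).ratio()
    return max(candidates, key=score)


name, letters = parse_speak_and_spell("Call JOHN SMITH spell that J-O-H-N-S-M-I-T-H")
# Hypothetical list of confusable hypotheses from a recognizer.
best = rescore(["JON SMYTHE", "JOHN SMITH", "JOAN SMITH"], letters)
```

Here the spelled letters single out "JOHN SMITH" from acoustically similar alternatives, which is the accuracy benefit the speak-and-spell mode is meant to provide.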