Techniques for accomplishing automatic speech recognition (ASR) are well known. Among known ASR techniques are those that use grammars. A grammar is a representation of the language or phrases expected to be used or spoken in a given context. In one sense, then, ASR grammars typically constrain the speech recognizer to a vocabulary that is a subset of the universe of potentially-spoken words; and grammars may include subgrammars. An ASR grammar rule can then be used to represent the set of “phrases” or combinations of words from one or more grammars or subgrammars that may be expected in a given context. “Grammar” may also refer generally to a statistical language model (in which a model represents phrases), such as those used in language understanding systems.
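By way of illustration only, the relationship between a grammar, its subgrammars, and the set of phrases a grammar rule represents might be sketched as follows. The rule names (`<request>`, `<fund>`), the vocabulary, and the helper functions are hypothetical and are not drawn from any particular ASR product. A real recognizer would score audio against the grammar rather than enumerate text phrases.

```python
from itertools import product

# Hypothetical toy grammar: each rule maps to a list of alternatives, and
# each alternative is a sequence of plain words or <subgrammar> references.
GRAMMAR = {
    "<request>": [["get", "quote", "for", "<fund>"], ["sell", "<fund>"]],
    "<fund>": [["growth", "fund"], ["income", "fund"]],
}


def expand(symbol, grammar):
    """Return every phrase (as a tuple of words) that `symbol` can produce."""
    if symbol not in grammar:
        # Anything that is not a rule name is treated as a plain word.
        return [(symbol,)]
    phrases = []
    for alternative in grammar[symbol]:
        # Expand each token, then combine the expansions in order.
        parts = [expand(token, grammar) for token in alternative]
        for combo in product(*parts):
            phrases.append(tuple(word for part in combo for word in part))
    return phrases


def in_grammar(utterance, grammar, root="<request>"):
    """Check whether the utterance is one of the phrases the root rule allows."""
    return tuple(utterance.lower().split()) in set(expand(root, grammar))
```

Under these assumptions, `in_grammar("sell income fund", GRAMMAR)` is true while `in_grammar("buy income fund", GRAMMAR)` is false, since "buy" lies outside the constrained vocabulary.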
ASR systems have greatly improved in recent years as better algorithms and acoustic models have been developed, and as more computing power can be brought to bear on the task. An ASR system running on an inexpensive home or office computer with a good microphone can take free-form dictation, as long as it has been pre-trained for the speaker's voice. Over the phone, and with no speaker training, a speech recognition system needs to be given a set of speech grammars that tell it what words and phrases it should expect. With these constraints, a surprisingly large set of possible utterances can be recognized (e.g., a particular mutual fund name out of thousands). Recognition over mobile phones in noisy environments does require more tightly pruned and carefully crafted speech grammars, however. Today there are many commercial uses of ASR in dozens of languages, and in areas as disparate as voice portals, finance, banking, telecommunications, and brokerages.
Advances are also being made in speech synthesis, or text-to-speech (TTS). Many of today's TTS systems still sound like “robots”, and can be hard to listen to or even at times incomprehensible. However, waveform concatenation speech synthesis is now being deployed. In this technique, speech is not completely generated from scratch, but is assembled from libraries of pre-recorded waveforms. The results are promising.
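As a rough illustration of the waveform concatenation idea, the following sketch assembles an output signal from a small library of pre-recorded units. The unit names, the sample values, and the short linear crossfade at each join are all assumptions made for the example; deployed systems select units and smooth joins with considerably more sophistication.

```python
# Hypothetical library of pre-recorded units, each a short list of samples.
LIBRARY = {
    "hel": [0.0, 0.2, 0.4, 0.3],
    "lo": [0.3, 0.1, 0.0, 0.0],
}


def crossfade_concat(units, overlap=2):
    """Join sample lists end to end, blending `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps toward the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(unit_names, library, overlap=2):
    """Look up each named unit and concatenate the recordings."""
    return crossfade_concat([library[name] for name in unit_names], overlap)
```

Here `synthesize(["hel", "lo"], LIBRARY)` yields a six-sample signal: two four-sample units sharing a two-sample blended seam. Nothing is generated from scratch; every output sample derives from recorded material.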
In a standard speech recognition/synthesis system, a database of utterances is maintained for administering a predetermined service. In one example of operation, a user may utilize a telecommunication network to communicate utterances to the system. In response to such communication, the utterances are recognized utilizing speech recognition, and processing takes place utilizing the recognized utterances. Thereafter, synthesized speech is output in accordance with the processing. In one particular application, a user may verbally communicate a street address to the speech recognition system, and driving directions may be returned utilizing synthesized speech.
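The flow just described (recognize the utterance, process the recognized text, return synthesized speech) may be sketched as below. Every function name here is hypothetical, and the recognizer and synthesizer are stand-in stubs that simply decode and encode bytes, so the sketch shows only the order of the stages and not any real speech processing.

```python
from typing import Callable


def handle_call(
    audio: bytes,
    recognize: Callable[[bytes], str],
    process: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one utterance through recognition, processing, and synthesis."""
    text = recognize(audio)   # speech recognition stage
    reply = process(text)     # service logic acting on the recognized text
    return synthesize(reply)  # synthesized speech returned to the caller


# Stand-in stubs (assumptions for illustration, not real ASR or TTS engines).
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def directions_service(address: str) -> str:
    return f"Directions to {address}: head north."


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")
```

For example, `handle_call(b"123 Main Street", fake_recognize, directions_service, fake_synthesize)` passes the spoken address through the three stages and returns the reply as bytes standing in for audio.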
Problems may exist when administering a speech recognition/synthesis system such as the one set forth hereinabove. For example, a user may call and request a service that requires more information associated with the user before the service can be delivered. For instance, a service may require payment. Therefore, in certain situations, there may be a need for information, e.g., a billing address, a credit card number, etc., before the service can be rendered.