1. Technical Field
The present disclosure relates to recognition and more specifically to combining speech and non-speech input to improve spelling and speech recognition.
2. Introduction
Automatic speech recognition (ASR) systems that are being deployed today have the ability to handle a variety of user input. ASR systems are deployed, for example, in call-centers where a person may call in and communicate with the spoken dialog computer system using natural speech. A typical call-center transaction might begin with a fairly unconstrained natural language statement of the query followed by a system or user-initiated input of specific information such as account numbers, names, addresses, etc. A transaction is usually considered successful if each of the input items (fields) is correctly recognized via ASR, perhaps with repeated input or other forms of confirmation. This implies that each field has to be recognized very accurately for the overall transaction accuracy to be acceptable.
In order to achieve the desired accuracy, state-of-the-art ASR systems rely on a variety of domain constraints. For instance, the accuracy with which a 10-digit account number is recognized may be 90% using a digit-loop grammar but close to perfect when the grammar is constrained to produce an account number which is in an account-number database. Similarly, if one has access to a names directory and the user speaks a name in the directory, the performance of ASR systems is generally fairly good for reasonable size directories.
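The effect of a database constraint can be illustrated with a minimal sketch. It assumes the recognizer returns an N-best list of (digit string, score) pairs, and it applies the constraint as a filter after recognition rather than inside the grammar, which is a simplification. The function name, scores, and account numbers are hypothetical.

```python
def constrain_to_database(nbest, valid_accounts):
    """Return the highest-scoring hypothesis found in valid_accounts, or None.

    nbest: list of (digit_string, score) pairs; a higher score is better.
    valid_accounts: set of account numbers known to the system (assumed).
    """
    for hypothesis, _score in sorted(nbest, key=lambda pair: pair[1], reverse=True):
        if hypothesis in valid_accounts:
            return hypothesis
    return None


# Hypothetical recognizer output: the top hypothesis has one wrong digit.
nbest = [("4155550123", -12.1), ("4155550128", -12.4), ("4155590123", -13.0)]
valid_accounts = {"4155550128", "9085551234"}

best = constrain_to_database(nbest, valid_accounts)
print(best)
```

In this sketch the unconstrained top hypothesis is not a valid account, so the second-ranked hypothesis is selected instead.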
In some applications, the use of domain constraints is problematic. As an example, consider an application whose purpose is to enroll new users for a service. In this case, information such as the telephone number, name, etc., needs to be obtained without the aid of database constraints. One could still use a priori constraints, such as a names directory that covers 90% of the US population according to the US Census data, to improve recognition accuracy. However, if the names distribution of the target population does not match the US Census distribution, the out-of-vocabulary (OOV) rate could be substantially higher than 10%.
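As a small illustration of the OOV rate discussed above, the following sketch computes the fraction of names in a target population that are missing from a directory. The directory contents and the sample names are made up for the example and are not drawn from Census data.

```python
def oov_rate(names, directory):
    """Fraction of names that do not appear in the directory (case-insensitive)."""
    if not names:
        raise ValueError("names must not be empty")
    known = {entry.lower() for entry in directory}
    missing = sum(1 for name in names if name.lower() not in known)
    return missing / len(names)


# Hypothetical directory and target population.
directory = {"smith", "johnson", "garcia"}
target_names = ["Smith", "Nguyen", "Garcia", "Okafor"]

print(oov_rate(target_names, directory))  # 2 of 4 names are missing: 0.5
```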
Recognition of long digit strings, names, spellings and the like over the telephone, whether by human or machine, is inherently difficult. Humans recover from recognition errors through dialog. Such dialogs, which might involve a prompt to repeat a portion of the digit string or a particular letter in a name, have been implemented in ASR systems but with limited success. In the short term, it appears that the best way to achieve very accurate recognition of difficult vocabularies such as letters and digits is to supplement voice with other input modalities such as keypads that produce touch-tones. The telephone keypad is designed for numeric entry and therefore is a natural backup modality for digit-string entry. However, the keypad is not as convenient for the entry of letter strings such as when names are spelled.
Cluster keyboards that partition the letters of the alphabet onto subset keys have been designed to facilitate accurate letter-string entry using keyboards. The letter ambiguity for each key-press in these keyboards is addressed by hypothesizing words in a dictionary that have the highest probability according to a language model. Such methods are effective, but they require the use of specialized keypads. If one is constrained to use the standard telephone keypad, one possibility is to use speech for disambiguation. A scheme for integrating keypad and speech input has been introduced recently but is not as successful as would be desired.
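The dictionary-based disambiguation described above can be sketched as follows. For illustration only, the sketch uses the standard telephone keypad letter groups rather than a specialized cluster keyboard, and a unigram probability table stands in for the language model. The names and probabilities are hypothetical, and words are assumed to contain only the letters a through z.

```python
# Letter groups of the standard telephone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def to_keys(word):
    """Map a word (letters a-z only) to the key sequence that would spell it."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def disambiguate(key_sequence, word_probs):
    """Return dictionary words matching key_sequence, most probable first."""
    candidates = [
        (word, prob) for word, prob in word_probs.items()
        if to_keys(word) == key_sequence
    ]
    return [word for word, _ in sorted(candidates, key=lambda wp: wp[1], reverse=True)]


# Hypothetical unigram probabilities over a tiny names dictionary.
word_probs = {"kim": 0.4, "lin": 0.3, "jim": 0.2, "lee": 0.1}

print(disambiguate("546", word_probs))  # ['kim', 'lin', 'jim']
```

Three names share the key sequence 546, so key presses alone cannot separate them; this residual ambiguity is what speech input could help resolve.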
What is needed in the art is a system and method for spelling recognition that uses information from keypad input, together with improved strategies for combining non-speech input, such as telephone keypad input, with voice for highly accurate recognition of spellings.