A. Field of the Invention
The invention relates to speech recognition systems and, more particularly, to a system for recognizing words based on the continuous spelling thereof by a user and, when possible, for prompting the user with an early identification of the word being spelled.
B. The Prior Art
Speech recognition systems convert spoken language to a form, such as a data string, that is easily managed by a computer. Once converted to a data string, the information may then be used by the computer in a variety of ways. For example, it may be stored or output by the computer in textual form or it may be used to control a physical system. Since speech is the most common communication medium among people, significant effort has been directed at developing and improving speech recognition systems.
One commercially desirable application of speech recognition is a telephone directory response system in which the user supplies information of a restricted nature, such as the name and address of a telephone subscriber, and receives, in return, the telephone number of that subscriber. An even more complex system is a telephone ordering system in which the user supplies user-specific information (e.g., name, address, telephone number, special identification number, credit card number, etc.) as well as transaction-specific information (e.g., nature of item desired, size, color, etc.) and the system, in return, provides information to the user concerning the desired transaction (e.g., price, availability, shipping date, etc.).
The recognition of natural, unconstrained speech by a speaker-independent computer recognizer remains a complex and unsolved problem. The greatest difficulty arises from the enormous variations with which the same word or words may be pronounced by different people and even by the same person under different circumstances. This difficulty is exacerbated when there is environmental or background noise or when an inherently noisy transmission medium is being used (e.g., a telephone line). As a result, speech recognition systems often seek to simplify the recognition task in various ways. For example, they may require the speech to be noise-free (e.g., by using a good microphone), they may require the speaker to pause between words, or they may limit the vocabulary that can be understood to a small number of words.
A description of the current state of the art in speech recognition systems may be found in D. Pallett, J. Fiscus, W. Fisher, J. Garofolo, B. Lund, A. Martin and M. Przybocki, "1994 Benchmark Test for the ARPA Spoken Language Program," Proc. Spoken Language Systems Technology Workshop, Jan. 22-25, 1995, Austin, Tex. Furthermore, an example of an interactive, speaker-independent speech recognition system is the SUMMIT system being developed at the Massachusetts Institute of Technology. This system is described in Zue, V., Seneff, S., Polifroni, J., Phillips, M., Pao, C., Goddeau, D., Glass, J., and Brill, E., "The MIT ATIS System: December 1993 Progress Report," Proc. ARPA Human Language Technology Workshop, Princeton, N.J., March 1994, among other papers. Although still unable to recognize natural, unconstrained speech, as mentioned above, commercial adaptations of these research systems and other commercially available systems have attained a level of performance that makes certain general-use applications feasible.
Since present speech recognizers, especially user-independent systems, are unable to recognize every word being spoken, it is desirable for such systems to employ some type of fall-back mode or procedure for those situations in which the recognizer fails to identify a given word or words. These fall-back procedures, moreover, must be highly accurate. Otherwise, the user, who has already encountered a misrecognition or non-recognition by the system, may become sufficiently frustrated as to terminate his or her interaction with the system.
For speech recognition systems that are accessed over the telephone, one such fall-back procedure is to instruct the user to spell the desired word using the keys of a touch-tone telephone. Each key typically represents three possible letter choices. To distinguish between the various letters assigned to each key, the user may be instructed to press two keys for each letter. For example, to select the letter "A", which is the first letter listed on the numeral "2" key, the user would press the "2" key to identify the letter group desired and then the "1" key to identify the position within the letter group of the desired letter. Although such systems are eventually able to recognize most words, the procedure is cumbersome and time-consuming. A significant improvement to this approach is described in J. Davis, "Let Your Fingers Do the Spelling: Implicit Disambiguation of Words Spelled With the Telephone Keypad," Journal of the American Voice I/O Society 9:57-66 (March 1991). In this system, the user presses one key per letter and the system keeps track of all possible letter sequences represented by the succession of single keystrokes (each of which may represent any of three letters). Each possible letter sequence generated by the series of keystrokes is compared against a master list of allowed words stored in the system. Despite the large number of letter sequences, the system is typically able to match the keystrokes to a unique word in the list.
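The one-key-per-letter approach may be illustrated by the following minimal sketch. It is not taken from the Davis reference. The keypad layout (three letters per key, with Q and Z omitted), the function names and the word list are assumptions chosen for illustration only. Rather than enumerating every possible letter sequence, the sketch tests each allowed word against the keystrokes, which yields the same set of matches.

```python
# Assumed keypad layout: three letters per key, Q and Z omitted.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "prs", "8": "tuv", "9": "wxy",
}

def matches(keys, word):
    """True if every pressed key could stand for the letter at the same position."""
    return len(keys) == len(word) and all(
        letter in KEYPAD.get(key, "") for key, letter in zip(keys, word)
    )

def disambiguate(keys, word_list):
    """Return every allowed word consistent with the full keystroke sequence."""
    return [w for w in word_list if matches(keys, w.lower())]

# Hypothetical master list of allowed words.
WORDS = ["smith", "jones", "davis", "marx"]

print(disambiguate("76484", WORDS))  # ['smith']
```

Five keystrokes correspond to 243 possible letter sequences, yet only one of them appears in this small word list, which is why a single keystroke per letter is usually sufficient.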
The system disclosed by Davis also provides for early identification. Early identification is the selection of a word for presentation to the user before the user has entered all of the letters of the word. That is, a word is identified by the system and acted upon as soon as the sequence of keystrokes eliminates all but one possibility, even if the word has not yet been completely spelled by the user. Such systems provide improved response time and performance, by not requiring the complete spelling of every word.
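Early identification may be sketched as follows, under stated assumptions: the keypad layout, the function name `early_identify` and the word list are hypothetical and are not drawn from the Davis reference. After each keystroke, the candidate list is narrowed, and the word is reported as soon as exactly one candidate remains.

```python
# Assumed keypad layout: three letters per key, Q and Z omitted.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "prs", "8": "tuv", "9": "wxy",
}

def early_identify(keys, word_list):
    """Return (word, keys_used) once one candidate remains, else (None, keys_used)."""
    candidates = [w.lower() for w in word_list]
    for i, key in enumerate(keys):
        # Keep only the words whose i-th letter is on the pressed key.
        candidates = [
            w for w in candidates
            if i < len(w) and w[i] in KEYPAD.get(key, "")
        ]
        if len(candidates) == 1:
            return candidates[0], i + 1   # identified before spelling is complete
        if not candidates:
            return None, i + 1            # no allowed word fits the keystrokes
    return None, len(keys)                # still ambiguous after all keystrokes

# Hypothetical list of allowed words.
WORDS = ["smith", "smythe", "jones", "johnson"]

print(early_identify("76484", WORDS))  # ('smith', 3)
```

In this example the third keystroke separates "smith" from "smythe", so the remaining two keystrokes are never needed.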
Nevertheless, there are several disadvantages to spelling an unrecognized or misrecognized word via the keys of a touch-tone telephone. First, such a system obviously relies on a telephone keypad that is accessible to the user and connected to the system. Thus, the system has limited applicability. Second, it is disruptive to the user to begin voice interaction with a computer system and then switch to touch key entry. Third, since most users have not memorized the location of each letter on a touch-tone keypad and since certain letters are missing, it is awkward and time-consuming to spell in this fashion. It would be more desirable for the user to speak to the system than to use the touch keys of a telephone.
Spoken letter recognition by a computer, however, is a difficult problem to solve for several reasons. First, many letters, such as those comprising the "E-set" (i.e., B, C, D, E, G, P, T, V and Z), are often confused with one another. Furthermore, on the telephone, the speech signal is often degraded by bandpass filtering and the quality of some telephone components, causing additional confusion between letters such as "S" and "F". In addition, if the letters are spoken continuously, the boundaries between the end of one letter and the beginning of the next are not readily apparent, causing two possible problems. First, confusion among letter sequences may occur (e.g., A J versus H A). Second, the recognizer may erroneously insert or delete letters in its hypothesis, thereby detecting the wrong number of spoken letters and making recognition even more difficult. Since prior art speech recognition systems are unable to overcome these problems, they are not sufficiently accurate for use as a fall-back procedure.
A "discrete-spoken" spelling system that separately prompts the user to speak each letter of the unrecognized or misrecognized word (e.g., "state first letter", "state second letter", etc.) is described in Marx, M. (co-applicant herein) and Schmandt, C. "Reliable Spelling Despite Poor Spoken Letter Recognition" Proc. of the American Voice I/O Society, San Jose, Calif., Sep. 20-22, 1994. By prompting the user for each letter, this approach avoids any confusion over how many letters were spoken by the user. In addition, the discrete-spelling system can identify and process each spoken letter separately, potentially resulting in greater recognition accuracy as compared to a continuous spelling system.
The discrete-spoken spelling recognition system disclosed by Marx and Schmandt also incorporates the implicit disambiguation and early identification features of the touch-tone system described by Davis. As in the touch-tone system, the discrete-spoken spelling recognition system keeps track of all possible letter sequences while the user continues to spell the desired word(s) by stating each letter. Similarly, each letter sequence is compared with a list of allowable words and the system identifies the spelled word once the list is narrowed to only one possibility. This occurs even if the user has not yet spelled the word(s) to completion. Recall that in the touch-tone system, ambiguity arises from having three possible letters for each key. In the speech recognition system, the ambiguity arises from misrecognition of the individual letters by the recognizer employed in the system. That is, one letter may be confused with another (e.g., M for N).
As part of the development of the discrete-spoken spelling recognition system, a list of likely misrecognitions between letters was developed for each letter by running the speech recognition engine on various spoken examples of each letter (e.g., different speakers, different accents, etc.). The list thus provides a set of possible letters that may have been spoken by the user for each letter hypothesized by the speech recognition engine. For example, if the recognition engine returns a "v", the list indicates that the user might have said "b", "d", "e", "p", "v" or "z". This set of possible letters is then used by the disambiguation aspect of the system to generate all possible letter sequences for comparison to the list of allowable words. The Marx and Schmandt system is thus able to provide a high level of accuracy, despite errors made by the speech recognition engine.
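The use of such a list of likely misrecognitions may be illustrated by the following minimal sketch, which is not the Marx and Schmandt implementation. Only the entry for "v" is taken from the example above. The remaining entries, the function names and the word list are assumptions introduced for illustration.

```python
# Letters the user may actually have said, keyed by the recognizer's hypothesis.
CONFUSION = {
    "v": set("bdepvz"),  # from the example in the text
    "m": set("mn"),      # hypothetical entry
    "n": set("mn"),      # hypothetical entry
}

def possible_letters(hypothesis):
    """Letters that may have been spoken; defaults to the hypothesis itself."""
    return CONFUSION.get(hypothesis, {hypothesis})

def narrow(hypotheses, word_list):
    """Narrow the allowed words letter by letter, stopping once at most one remains."""
    candidates = [w.lower() for w in word_list]
    for i, hyp in enumerate(hypotheses):
        allowed = possible_letters(hyp)
        candidates = [w for w in candidates if i < len(w) and w[i] in allowed]
        if len(candidates) <= 1:
            break
    return candidates

# Hypothetical list of allowable words.
WORDS = ["boston", "denver", "eugene", "dallas"]

# The recognizer hears "v" then "e"; the user actually said "d" then "e".
print(narrow(["v", "e"], WORDS))  # ['denver']
```

Although the first letter is misrecognized, "d" lies within the set of possible letters for "v", so the correct word survives and is identified after the second letter. If the spoken letter were absent from that set, the correct word would be eliminated.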
Nonetheless, discrete-spoken spelling systems, such as the one described above, have certain limitations. For example, because the user must wait for the system to prompt him or her for each letter, the system appears slow and time-consuming to the user. In sales-oriented and other voice recognition systems, it is extremely important that the system appear quick to the user. Indeed, a system that seems time-consuming or slow may be avoided by a user, possibly resulting in lost sales. In addition, if a user provides two letters despite being prompted for only one, or if the recognition engine so misidentifies a single letter that the corresponding set of possible letters does not include the actual letter spoken, then the system will not identify the correct word(s). The present invention is directed, in part, to overcoming these limitations by providing a highly accurate computer system for the recognition of continuously spoken spelling.