The disclosures herein relate generally to computer-based speech recognition interfaces and, more particularly, to an enhanced interactive command recognition system.
There currently exist a wide variety of commercially available word recognition systems. Examples of such systems include speech, or spoken word, recognition systems and hand-writing recognition systems. Even spell-checking systems can be considered to be a form of word recognition systems in the sense that they find the most probable words in a dictionary that most closely match a string of characters.
The systems of interest attempt to match acoustic input with stored acoustic patterns based on parameters that include frequency, relative amplitude, and duration to identify words in a predetermined vocabulary. In the case of handwriting, input strokes are analyzed based on parameters including height, angle, and spacing.
Building on analysis of the physical input such as that described above (e.g., acoustic patterns for audio input and input strokes for handwritten input), context can be used to enhance these systems. For example, a dictation system that receives the input "the next line of computers to be introduced will include high quality microphones" would simply enter that text as a sentence. If the system received "the next line of computers to be introduced will include high quality microphones next line", the system would enter "The next line of computers to be introduced will include high quality microphones." and go to the next line. Note that the first "next line" is an implicit command to enter text while the second "next line" is an explicit command to go to the next line. The difference in response is based on the context in which the words are used. Context (including grammar-based) can also be used to resolve words that sound identical such as to, too, and two.
Building on the concepts of input pattern recognition and context is equivalence of meaning. This concept is useful in command recognition and search engines. The object is to determine that two words or phrases are equivalent. This can be accomplished through lists of equivalent words or phrases. In addition, probabilities that each word/phrase in a list is equivalent to the input word/phrase determined by physical pattern matching can be maintained.
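By way of illustration only, the equivalence lists and associated probabilities described above might be sketched as follows. The table contents, the command names, the probability values, and the function name `resolve` are all hypothetical and are not taken from any particular system; the sketch assumes the input phrase has already been identified by physical pattern matching.

```python
# Hypothetical equivalence table: each canonical command maps to phrases
# considered equivalent to it, with an assumed probability of equivalence.
EQUIVALENTS = {
    "go to next line": {"next line": 0.95, "new line": 0.90, "line break": 0.70},
    "delete word": {"delete word": 0.95, "erase word": 0.85},
}


def resolve(phrase, table=EQUIVALENTS):
    """Return (command, probability) for the most probable equivalent, or None."""
    key = phrase.strip().lower()
    # Collect every command whose equivalence list contains the phrase.
    candidates = [
        (command, probabilities[key])
        for command, probabilities in table.items()
        if key in probabilities
    ]
    # Pick the candidate with the highest probability; None if nothing matched.
    return max(candidates, key=lambda item: item[1], default=None)
```

Under these assumed values, `resolve("New Line")` yields the command "go to next line" with probability 0.90, while a phrase absent from every list yields `None`.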
A basic distinction can be made between features and techniques in a system that distinguish inputs based on their physical characteristics and those that distinguish inputs based on their meaning. Systems that base their results on techniques that attempt to determine what the input means might be considered a form of natural language interface. Systems combining physical techniques with meaning-based, or logical, techniques may be embedded, as in the case of car phones, may be used to implement dictation functions or search engines, may be used purely for command recognition, or may exist as a form of natural language interface to a variety of computer applications. The terms logical and logical techniques will be used to refer to nonphysical/non-pattern-matching or intent/meaning-based recognition results and processes.
In any of the above cases, it will be recognized that it is possible to determine a probability that the system has properly determined either the identity, in the case of physical recognition, or the meaning, in the case of logical recognition, of a word or sequence of words, and that this probability may be assigned a numerical value. Examples of such systems are shown and described in U.S. Pat. No. 4,783,803 to Baker et al., U.S. Pat. No. 5,960,394 to Gould et al., and other prior art patents.
Clearly, in the case of a system that combines both physical and logical recognition, there will be instances in which the system cannot identify with sufficient certainty a particular word or series of words entered (spoken or written) by the user, as well as instances in which the system fails to recognize the word or words as a logical entity such as a command. In either case, the feedback to the user available from currently available systems would be something along the lines of "please repeat your statement" or "I think you said . . . ", leaving the user with no clear understanding of whether the system failed to recognize a word or words individually, or failed to recognize the meaning of the word or words.
Therefore, what is needed is a command recognition system that provides more detailed feedback to the user as to why an input was not recognized by the system. For example, a dictation system might receive the input "the fisherman lost his hook end line". The system could attempt to elicit clearer pronunciation through a request to repeat the statement more clearly if the acoustic certainty (between "and" and "end") was low. If the acoustic certainty of "end" was high, the system might elicit a rephrase of the command because "end line" was not a legal command. (The system may disallow it due to ambiguity; it might mean last line of page, last line of document, etc.) The example cited here is based on a dictation system, but even greater benefit would be derived from hands-free, eyes-free audio command systems or information access systems (e.g., search engines) driven by audio or handwritten input.
One embodiment, accordingly, provides an interactive command recognition system. In a preferred embodiment, responsive to a user inputting a command, or word string, to the interactive command recognition system, a physical recognition portion of the system performs physical recognition functions on the input word string and assigns to a number of candidate matches for each of the individual words of the command, a physical score based on the probability that the word was properly recognized by the system, and then computes an average A of these scores. Similarly, a logical recognition portion of the system performs recognition functions on the output of the physical recognition portion, assigns to each of its results a score based on the probability that the word is part of a recognized command, and then computes an average B of these scores.
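By way of illustration only, the computation of the averages A and B might be sketched as follows. The sketch assumes, for simplicity, that a single score between 0 and 1 has been retained for each word (for example, that of the best candidate match) and that a plain arithmetic mean is used; the function name `average_scores` and the sample values are hypothetical.

```python
from statistics import mean


def average_scores(physical_scores, logical_scores):
    """Return (A, B): the mean physical score and the mean logical score.

    Assumption: each list holds one probability-like score per result.
    """
    if not physical_scores or not logical_scores:
        raise ValueError("at least one score of each kind is required")
    a = mean(physical_scores)  # average A of the physical recognition scores
    b = mean(logical_scores)   # average B of the logical recognition scores
    return a, b


# Hypothetical scores for a three-word input.
A, B = average_scores([0.9, 0.8, 0.7], [0.2, 0.4])
```

Other ways of combining the scores, such as weighting words differently, would serve equally well provided that both a physical and a logical figure remain available for post-processing.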
These averages A and B can then be used in a variety of manners, depending on the particular implementation of the command recognition system. In one embodiment, if B is greater than a predetermined logical threshold, the command is executed. If B is less than the predetermined logical threshold and A is greater than a predetermined physical threshold, indicating that the words were understood by the system but the command was not, the user is advised to rephrase the command. In contrast, if both A and B are less than the respective thresholds, indicating that neither the words nor the command was understood by the system, the user is advised to repeat the command more clearly.
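By way of illustration only, the threshold comparison of this embodiment might be sketched as follows. The threshold values, the function name `advise`, and the returned labels are hypothetical. The text does not say how a score exactly equal to a threshold is handled; the sketch assumes such a score is treated as falling below the threshold.

```python
def advise(a, b, physical_threshold=0.6, logical_threshold=0.6):
    """Map the averages A and B to one of three responses.

    Assumption: a score equal to its threshold counts as below it.
    """
    if b > logical_threshold:
        return "execute"   # the command was understood
    if a > physical_threshold:
        return "rephrase"  # words understood, command not understood
    return "repeat"        # neither words nor command understood
```

With these assumed thresholds, `advise(0.9, 0.3)` returns "rephrase", while `advise(0.2, 0.3)` returns "repeat".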
In another embodiment, the averages A and B are weighted using appropriate constants and a sum of the weighted averages is compared to a predetermined threshold. In this embodiment, if the sum of the weighted averages is greater than the predetermined threshold, the command is executed. If the sum of the weighted averages is less than the predetermined threshold, the averages A and B are reweighted using constants that may be the same as or different from those used above, and a determination is made whether the reweighted average A is greater than the reweighted average B. If so, the user is advised to rephrase the command; otherwise, the user is advised to repeat the command more clearly.
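By way of illustration only, the weighted embodiment might be sketched as follows. The weighting constants `w_a` and `w_b`, the reweighting constants `r_a` and `r_b`, the threshold value, and the function name `advise_weighted` are all hypothetical, and a weighted sum exactly equal to the threshold is assumed to fall below it.

```python
def advise_weighted(a, b, *, w_a=0.5, w_b=0.5, threshold=0.6, r_a=1.0, r_b=1.0):
    """Decide a response from a weighted sum of the averages A and B."""
    # First pass: compare the sum of the weighted averages to the threshold.
    if w_a * a + w_b * b > threshold:
        return "execute"
    # Second pass: reweight and compare the two averages against each other.
    if r_a * a > r_b * b:
        return "rephrase"  # physical recognition was the stronger of the two
    return "repeat"        # physical recognition was the weaker of the two
```

With these assumed constants, `advise_weighted(0.8, 0.2)` gives a weighted sum of 0.5, which does not exceed the threshold, and then returns "rephrase" because the reweighted A exceeds the reweighted B.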
In yet another embodiment, the input word string is a search request. In this embodiment, a determination is made whether the quality for all matches is less than a Match Quality Threshold ("MQT"). Search engines will frequently provide quality ratings for each of the matches returned to the requester, such as one to five stars or a percentage to indicate the relative quality of the matches. The MQT is a value in similar units that indicates that adequate matches were found for the request. If the matches are not all below the MQT, the search results are acceptable and are output to the user with standard advice; otherwise, a determination is made whether A is less than a predetermined physical threshold. If so, indicating that one or more words may not have been correctly recognized, the search results are output to the user along with an indication that results may be improved by the user's repeating the input word string more clearly. If it is determined that A is not less than the predetermined physical threshold, the results are output to the user with standard advice.
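By way of illustration only, the search embodiment might be sketched as follows. The sketch assumes match qualities expressed as percentages; the MQT value, the physical threshold, the function name `search_advice`, and the returned labels are hypothetical. An empty result list is assumed to count as having no adequate match.

```python
def search_advice(match_qualities, a, mqt=60.0, physical_threshold=0.6):
    """Choose the advice that accompanies search results.

    match_qualities: quality ratings in the same units as the MQT.
    a: average physical recognition score A for the search request.
    """
    # True when every match is below the MQT (also true for an empty list).
    all_below = all(quality < mqt for quality in match_qualities)
    if not all_below:
        return "standard advice"      # at least one adequate match was found
    if a < physical_threshold:
        return "repeat more clearly"  # words may have been misrecognized
    return "standard advice"          # words were recognized; matches are poor
```

In each case the search results themselves are still output to the user; only the accompanying advice differs.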
A principal advantage of the embodiments is that they clarify the additional input needed from the user, whether it be rephrasing of an input command or a clearer repetition of the command. Note that these embodiments show physical and logical based functions of the recognition process occurring sequentially. This is done for clarity, and the only requirement of the invention is that scores for physical and logical based analysis be retained for post-processing or that the disclosed enhancements (e.g., computation of A, B) be partially embedded in the recognition process.