Increasingly, businesses, industries, and commercial enterprises, among others, employ automated telephone call systems with interactive voice response (IVR) offering self-service menus. Instances of contacting an actual human responder are becoming rare. These automated telephone call systems utilize technologies such as automatic speech recognition (ASR), which allows a computer to identify the speech utterances or words that a caller speaks into the telephone's microphone and match them with the voice-driven menu. Automated telephone call centers employing existing ASR technologies are prone to errors in the identification and translation of a caller's speech utterances and words. With the increased use of cordless and cellular telephones, these errors are compounded by the inherent noise and/or static found in such wireless systems. Hence, a large percentage of callers' speech utterances or words are distorted such that only partial units of information get processed by the automated telephone call systems, resulting in re-prompting of callers for menu selection choices that they previously stated, erroneous responses by the system, or no response at all.
A conventional method of automatic speech recognition (ASR) 100 is illustrated in FIG. 1, which requires that a caller first utter a speech utterance or word 110, which is then transcribed into text by ASR transcription 120 (speech-to-text conversion). The output of the ASR transcription (or text string) 120 is passed to the ASR interpreter/grammar module 130 for semantic interpretation or understanding. Typically, this form of ASR semantic interpretation involves a simple process of matching the recognized form (e.g., text string) of the caller's speech utterance or word with the pre-defined forms that exist in the grammar. Each matched item is typically assigned a confidence score by the system, so that when there is a positive match 140 with a high confidence score, the output is used by the dialog manager (not shown) to execute the next relevant action (e.g., to transition to a new dialog state or to satisfy the user's request) 160.
By contrast, when the recognized text string does not match the pre-defined existing forms in the grammar, the result is a negative match, or “No Match,” 150. Consequently, the conventional ASR system 100 must increase the error count and give the user additional tries by returning to the previous dialog state to ask for the same information all over again 170. The number of retries can be set by a voice user interface call flow variable; the usual practice is to cap the number of retries at a maximum of three, after which the system gives up and the caller is transferred to an agent. This is the source of the problem in the current implementation, namely the blanket rejection of utterances that do not match completely (100%) with the existing pre-defined forms in the grammar. For example, if a caller utters, “I want to speak to the director of Human Language Technology,” what may be recognized by the conventional ASR system 100 is only partial information such as “-anguage-logy”. Under the conventional matching process, the text strings “language” and “technology,” which are pre-defined in the grammar, will not match the partial forms “-anguage” and “-logy”, so the partial information is rejected by the ASR interpreter/grammar module 130 and treated as a No Match. As a result, the conventional ASR system 100 asks the caller to try again, and this repeats until a successful match (translation) is achieved within the limited number of tries; otherwise, the caller is transferred to an agent.
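The conventional flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of any particular ASR system: the names `GRAMMAR`, `MAX_RETRIES`, `interpret`, and `run_dialog_state` are hypothetical, confidence scoring is omitted, and "a maximum of three retries" is read here as one initial attempt plus up to three retries.

```python
from typing import Callable, Optional

# Hypothetical grammar: the pre-defined forms the interpreter accepts.
GRAMMAR = {"language", "technology", "director"}

# Assumed retry cap, following the usual practice of three retries.
MAX_RETRIES = 3


def interpret(text_string: str) -> Optional[str]:
    """Exact matching only: return the grammar form, or None for a No Match."""
    candidate = text_string.strip().lower()
    return candidate if candidate in GRAMMAR else None


def run_dialog_state(recognize: Callable[[], str]) -> str:
    """Prompt once, then retry up to MAX_RETRIES times before giving up.

    `recognize` stands in for the ASR transcription step and returns the
    text string recognized for one caller utterance.
    """
    for _attempt in range(1 + MAX_RETRIES):
        match = interpret(recognize())
        if match is not None:
            # Positive match: hand the result to the dialog manager.
            return f"next_action:{match}"
        # No Match: the error count rises and the caller is re-prompted.
    return "transfer_to_agent"
```

In this sketch a partially recognized form such as “-anguage” is rejected on every attempt, so the call is transferred to an agent even though the fragment carries usable information.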
In some instances, the developer may formulate post-processing rules which will map, for example, partial strings like “-anguage” to full forms like “language”. The problem is that this is not an automatic process; it very often occurs late in the development process (during the tuning of the application, some interval after the initial deployment), and only some items (high-frequency errors) are targeted for such post-processing rules. In other words, post-processing rules are selective (they apply to isolated items), manual (not automatic), and costly to implement since they involve human labor. Accordingly, the problem in the conventional ASR systems described above is that current speech systems simply fail to make any fine-grained distinction within the No Match classification. In other words, in instances where a caller's utterance or word does not match completely with what is listed in the ASR interpreter/grammar module 130, it is rejected as a No Match, as though it lacked any intelligence that could be used to respond to the caller and thus move the dialog with the automated telephone call system along to the next sequence. Upon reaching the maximum number of retries (and if the error persists), the call ends up being transferred to an agent. For the success of self-service automation and to increase user adoption of speech systems, it is extremely important to solve this problem, particularly as the majority of users' calls are made from a cordless or cellular phone which, as explained above, has poor quality of reception, thereby increasing the likelihood that a user's utterances or words will be only partially recognized.
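The manual post-processing rules mentioned at the start of the preceding paragraph might look like the following minimal sketch. The rule table and the names `POST_PROCESSING_RULES` and `interpret_with_rules` are hypothetical and serve only to illustrate why such rules are selective: a partial form is recovered only if a developer has written a rule for it.

```python
from typing import Optional

# Hypothetical grammar of pre-defined full forms.
GRAMMAR = {"language", "technology"}

# Hand-written rules added during tuning; only the high-frequency
# error "-anguage" has been covered in this assumed example.
POST_PROCESSING_RULES = {"-anguage": "language"}


def interpret_with_rules(text_string: str) -> Optional[str]:
    """Apply any manual rule, then match exactly against the grammar."""
    candidate = text_string.strip().lower()
    candidate = POST_PROCESSING_RULES.get(candidate, candidate)
    return candidate if candidate in GRAMMAR else None
```

Here “-anguage” is mapped to “language”, while “-logy” still yields a No Match because no rule was written for it.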
Having set forth the limitations of the prior art, it is clear that what is required is a method, system, or computer program storage device capable of fine-grained distinction within the No Match classification of an ASR system, to improve the success rate of self-service automation in automated telephone call systems with interactive voice response self-service menus.