Reducing recognition errors is an important issue in automatic speech recognition (ASR). Research shows that the recognition error rate can be effectively reduced when more information is used as a reference for recognition. The applicable information includes speech utterance information, speech semantic information, and dialogue context relation information.
Conventional speech recognition technology uses keyword spotting. If the keywords are correctly spotted, the dialogue can be correctly continued to accomplish the task. For conventional information access dialogue systems, such as those handling inquiries about weather, personnel information, ticketing and so on, a functional and usable system can be implemented by combining a high keyword spotting recognition rate with other technologies, such as different dialogue subsystems for different dialogue states.
In a more modern dialogue system, the relation between the system and the user is not as fixed as in the conventional systems, where one side asks and the other side answers. Because of the more complicated interaction pattern, a usable dialogue system cannot be implemented simply by keyword spotting technology. For example, in a language learning system, the user and the system may interactively ask each other questions, and answer each other's questions, to accomplish a task. FIG. 1 shows an exemplary dialogue in such a spoken dialogue system. As shown in FIG. 1, the user (U) and the system (S) use dialogue to agree on a time and an activity for mutual participation.
In this example, the dialogue is not always one side asking and the other side answering. Therefore, the following recognition errors may occur:
“Do you like dancing?” may be erroneously recognized as “I do like dancing”; and “Would you like to . . . ?” may be erroneously recognized as “What do you like to . . . ?”
In the above example, it is clear that keyword spotting technology may not be able to solve such problems, since the system is too focused on keywords, such as “dancing” in the above case. If the dialogue context information can be used in the speech recognition, the recognition rate may be greatly improved.
The current technologies include the use of historic dialogue content to improve the recognition rate. For example, Rebecca Jonson disclosed “Dialogue Context-Based Re-ranking of ASR hypotheses” at IEEE SLT 2006. The technique uses the utterance feature, the immediate context feature, the close-context feature, the dialogue context feature, and the possible list feature as references for determining the recognition error. The article uses only the contents of the two most recent dialogue turns as the basis for recognition.
Another technique that uses historic dialogue content computes statistical information about the previous dialogue, such as the cancel percentage, the error percentage, the number of system turns, and the number of user turns in the dialogue. This technique neither makes precise and accurate use of the related information in the dialogue content nor accurately describes the possible relations between the dialogue turns.
The current techniques usually use the previous dialogue sentence (usually one from the system) as the basis for determining the current sentence. However, in actual dialogue, the current sentence may be related to a plurality of previous sentences, instead of only the immediately preceding sentence. The current techniques may not effectively handle such situations. For example, the current techniques usually use N-gram models, and when n>3, the frequency distribution becomes very sparse.
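The sparsity noted above can be illustrated with a minimal sketch. The toy token sequence and the function name `singleton_fraction` are assumptions made for illustration only and do not come from any cited technique. The sketch counts, for each n, what fraction of the distinct n-grams occur exactly once; in this small sample the fraction grows with n.

```python
from collections import Counter


def singleton_fraction(tokens, n):
    """Return the fraction of distinct n-grams that occur exactly once."""
    # Slide a window of length n over the token sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical toy corpus (an assumption, not data from the source).
tokens = (
    "do you like dancing do you like swimming would you like dancing"
).split()

# Fraction of n-grams seen only once, for n = 1 through 4.
fractions = {n: singleton_fraction(tokens, n) for n in range(1, 5)}
```

With more context (larger n), most observed sequences are unique, so counts taken from them give little statistical support. Real corpora are far larger, but the same trend is what makes long histories difficult to model with plain N-gram counts.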
In a speech recognition system, rescoring the N-best list to improve the recognition rate is also a widely applied concept. In rescoring the N-best list, the emphasis is on using additional reference information to re-calculate the confidence measure of each item in the N-best list generated by the ASR. The rescored N-best list is believed to be more reliable than the original one if the reference information is carefully chosen.
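The general idea of N-best rescoring can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: the `Hypothesis` type, the `QUESTION_STARTERS` word list, the `context_score` heuristic, the linear interpolation weight, and the example scores are all hypothetical. The reference information here is simply whether a hypothesis matches the dialogue act (question or statement) that is assumed to be expected at this point in the dialogue.

```python
from dataclasses import dataclass

# Hypothetical list of words that commonly open an English question.
QUESTION_STARTERS = {
    "do", "does", "would", "what", "when", "where",
    "who", "how", "are", "is", "can",
}


@dataclass
class Hypothesis:
    text: str         # recognized word sequence
    asr_score: float  # confidence from the recognizer, assumed in [0, 1]


def context_score(text, expected_act):
    """Toy reference information: 1.0 if the hypothesis matches the expected act."""
    words = text.lower().split()
    is_question = bool(words) and words[0] in QUESTION_STARTERS
    act = "question" if is_question else "statement"
    return 1.0 if act == expected_act else 0.0


def rescore(n_best, expected_act, weight=0.5):
    """Re-rank an N-best list by interpolating ASR and context scores."""
    scored = [
        ((1.0 - weight) * h.asr_score
         + weight * context_score(h.text, expected_act), h)
        for h in n_best
    ]
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical N-best list in which the recognizer prefers the wrong item.
n_best = [
    Hypothesis("i do like dancing", 0.55),
    Hypothesis("do you like dancing", 0.45),
]
ranked = rescore(n_best, expected_act="question")
best = ranked[0][1]
```

In this sketch the recognizer alone ranks the statement first, while the combined score promotes the question once the context indicates that a question is expected. The quality of the result depends entirely on how well the reference information is chosen, which is the point made above.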