1. Field of the Invention
The present invention relates to a speech recognition based interactive information retrieval scheme aimed at retrieving a user's intended information through a speech dialogue with the user.
2. Description of the Background Art
The computer based speech recognition processing is a processing for matching a user input speech with a recognition target database, and calculating a similarity of the input speech with respect to every word in the database as a recognition likelihood. The current recognition technology has a limitation on the number of recognition target words for which the recognition result can be output within a real dialogue processing time, and a considerable amount of time is required until a response is returned to the user when the number of recognition target words exceeds this limit. Also, a lowering of the recognition accuracy due to an increase of the recognition target words is unavoidable. Moreover, the recognition accuracy is largely dependent on speakers and speech utterance environments, and a lowering of the recognition accuracy due to surrounding noise or due to incompleteness of the input speech uttered by a speaker can occur even in the case where a recognition device has high performance and accuracy, so that there is no guarantee of always being able to obtain 100% accuracy.
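The matching step described above, in which a likelihood is calculated for every word in the recognition target database, can be sketched as follows. This is only an illustration under stated assumptions: a plain string similarity from the Python standard library stands in for the acoustic matching of a real recognizer, and the function name `rank_by_likelihood` and the example words are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def rank_by_likelihood(utterance: str, vocabulary: List[str]) -> List[Tuple[str, float]]:
    """Score the input against EVERY word in the vocabulary and rank the results.

    Assumption: SequenceMatcher.ratio() is a stand-in for a recognition
    likelihood; a real recognizer would compare acoustic features instead.
    """
    scored = [
        (word, SequenceMatcher(None, utterance, word).ratio())
        for word in vocabulary  # one comparison per recognition target word
    ]
    # Highest likelihood first, as in the candidate presentation order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vocabulary; the work done grows with its size.
ranking = rank_by_likelihood("kanagawa", ["kagawa", "kanagawa", "osaka"])
print(ranking[0][0])
```

Because one comparison is made per word, the time to produce a ranking grows with the number of recognition target words, which is the limitation the text refers to.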
The conventional speech recognition based interactive information retrieval system carries out the recognition processing using a speech recognition device with respect to a user's input speech, keeps the user waiting until the processing is finished, and presents candidates obtained as a result of the recognition to the user sequentially in a descending order of recognition likelihood, repeating the presentation of candidates until a correct one is confirmed by the user.
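The confirmation procedure described above can be sketched as a simple loop. This is a minimal illustration rather than the implementation of any particular system; the names `confirm_candidates` and `is_correct`, the candidate words, and the likelihood values are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def confirm_candidates(
    candidates: Sequence[Tuple[str, float]],
    is_correct: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Present candidates in descending order of likelihood until one is confirmed.

    Returns the confirmed word (None if every candidate was rejected) and
    the number of presentations that were needed.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    presentations = 0
    for word, _likelihood in ranked:
        presentations += 1
        if is_correct(word):  # stands in for the user's yes/no response
            return word, presentations
    return None, presentations


# Hypothetical recognition result in which the top candidate is wrong.
recognized = [("Kanagawa", 0.61), ("Kagawa", 0.72), ("Kanazawa", 0.55)]
confirmed, turns = confirm_candidates(recognized, lambda w: w == "Kanagawa")
print(confirmed, turns)  # Kanagawa 2
```

Each recognition error at the top of the ranking adds one more presentation, so the number of turns depends directly on the recognition accuracy.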
On the other hand, in the case of utilizing speech as an interface for the information providing service, the real time performance and the accuracy are required. When there are many recognition target words, the target information is classified by an attribute tree formed by a plurality of hierarchical levels. Lower level attributes have a greater possibility of having the number of attribute values that exceeds the number that can be processed within the real dialogue processing time. In order to ascertain the user's intended target information, there is a need to determine an attribute value at each level, but a higher level attribute value can be automatically determined by tracing the tree once a lower level attribute value is determined (provided that the determined lower level attribute value and the related higher level attribute value are in one-to-one correspondence without any overlap). Consequently, it is possible to expect that the target information can be ascertained in a short time if it is possible to ascertain the lower level attribute value first.
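The tracing of the attribute tree from a determined lower level attribute value can be sketched as follows. The three-level tree, its placeholder values, and the name `trace_up` are assumptions made for illustration; the sketch also assumes that every lower level value has exactly one higher level value, as the proviso in the text requires.

```python
from typing import Dict, List, Optional

# Hypothetical attribute tree stored as child -> parent links.
# Each value appears once as a key, so no lower level value has two parents.
PARENT: Dict[str, Optional[str]] = {
    "prefecture_A": None,        # highest level
    "city_A1": "prefecture_A",   # middle level
    "town_A1a": "city_A1",       # lowest level
    "town_A1b": "city_A1",
}


def trace_up(value: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the determined value followed by all of its higher level values."""
    path = []
    current: Optional[str] = value
    while current is not None:
        path.append(current)
        current = parent[current]  # move one level up the tree
    return path


print(trace_up("town_A1b", PARENT))  # ['town_A1b', 'city_A1', 'prefecture_A']
```

Once the lowest level value is known, no further dialogue is needed for the higher levels, since each step upward is a single lookup.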
However, the conventional speech recognition based interactive information retrieval system does not allow the user to input the lower level attribute value first in view of the recognition error and the number of words that can be processed within a time that does not spoil naturalness of the dialogue with the user. Namely, it has been necessary to adopt a method for narrowing the recognition target words down to the number of data that can be processed within the real dialogue processing time by first asking a query for the higher level attribute for which the number of attribute values is small and requesting input, determining the attribute value by repeating presentation of candidates obtained as a result of the recognition in a descending order of recognition likelihood until the entered attribute value can be determined, and selecting only those attribute values that are related to the determined higher level attribute value among the next level attribute values as the next recognition target.
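The top-down narrowing described above can be sketched as follows. The tree contents and the names `CHILDREN` and `top_down_dialogue` are hypothetical, and the confirmed answer at each level is passed in directly instead of being obtained through recognition and confirmation.

```python
from typing import Dict, List

# Hypothetical attribute tree stored as parent -> children links.
CHILDREN: Dict[str, List[str]] = {
    "ROOT": ["pref_A", "pref_B"],
    "pref_A": ["city_A1", "city_A2"],
    "pref_B": ["city_B1"],
    "city_A1": ["town_A1a", "town_A1b"],
    "city_A2": ["town_A2a"],
    "city_B1": ["town_B1a"],
}


def top_down_dialogue(children: Dict[str, List[str]], answers: List[str]) -> List[int]:
    """Walk the tree from the highest level, one query per level.

    `answers` holds the attribute value confirmed at each level. The return
    value is the size of the recognition target set used at each level.
    """
    target_sizes = []
    node = "ROOT"
    for answer in answers:
        targets = children[node]  # only values related to the determined parent
        target_sizes.append(len(targets))
        if answer not in targets:
            raise ValueError(f"{answer!r} is not a recognition target under {node!r}")
        node = answer  # the next level is narrowed down by this value
    return target_sizes


sizes = top_down_dialogue(CHILDREN, ["pref_A", "city_A1", "town_A1b"])
print(sizes)  # [2, 2, 2]
```

In this toy tree each query has only two recognition targets even though there are four lowest level values in total, but three query rounds are needed before the target is ascertained.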
Such a conventional method cannot narrow down the next level recognition target attribute values unless the higher level attribute value is determined, so that the presentation of candidates to the user is repeated until the higher level attribute value is determined. Moreover, in this conventional method, a process including the attribute value input request, the candidate presentation and confirmation until the attribute value is determined for each attribute, and the narrowing down of the next level attribute values after the attribute value determination, is required to be repeated as many times as the number of hierarchical levels involved in order to ascertain the target information, and this number of repetitions is greater for the target information that has deeper attribute hierarchical levels, so that it has been difficult to ascertain the target information efficiently.
In a system for ascertaining target information from an information database that comprises a number of words exceeding the number that can be processed within the real dialogue processing time, in order to determine the (lower level) attribute value from which the target information can be ascertained, the user is kept waiting during the recognition processing and the confirmation process for sequentially presenting the recognition result is carried out. However, when it is difficult to determine the correct attribute value smoothly due to recognition errors, it is necessary to repeat the confirmation process many times despite the fact that the user has already been kept waiting, and this can make the dialogue unnatural and cause great stress on the user.
Consequently, in the current system based on the current speech recognition technology, it is impossible to allow the user's input starting from the lower level attribute value such that a reasonably accurate response can be returned without requiring a wait time to the user, and it is necessary to request the user's input sequentially from the higher level attribute value and repeat the attribute value determination. The recognition target words of the lower level are to be narrowed down by determining the higher level attribute value, so that the dialogue cannot proceed further until the higher level attribute value is determined. In other words, there is a need for the confirmation process until it becomes possible to determine the entered attribute value at each level.
If it is possible to ascertain the lower level attribute value first, the higher level attribute value can be ascertained automatically so that the target information can be ascertained efficiently, and in view of this fact, the currently used process for repeating query, determination and confirmation process until the determination with respect to each query sequentially from the higher level is very circumlocutory or circuitous for the user.
In particular, the user is forced to enter input from the higher level because input from the lower level is not allowed, the presentation and confirmation process must be repeated when it is not possible to obtain a correct attribute value as a top candidate due to recognition errors, and the attribute value input and the confirmation process must be repeated as many times as the number of hierarchical levels involved until the target information is ascertained (the lowest level attribute value is determined) even after determining each input by several trials of the presentation and confirmation process. Although these are indispensable processes for the system, they appear as very circuitous and superfluous processes for the user who prefers natural and short dialogues, and cause a great stress on the user.
As a method for ascertaining the target information while reducing stress on the user, allowing the user's input from the lower level attribute value can be considered, but this requires the determination of the attribute value that has the number of recognition target words exceeding the number that can be processed within the real dialogue processing time.
Also, in the computer based speech recognition processing, the recognition of speeches by unspecified speakers and of speeches uttered at irregular utterance speed is particularly difficult, and in addition the degradation of speech quality due to surrounding noise or the like can make 100% speech recognition accuracy practically impossible, so that the instantaneous determination of a speech retrieval key that is entered as the user's speech input is difficult.
Also, in the speech recognition based interactive information retrieval system, in order to realize natural dialogues with the user, it is a prerequisite for the system to return a response to the user's input in real time so that it does not appear unnatural to the human sense. However, there is a limit to the number of words that can be speech recognition processed within a prescribed period of time. For this reason, when the recognition target is a large scale database having a number of words that cannot be processed within a prescribed period of time, it is difficult to achieve the task requested by the user within a prescribed period of time through natural dialogues between the user and the system, without making the user conscious of the processing time required for the information retrieval at a time of the speech recognition processing by the system as well as the incompleteness of the speech recognition accuracy by the system.
Consequently, it is necessary to keep the user waiting while the system outputs the recognition processing result, and when the presented result turns out to be a recognition error, it is necessary to keep the user waiting further until another recognition result is presented, so that it is difficult, according to the current speech recognition technology, to construct a system using speech as an input interface that has both quickness and accuracy equivalent to a human operator based system.
Also, in the conventional retrieval method aiming at the determination of the retrieval key requested by the user with respect to a large scale database that cannot be processed in real time, because of the limitation on the number of data that can be speech recognition processed in real time, the user is urged to enter a retrieval assist key that can lead to the narrowing down of the retrieval key candidates such that the recognition targets can be reduced from the entire large scale database to the number of data that can be processed in real time, without allowing the user to enter the requested retrieval key immediately.
Here, the retrieval assist keys are selected to be data formed by the number of data that can be processed in real time, such that each retrieval key to be requested by the user always has one retrieval assist key as its higher level key, the retrieval assist key (higher level key) of the retrieval key to be requested is simple and obvious to the user, and lower level keys (the retrieval keys to be requested by the user) belonging to one retrieval assist key are formed by the number of data that can be processed in real time, so as to enable the determination of the retrieval key.
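The countable conditions placed on the retrieval assist keys above can be checked mechanically, as in the following sketch. The function name `assist_keys_are_valid`, the parameter `realtime_limit`, and the sample keys are hypothetical, and the condition that an assist key be simple and obvious to the user is a matter of human judgment that the sketch does not attempt to test.

```python
from typing import Dict, List


def assist_keys_are_valid(groups: Dict[str, List[str]], realtime_limit: int) -> bool:
    """Check the countable conditions on a set of retrieval assist keys.

    `groups` maps each assist key (higher level key) to the retrieval keys
    (lower level keys) belonging to it. `realtime_limit` is an assumed
    number of words that can be recognized in real time.
    """
    # The assist keys themselves must be recognizable in real time.
    if len(groups) > realtime_limit:
        return False
    seen = set()
    for retrieval_keys in groups.values():
        # The retrieval keys under one assist key must fit within the limit.
        if len(retrieval_keys) > realtime_limit:
            return False
        for key in retrieval_keys:
            # Each retrieval key must have exactly one assist key.
            if key in seen:
                return False
            seen.add(key)
    return True


sample = {"assist_1": ["key_a", "key_b"], "assist_2": ["key_c"]}
print(assist_keys_are_valid(sample, realtime_limit=2))  # True
print(assist_keys_are_valid(sample, realtime_limit=1))  # False
```

A grouping that fails any of these checks would either exceed the real time limit at some stage or leave a retrieval key with an ambiguous higher level key.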
Also, in the conventional retrieval method aimed at the determination of the retrieval key requested by the user using the speech input, the speech recognition processing with respect to the retrieval assist key (higher level key) is carried out first, and the obtained retrieval assist key (higher level key) candidates are presented to the user sequentially in a descending order of the recognition likelihood until a response indicating it is a correct one is obtained. When the retrieval assist key is determined, the retrieval key (lower level key) candidates having the determined retrieval assist key as the higher level key are extracted as the recognition target data, and the input of the retrieval key (lower level key) that the user really wants to request is urged to the user. Similarly as for the retrieval assist key, the retrieval key is determined by presenting the retrieval key candidates obtained by the speech recognition processing to the user sequentially in a descending order of recognition likelihood until a response indicating it is a correct one is obtained.
As such, the current speech recognition technology has a limit to the number of words for which the matching with the speech recognition database, the recognition likelihood calculation and the recognition result output can be carried out in real time, so that a longer recognition time is required when the number of recognition target words is increased. In the speech retrieval system using speech as input interface, when the recognition target is a large scale database, keeping the user awaiting during the speech recognition processing by the system can cause stress on the user, so that the current system carries out the narrowing down of the recognition target by utilizing the attribute values of the attribute items that each recognition target data has, so as to be able to output the recognition result in real time.
However, the current speech recognition technology is such that 100% speech recognition accuracy cannot be attained even when the recognition target is narrowed down to the number of words that can be processed in real time. In particular, the recognition of speeches by unspecified speakers, speeches uttered at irregular utterance speed, and speeches uttered under a noisy environment is difficult, so that the confirmation process for confirming the recognition result with the user is indispensable in order to ascertain the input speech. The confirmation process is a process for presenting the recognition candidates obtained by the speech recognition processing to the user sequentially in a descending order of recognition likelihood. The number of confirmation processes becomes larger as the input speech recognition accuracy becomes poorer. However, the user demands that the input interface offer handling equivalent to that of a human operator, so that the repeated confirmation processes can cause stress on the user.
In the current speech recognition based interactive information retrieval system using a large scale database as the recognition target, the user is first urged to enter the attribute value for the attribute item in order to narrow down the recognition target to the number that can be processed in real time, and is then urged to enter the requested retrieval key once the recognition target has been narrowed down according to the attribute values, so that the confirmation process is required for both the attribute value and the retrieval key. The attribute value input is an indispensable process in realizing the real time recognition processing from a viewpoint of the system, but it is circuitous for the user because the retrieval key that the user really wants to request cannot be entered immediately, and the confirmation process is carried out twice, once for the attribute value determination and once for the retrieval key determination, which causes further stress on the user.
Also, the retrieval system using speech as input interface and having a large scale database as the recognition and retrieval target is aiming at providing quick and accurate responses to the user such that the user may have an illusion of dialogue with a human operator, so that it has been necessary to adopt a query format that can lead to the narrowing down of the number of recognition target words effectively for the system such that the recognition processing time and the recognition accuracy can be compensated. For this reason, without allowing the input of the retrieval key that the user really wants to request immediately, the retrieval assist key that can lead to the narrowing down of the retrieval key is determined first. However, the user is forced to enter the input of the retrieval assist key first rather than the retrieval key that the user really wants to request and then urged to enter the retrieval key only after the retrieval assist key is determined, so that this process may appear to the user as if a superfluous process for the user (indispensable process for the system) is forced before the input of the retrieval key that the user really wants to request and can cause stress on the user.