1. Field of the Invention
The present invention relates to the technological field of dialogue-type information providing services, and in particular to a dialogue-type information providing system in which, when a user inputs a sentence denoting a request for information as a character string or speech, a confirmation sentence for confirming the content of the request is output to the user as a character string or speech. In the case that the user inputs a sentence denoting approval of the confirmation sentence as a character string or speech, the service records the request approved by the user. In the case that the user inputs a sentence denoting disapproval as a character string or speech, the service waits for the user to input a sentence denoting a new request for information as a character string or speech. After the type of user request has been unambiguously determined through the dialogue between the user and the information provider, the service outputs to the user, as a character string or speech, a response that provides information depending on the content of the request approved by the user. The dissatisfaction of the user can thereby be decreased.
The present invention also relates to a spoken dialogue apparatus, and in particular relates to a technology that confirms a user request by a spoken exchange, and minimizes the number of the exchanges (turns) with the user in processing the user request.
2. Description of the Related Art
A dialogue-type information presentation service, as represented by the spoken dialogue apparatus, generally comprises the following type of dialogue sequence. Suppose there is a database that stores various types of information such as weather information, television program schedules, event scheduling information, or timetables for transportation systems. If a user inputs a sentence denoting a request for information stored in the database, as a character string or by voice, using the keyboard, mouse, voice recognition apparatus, or a combination thereof on a computer, the information provider first outputs a series of sentences for confirming the content of the user's request to the user via the display, printer, typewriter, voice synthesis apparatus, or a combination thereof on a computer. In the case that the user inputs as a character string or spoken words a sentence denoting the approval of the confirmation, the information provider stores the content of the request that has been approved by the user. In the case that the user inputs as a character string or spoken words a sentence denoting disapproval of the confirmation, the information provider waits for the user to input a sentence denoting a further request. After this is repeated several times and the type of information that should be provided to the user is unambiguously determined, a response sentence for providing information depending on the content of the request approved by the user is output as a character string or voice.
In this type of dialogue-type information providing service, a first conventional technology for confirming the content of the user's request is a method in which all of the content of the request input by the user is confirmed. A second conventional technology is a method that applies when the content of the request generated from a sentence input by the user is assumed to be correct and the information type to be provided to the user has been unambiguously determined. Considering dialogue sequences in which all or a portion of the content of the generated request is confirmed, the user approves the confirmation, and the information provider outputs a response sentence depending on the approved content, this method determines the content to be confirmed by the user such that the total number of confirmation sentences and response sentences output by the information provider is minimized. Moreover, in both the first conventional technology and the second conventional technology, the content of the request of the user is represented as a set of combinations of attributes and values.
In the first conventional technology, when the sentences input by the user are analyzed and a set of combinations of attributes and values that represent the content of the request of the user is generated, sentences for summarizing all of the attributes that form the content of this request are output. In this conventional technology, there is the problem that an increase in the dissatisfaction of the user accompanies the recognition errors when generating the content of the request based on the sentences input by the user. This will be explained next.
Because recognition error of the character string or the voice accompanies the generation of the content of the request based on sentences that the user inputs as a character string or by voice, there are cases in which the content of a request is generated that differs from the content of the request that the user intends. In this case, the content of the sentence output by the information provider for confirmation differs from the content of the request that the user intended. Because the user inputs a sentence that denotes disapproval, the information requested by the user is not output to the user. The user inputs content for making the request again, the information provider requires reconfirmation, and thereby the dissatisfaction of the user increases.
In order to avoid the increase in the dissatisfaction of the user, it is necessary to decrease the number of confirmation sentences output by the information provider. Thus, in the second conventional technology, in the case it is assumed that the content of the request generated from a sentence input by the user is correct and the information type provided to the user has been unambiguously determined, in a chain of dialogue sequences in which all or a portion of the attributes included in the content of the generated request is confirmed, the user approves the confirmation, and the information provider outputs a response sentence depending on the requested content that has been approved, the attributes to be confirmed by the user are determined such that the total of individual confirmation sentences and response sentences are minimized.
The total of the numbers of confirmation sentences and response sentences output during the dialogue sequence is called the dialogue cost of the dialogue sequence. The reason for taking into consideration the number of response sentences is that there is the possibility that the degree of the user's dissatisfaction will increase if the number of response sentences increases.
The second conventional technology has the limitation that it can be applied only in the case in which it is assumed that the content of the request generated based on sentences input by the user is correct and the type of information to be provided to the user is unambiguously determined. In the case that this assumption is not satisfied, the second conventional technology cannot be used and the first conventional technology must be used instead, so there is the problem that the dissatisfaction of the user increases.
Below, a concrete example of the problems of the first conventional technology and the second conventional technology will be explained. As an example, consider an information providing service relating to weather information. There are two types of information that can be provided to the user: the weather at a certain location and time, and warnings announced at certain locations. The content of the user's request is represented by two attributes (items): location and information type. The location attribute can take the value of a prefecture name such as Kanagawa Prefecture or Kagawa Prefecture, and the information type attribute can take the values weather and warning.
Now, suppose a situation in which the information provider maintains the data “presently no warnings have been issued anywhere”. At this time, the user inputs a sentence denoting the request “Please inform me about warnings for Kanagawa Prefecture,” and due to recognition errors or the like, the information provider mistakenly recognizes the content of the user's request to be “Please inform me about warnings for Kagawa Prefecture.” The content of the generated request is the set of attributes and values wherein the value of the attribute location is Kagawa Prefecture and the value of the attribute information type is warning. This set of attribute and value combinations is written as follows:
{<location, Kagawa Prefecture>, <information type, warning>}
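The attribute and value notation above can be illustrated with a minimal Python sketch. The dictionary representation and the helper names format_request and mismatched_attributes are assumptions introduced here for illustration only; they are not part of the conventional technologies described.

```python
# Hypothetical representation: a request maps each attribute to its value.
recognized_request = {"location": "Kagawa Prefecture", "information type": "warning"}
intended_request = {"location": "Kanagawa Prefecture", "information type": "warning"}


def format_request(request):
    """Render a request in the {<attribute, value>, ...} notation used in the text."""
    pairs = ", ".join(f"<{attr}, {value}>" for attr, value in request.items())
    return "{" + pairs + "}"


def mismatched_attributes(recognized, intended):
    """Return the attributes whose recognized value differs from the intended one."""
    attributes = set(recognized) | set(intended)
    return sorted(a for a in attributes if recognized.get(a) != intended.get(a))


print(format_request(recognized_request))
print(mismatched_attributes(recognized_request, intended_request))
```

In this sketch only the location attribute is reported as mismatched, which corresponds to the recognition error in the example.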
In this case, in the first conventional technology, the information provider outputs the sentence for confirmation, “Are you interested in warnings for Kagawa Prefecture?”, in order to confirm all of the attributes included in the content of the request generated based on the sentence of the user. The user inputs the sentence “No”, denoting disapproval, since the content of the intended request differs from the content of the confirmation, and then must input the sentence “warnings for Kanagawa Prefecture” to make the request again. Next, the information provider generates a new content of the user's request. This time, when the content of the request of the user is correctly generated, the information provider outputs the confirmation sentence “Are you interested in warnings for Kanagawa Prefecture?” The user inputs the sentence “Yes”, denoting approval, and the information provider outputs the one response sentence “No warnings for Kanagawa Prefecture have been issued.” Thus, the information provider outputs a total of three sentences: a confirmation sentence “Are you interested in warnings for Kagawa Prefecture?”; a confirmation sentence “Are you interested in warnings for Kanagawa Prefecture?”; and a response sentence “No warnings for Kanagawa Prefecture have been issued.” The user is asked for confirmation twice, which increases the dissatisfaction.
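The exchange just described can be traced with a small simulation. This is only a sketch under stated assumptions: the function confirm_all_dialogue and the idea of passing one recognition result per user attempt are hypothetical devices for counting the sentences output by the information provider.

```python
def confirm_all_dialogue(intended, recognition_results, response_sentences=1):
    """Simulate confirming every attribute on every attempt (first conventional technology).

    recognition_results holds one recognized request per user attempt (an assumption
    of this sketch). Returns the sentences output by the information provider.
    """
    output = []
    for recognized in recognition_results:
        output.append(f"confirm: {recognized}")
        if recognized == intended:
            # The user answers "Yes", so the response sentences are output.
            output.extend(["response"] * response_sentences)
            return output
        # Otherwise the user answers "No" and makes the request again.
    return output  # no approval within the given attempts


intended = {"location": "Kanagawa Prefecture", "information type": "warning"}
attempts = [
    {"location": "Kagawa Prefecture", "information type": "warning"},   # misrecognized
    {"location": "Kanagawa Prefecture", "information type": "warning"},  # correct
]
sentences = confirm_all_dialogue(intended, attempts)
print(len(sentences))
```

With one misrecognition followed by one correct recognition, the sketch counts two confirmation sentences and one response sentence, matching the total of three in the example.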
The second conventional technology determines which among the attributes included in the request content should be confirmed. In the presently assumed example, as indicated below, there are two attributes included in the content of the request: the location “Kagawa Prefecture” and the information type “warning”:
{<location, Kagawa Prefecture>, <information type, warning>}
To repeat, due to recognition error, the content of this request differs from the content of the request for information about warnings in Kanagawa Prefecture that the user wishes to know.
First, all dialogue sequences in which the content of the request is confirmed, the user's approval is input, and a response is output are generated, and the dialogue sequence is selected for which the total of the number of confirmation sentences and response sentences, that is, the dialogue cost, is minimized. Here, two dialogue sequences are considered: a dialogue sequence A, in which both attributes, location and information type, are confirmed until the approval of the user is input, and subsequently one response sentence is generated whose content is that no warnings have been issued for the location approved by the user; and a dialogue sequence B, in which one of the two attributes, location or information type, is confirmed until the user's approval is input, and subsequently one response sentence is generated whose content is that no warnings have been issued anywhere.
In the second conventional technology, when an attribute is confirmed, it is necessary to estimate the number of confirmation sentences that will be output until the user approves. This number depends on, among other things, the precision with which the value of the attribute is recognized. There is the possibility that the value of each of the attributes may differ from the request intended by the user due to recognition error. Thus, as the number of confirmed attributes increases, the possibility of the user disapproving the confirmation increases, and the number of confirmation sentences increases accordingly. Here, it is assumed that the number of confirmation sentences is estimated to be twice the number of attributes to be confirmed.
In sequence A, the confirmation sentence “Are you interested in warnings for Kagawa Prefecture?” is output. The user disapproves the confirmation because the content of the intended request and the content of the generated request differ, and the request is made again. It is estimated that until the user inputs approval, the number of confirmation sentences output while confirming the two attributes location and information type will be 4. Finally, the information provider outputs the confirmation sentence “Are you interested in warnings for Kanagawa Prefecture?”, the user approves this confirmation, and the one response sentence, “No warnings have been issued for Kanagawa Prefecture,” is output. The dialogue cost is 5, which is the total number of confirmation sentences and response sentences.
In sequence B, the confirmation sentence “Are you interested in warnings?” is output. Until an approval is input from the user, the number of confirmation sentences output while confirming the one attribute information type is estimated to be two. When the user approves the confirmation, the one response sentence, “There are no warnings issued anywhere,” is generated. The dialogue cost is 3, which is the total number of the confirmation sentences and response sentences.
In conclusion, sequence B, having the lowest dialogue cost, is chosen, and the information provider outputs the confirmation sentence, “Are you interested in warnings?”, in order to confirm only the attribute information type.
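The comparison of sequences A and B can be sketched as follows. The estimate of two confirmation sentences per confirmed attribute is the assumption stated in the text; the function and variable names are hypothetical and introduced only for this illustration.

```python
def estimated_confirmations(num_attributes):
    """Estimate from the text: twice the number of attributes to be confirmed."""
    return 2 * num_attributes


def dialogue_cost(confirmed_attributes, response_sentences):
    """Dialogue cost = confirmation sentences + response sentences."""
    return estimated_confirmations(len(confirmed_attributes)) + response_sentences


# Each sequence: (attributes confirmed, number of response sentences).
sequences = {
    "A": (["location", "information type"], 1),
    "B": (["information type"], 1),
}

costs = {name: dialogue_cost(attrs, n) for name, (attrs, n) in sequences.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

Under these assumptions the sketch reproduces the costs of 5 for sequence A and 3 for sequence B, and selects sequence B.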
In this manner, there are cases in which the problems of the first conventional technology are solved by the second conventional technology. However, application of the second conventional technology is restricted to the case in which, assuming that the content of the request generated based on the sentences that the user inputs is correct, the information type to be provided to the user can be unambiguously determined. In the case that this assumption cannot be satisfied, there are the problems that the second conventional technology cannot be used, the first conventional technology must be used, and the dissatisfaction of the user is increased.
For example, suppose the situation in which the information provider maintains as data in a database the data: “Presently no warnings have been issued anywhere.” The user inputs a sentence denoting the request, “Please inform me about warnings for Kanagawa Prefecture”, and due to a recognition error, the information provider mistakenly recognizes “Kanagawa Prefecture” as “Kagawa Prefecture”, and does not recognize “warnings” at all. The content of the request will be mistakenly recognized to be the content “Please inform me about Kagawa Prefecture”. The content of the generated request will be the following:
{<location, Kagawa Prefecture>}
When it is assumed that the content of this request is correct, there are two types of information that can be provided to the user: weather and warnings. The determination of which is intended cannot be made. Therefore, the second conventional technology cannot be applied. The first conventional technology must be used, and the confirmation sentence “Kagawa Prefecture?” is output. As has already been explained, because this differs from the content of the request intended by the user, the user inputs the content for the request again, a reconfirmation is required by the information provider, and the dissatisfaction of the user increases.
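The applicability condition described here can be expressed as a simple check. This is a sketch only: the names INFORMATION_TYPES, candidate_information_types, and second_technology_applicable are hypothetical, and the rule that a missing information type leaves every type as a candidate is an assumption drawn from this example.

```python
INFORMATION_TYPES = ("weather", "warning")


def candidate_information_types(request):
    """Return the information types still possible given the recognized request."""
    value = request.get("information type")
    if value in INFORMATION_TYPES:
        return [value]
    # No usable information type was recognized, so every type remains possible.
    return list(INFORMATION_TYPES)


def second_technology_applicable(request):
    """The second conventional technology needs exactly one candidate type."""
    return len(candidate_information_types(request)) == 1


print(second_technology_applicable({"location": "Kagawa Prefecture"}))
print(second_technology_applicable(
    {"location": "Kagawa Prefecture", "information type": "warning"}))
```

For the request containing only the location, two candidate types remain, so the check fails and the sketch mirrors the fallback to the first conventional technology.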
When the determination is based on the content of the request, the reason that the second conventional technology cannot be applied in the case that there is a plurality of information types to be provided is that no method is provided that compares the number of confirmation sentences and response sentences in the dialogue sequences for each type of information. However, even in the case that there may be a plurality of information types to be provided, by calculating the probability of each information type, determining the dialogue sequence that minimizes the total of the number of confirmation sentences and response sentences for each information type, and then taking into consideration the probability of each information type, there are cases in which the attributes to be confirmed can be selected so as to make the total of the number of confirmation sentences and response sentences as small as possible. Next, consider an example thereof.
In the situations assumed above, the types of information are either warnings or weather. Considering that the probabilities of the information types are equal, the probability of the information type “warning” is 0.5, and the probability of the information type “weather” is 0.5.
Next, for each of the provided information types, the dialogue sequence having the minimum dialogue cost is determined. This dialogue sequence is called the optimal dialogue sequence related to this information type, and its dialogue cost is called the optimal cost for this information type.
Consider when the information type is a warning. Under the assumption that this is a warning, since it is determined that there is one information type, the second conventional technology can be used. The dialogue sequence in which the total of the number of confirmation sentences and the number of response sentences is minimized is dialogue sequence B, wherein a response is output after the information type has been confirmed and the user has approved. The dialogue cost of dialogue sequence B is 3. The optimal dialogue sequence related to the information type “warning” is dialogue sequence B, and the optimal cost is 3.
Consider when the information type is weather. At this time, all the dialogue sequences are generated in which the content of the request is confirmed, the approval of the user is awaited, and a response is made. The dialogue sequence having the minimum dialogue cost is selected. Here, these dialogue sequences can be considered: a dialogue sequence C, in which the location is confirmed until the approval of the user is input, and next, the information type is confirmed until the approval of the user is input, and then a response is made; and a dialogue sequence D, in which the information type is confirmed until the approval of the user is input, the location is confirmed until the approval of the user is input, and then a response is made. In either of the dialogue sequences, the number of confirmation sentences is estimated to be 4. The response is assumed to consist of one sentence describing what the weather is at a certain location and at a certain time. The dialogue cost for either of the dialogue sequences is 5. The optimal dialogue sequences related to the information type “weather” are dialogue sequence C and dialogue sequence D, and the optimal cost is 5.
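The per-type optimal sequences can be computed as a minimum over the candidate sequences. The sketch below uses the dialogue costs given in the text (A = 5, B = 3, C = 5, D = 5); the data layout and the function name optimal are assumptions made for illustration.

```python
# Dialogue costs of the candidate sequences for each information type,
# taken from the worked example in the text.
sequences_by_type = {
    "warning": {"A": 5, "B": 3},
    "weather": {"C": 5, "D": 5},
}


def optimal(costs):
    """Return (optimal cost, names of all sequences that achieve it)."""
    best_cost = min(costs.values())
    best_sequences = sorted(name for name, c in costs.items() if c == best_cost)
    return best_cost, best_sequences


for info_type, costs in sequences_by_type.items():
    print(info_type, optimal(costs))
```

Returning every sequence that achieves the minimum, rather than a single one, reflects the tie between sequences C and D for the information type “weather”.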
Next, taking into consideration the probability of the provided information types, the attributes to be confirmed are determined such that the total of the number of confirmation sentences and response sentences can be made as small as possible. To this end, for each of the attributes that form the content of the request and for each information type, the dialogue sequence having the minimum dialogue cost is calculated assuming that this attribute is confirmed first. This dialogue sequence is called the next most optimal dialogue sequence for confirmation of the attribute, and its dialogue cost is called the next most optimal cost for confirmation of the attribute. The number that results from subtracting the optimal cost from the next most optimal cost for confirmation of the attribute is called the loss due to the confirmation of the attribute. The loss due to the confirmation of an attribute indicates how much heavier the cost becomes, compared to the optimal cost, when this attribute is confirmed first. Taking into consideration the probability of each of the provided information types, if the attribute that minimizes the expected value of the loss due to the confirmation of attributes is confirmed first, a dialogue sequence can be selected that has a dialogue cost as close as possible to the optimal cost.
For example, take into consideration the attributes “information type” and “location” as attributes to be confirmed. First, assume that the information type is confirmed. For the provided information type “warning”, the next most optimal dialogue sequence for confirming the information type becomes the dialogue sequence B. The next most optimal cost for the confirmation of the information type is 3, and the loss due to the confirmation of the information type is 0. For the provided information type “weather”, the next most optimal dialogue sequence for the confirmation of the information type becomes the dialogue sequence D, and the next most optimal cost for the confirmation of the information type is 5. The loss due to the confirmation of the information types is 0. Because the probabilities of each of the provided information types are equivalent, the expected value of loss due to the confirmation of the information types becomes 0.
Second, assume that the attribute “location” is confirmed. For the provided information type “warning”, the next most optimal dialogue sequence for the confirmation of the location becomes the dialogue sequence A. The next most optimal cost for the confirmation of the location is 5, and the loss due to the confirmation of the location is 2. For the provided information type “weather”, the next most optimal dialogue sequence for the confirmation of the location becomes the dialogue sequence C, and the next most optimal cost for the confirmation of the location is 5. The loss due to the confirmation of the location is 0. Because the probabilities of each of the provided information types are equal, the expected value of the loss due to the confirmation of the location is 0.5 × 2 + 0.5 × 0 = 1.
Therefore, if the content of the generated request is the content “Please inform me about Kagawa Prefecture”, by the confirmation sentence “Are you interested in warnings or weather?”, the attribute “information type” can be initially confirmed, and the dialogue cost can be made as small as possible.
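The selection of the attribute to confirm first can be sketched as a probability-weighted comparison of losses. The probabilities and costs are those of the worked example; the variable names and the function expected_loss are hypothetical, and the loss is taken as the cost when the attribute is confirmed first minus the optimal cost, which is the reading consistent with the loss of 2 given for location under “warning”.

```python
probabilities = {"warning": 0.5, "weather": 0.5}
optimal_cost = {"warning": 3, "weather": 5}

# Minimum dialogue cost for each information type when the given
# attribute is confirmed first (sequences B/D and A/C in the text).
cost_if_confirmed_first = {
    "information type": {"warning": 3, "weather": 5},
    "location": {"warning": 5, "weather": 5},
}


def expected_loss(attribute):
    """Probability-weighted extra cost of confirming this attribute first."""
    return sum(
        p * (cost_if_confirmed_first[attribute][info_type] - optimal_cost[info_type])
        for info_type, p in probabilities.items()
    )


expected = {attr: expected_loss(attr) for attr in cost_if_confirmed_first}
first_attribute = min(expected, key=expected.get)
print(expected, first_attribute)
```

With equal probabilities, the expected loss is 0 for the information type and 1 for the location, so the sketch chooses to confirm the information type first.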
In fact, when the location is confirmed first, after a number of the confirmation sentences for the confirmation of the location are output, the information type is confirmed, and in the case that the information type is understood to be warnings, the confirmation sentence output for the confirmation of the location is useless. The reason is that in the case of warnings, whether or not the location is confirmed, the number of response sentences is identical, and if this is the case, the dialogue cost for the dialogue sequence that does not confirm the location can be made as small as possible. In contrast, even if the information type is confirmed first, and whether the information type is warnings or weather, the confirmation of the information type is not useless, and the dissatisfaction of the user is not increased.
Irrespective of whether there exists an attribute such that the expected value of the loss due to the confirmation of attributes is minimal, in the conventional technology, when the determination is made based on the content of the request, in the case that there is a plurality of information types to be provided, there is the problem that all of the attributes in the content of the request are to be confirmed. Thus, increasing the dissatisfaction of the user, which should be avoidable by using attributes such that the expected value of the loss due to the confirmation of attributes is minimized, cannot be avoided.
In addition, in the case that the provided information types are determined unambiguously according to the content of the request that the user has approved, if the optimal dialogue sequence does not include any confirmations, outputting a response that does not carry out any confirmations produces the smallest dialogue cost, and thus a response sentence should be output depending on the optimal dialogue sequence. The conventional technology does not take this point into account.
On the other hand, in a spoken dialogue apparatus, the content of the spoken words of a user is understood by speech recognition, and the content of the request of the user is determined. However, speech recognition has limitations, and there is the possibility that the results of the recognition will include errors. Therefore, in order for the spoken dialogue apparatus to establish the content of the request of the user without depending only on the result of the speech recognition, the content of the words that have been understood by the apparatus must be confirmed.
In addition, when there is a difference between vocabulary and wording that can be accepted by the spoken dialogue apparatus, even if tentative confirmation of all the information in the range transmitted to the apparatus has completed, there are cases in which the content of the request of the user is not clear. In this case, the spoken dialogue apparatus must request information from the user.
The series of exchanges between the apparatus and the user generated by the confirmations and requests for information from this type of spoken dialogue apparatus is called a confirmation dialogue. If the content that can be processed (the task) changes, the object of confirmation also changes. A method is therefore necessary in which confirmation is carried out without increasing the number of dialogue exchanges even in the case that the task is updated.
Conventionally, in the case that a task is updated, the number of types of request that can be received by a spoken dialogue apparatus that operates without increasing the number of dialogue exchanges is limited to one.
In addition, there is another conventional technology wherein the dialogue is carried out using a minimum of labor and a plurality of requests are received. However, in the case that a task is updated, rules must be defined manually, and a confirmation sequence that does not increase the number of dialogue exchanges (number of turns) could not be applied automatically.
Thus, in order to carry out confirmation without increasing the labor of the user when a task is updated using the conventional technology, it was necessary to limit the type of the content of the request of the user that could be processed to one. In practice, however, this type of restriction is not acceptable.
In the case, for example, that video control is carried out using a spoken dialogue apparatus, at least “recording time setting”, “change recording time setting”, and “confirm recording time setting” must be supported. Thus, with just these, the device must already be able to accept three requests. In the case that a plurality of requests can be accepted, because the content to be confirmed differs for each request, the confirmation cannot be carried out in a sequence that is determined beforehand.
In addition, the conventional technology that can accept a plurality of requests and carry out a dialogue with little labor requires manually defining rules in the case that a task is changed.