Interactive systems have been proposed for retrieving information or operating applications through keyboard input or voice input. Such a system will be described taking as an example a system for programming recordings through spoken dialog using voice recognition technology. If the exact name of the program to be preselected for recording is known, the program can be preselected by speaking its name directly. For example, preselection programming can be performed through dialog of the following kind:
System: Say the name of the program you wish to preselect for recording.
User: News Ten (the name of the program).
System: Do you want to preselect “News Ten”?
User: Yes.
The objective can be attained by this exchange. In a case where the name of the program is not known exactly, the objective can be attained by speaking a program attribute, such as the channel or broadcast time period, and narrowing the candidates using that attribute as a search key. The following is an example:
System: What is the time of the program you wish to preselect for recording?
User: 9:00 PM.
System: What is the name of the channel of the program you wish to preselect for recording?
User: Channel 11, Tokyo.
System: There are two programs at 9:00 PM on Channel 11, Tokyo. Do you wish to preselect Drama XX?
User: No.
System: Do you wish to preselect Weather Information?
User: Yes.
Thus there has been proposed an interactive system which, at the initiative of the system, narrows down the number of search items in stages by having the user answer questions posed by the system, thereby attaining the objective.
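The staged narrowing in the dialog above can be sketched as follows. This is only an illustrative sketch: the `Program` record, the `narrow` and `run_dialog` helpers, and the time and channel given for "News Ten" are hypothetical and are not taken from any described system. Only the two 9:00 PM programs on Channel 11, Tokyo come from the example dialog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str
    time: str
    channel: str


# Hypothetical listing. The attributes of "News Ten" are invented for illustration.
PROGRAMS = [
    Program("Drama XX", "9:00 PM", "Channel 11, Tokyo"),
    Program("Weather Information", "9:00 PM", "Channel 11, Tokyo"),
    Program("News Ten", "10:00 PM", "Channel 1, Tokyo"),
]


def narrow(candidates, attribute, answer):
    """Keep only the candidates whose attribute equals the user's answer."""
    return [p for p in candidates if getattr(p, attribute) == answer]


def run_dialog(candidates, answers, confirmations):
    """Apply one attribute question per stage, then confirm survivors in order.

    answers: list of (attribute, answer) pairs, one per system question.
    confirmations: maps a program name to the user's yes/no reply.
    """
    for attribute, answer in answers:
        candidates = narrow(candidates, attribute, answer)
    for program in candidates:
        if confirmations.get(program.name, False):
            return program
    return None


selected = run_dialog(
    PROGRAMS,
    answers=[("time", "9:00 PM"), ("channel", "Channel 11, Tokyo")],
    confirmations={"Drama XX": False, "Weather Information": True},
)
print(selected.name)
```

Each answered question shrinks the candidate list, and the final yes/no confirmations pick one program from whatever remains.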
In a system-driven interactive system of the above-described kind, the expected value of the number of search items remaining after narrowing is generally used as the criterion for selecting the questions that narrow down the objects of the search. However, it is difficult to narrow down the search efficiently using the expected value alone.
The above will be described taking the preselection of television programs for recording as an example.
First, assume a case where the TV programs are narrowed down by their broadcast times (e.g., two time periods, namely AM and PM). For example, for 200 TV programs, if the number of programs in the AM period is 100 and the number in the PM period is 100, then the number of items to be searched next can be narrowed down from 200 to 100 by asking whether the desired program is in the AM or PM period.
Next, assume a case where TV programs are narrowed down by category (e.g., two categories, namely news programs and programs other than news programs). For example, for 200 programs, if the number of programs belonging to the first category (news) is 100 and the number of programs belonging to the second category (other than news) is 100, then the number of items to be searched next can be narrowed down from 200 to 100 by asking a single question in a manner similar to that above.
However, a problem arises if the numbers of programs belonging to the respective categories differ greatly from each other. For example, with regard to 200 programs, consider a case where only one program belongs to the first category and 199 programs belong to the second category. In this case, the expected value of the number of items to be searched after the question relating to category is answered, taken as the average over the two possible answers, is (1 + 199) / 2 = 100, which is no different from the above example. If the desired program happens to be the news program, then the program can be finalized by this single question. In any other case, however, the 200 items to be searched are reduced by only one and, hence, there is little narrowing effect. This means that a greater number of questions will be necessary, resulting in a longer time for the search.
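The arithmetic behind this comparison can be checked with a short sketch. It assumes that the expected value is the simple average of the group sizes over the possible answers, which is the reading that yields the figure of 100 stated above. The function names are hypothetical.

```python
def expected_remaining(group_sizes):
    """Average number of items left, taken evenly over the possible answers."""
    return sum(group_sizes) / len(group_sizes)


def worst_case_remaining(group_sizes):
    """Number of items left if the user's answer selects the largest group."""
    return max(group_sizes)


balanced = [100, 100]  # AM/PM split, or an even news / non-news split
skewed = [1, 199]      # one news program, 199 other programs

for label, sizes in (("balanced", balanced), ("skewed", skewed)):
    print(label, expected_remaining(sizes), worst_case_remaining(sizes))
```

Both splits have the same expected value, yet the skewed split can leave 199 of the 200 items in place, which is why the expected value alone does not distinguish a useful question from a poor one.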
With regard to voice recognition techniques, recognition performance varies from question to question because the recognition vocabulary available as candidates for answers differs depending upon the question presented to the user. Accordingly, in a case where a question whose answer candidates include similar-sounding words that are difficult to recognize is presented at the start, confirmation becomes laborious owing to erroneous recognition and, in the end, the search requires a long period of time.
For example, assume that the number of items to be searched is narrowed down from 200 to 100 by answering the question regarding AM or PM. Further, assume that the number of items to be searched is similarly narrowed down from 200 to 100 by answering the question relating to category.
In this case, the expected value and the systematic error are both the same, and therefore one may consider that no problem will arise regardless of which question is adopted. However, if the spoken answers to the AM/PM question and the spoken answers to the category-related question are compared, it will be seen that the former contain many words of similar pronunciation whereas the latter contain few. In other words, a difference in pronunciation-related features appears. When misrecognition by voice recognition is taken into account, it will be understood that the number of items to be searched that can be narrowed down differs depending upon the nature of the question asked.
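One way to picture this effect is the rough sketch below. It rests entirely on assumptions that do not appear in the text: each spoken answer is recognized correctly with a fixed probability, attempts are independent, and a misrecognized answer costs one extra turn to re-ask. The accuracy figures are invented purely for illustration.

```python
def expected_turns(recognition_accuracy):
    """Mean number of attempts until an answer is recognized correctly.

    Assumes independent attempts, so the count follows a geometric
    distribution with mean 1 / p.
    """
    if not 0.0 < recognition_accuracy <= 1.0:
        raise ValueError("recognition accuracy must be in (0, 1]")
    return 1.0 / recognition_accuracy


def items_eliminated_per_turn(before, after, recognition_accuracy):
    """Items removed from the search per dialog turn, on average."""
    return (before - after) / expected_turns(recognition_accuracy)


# Hypothetical accuracies: similar-sounding answers are assumed harder to recognize.
ASSUMED_ACCURACY = {"AM/PM": 0.70, "category": 0.95}

for question, accuracy in ASSUMED_ACCURACY.items():
    rate = items_eliminated_per_turn(200, 100, accuracy)
    print(f"{question}: about {rate:.1f} items eliminated per turn")
```

Under these assumptions the two questions narrow 200 items to 100 equally well on paper, but the question with the more confusable answers eliminates fewer items per turn once re-asking is counted.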