1. Field of the Invention
The present invention relates to an apparatus, a method, and a computer program product for processing speech and supporting communication among people.
2. Description of the Related Art
In recent years, many studies have been made on speech processing techniques including speech recognition and speech synthesis and on language processing techniques including machine translation. Also, many studies have been made on speech language processing techniques including speech translation, in which speech processing is combined with language processing. A large number of problems need to be solved before speech translation is actually put into use; however, people have high expectations for speech translation techniques as techniques that support communication between people who speak mutually different languages. Some products have already been put into practical use by solving technical problems through arrangements that appropriately limit the range of situations in which the products are used or that have the user cooperate with the system.
The levels of performance in speech translation techniques are expected to be higher in the future; however, it is not easy to achieve the ultimate goal of “having speech of both speakers translated correctly at all times in all situations”. For example, as for speech recognition, which is a part of speech translation techniques, it is not easy to consistently recognize the contents of the users' speech in every environment of use.
In the current technological situation, there is no guarantee that a correct translation result can always be obtained. Thus, to bring a speech translation technique to a practical-use level, it is important to be able to correct errors efficiently when a translation result contains an error and the contents of the speech uttered by the conversation partner are therefore not understandable.
Taking a look at communication among people, when one cannot hear what the other person is saying because it is noisy nearby, or when one cannot understand some of the words the other person has said, the errors are corrected and supplemented through interactions between the two people. For example, one will ask the other person to speak one more time, or one will check the meaning of a word with the other person. Accordingly, to raise speech translation techniques to a practical-use level, it is important not only to improve the level of performance in the various technical fields that are involved in the speech translation technique, but also to incorporate into the system an interface that is used for correcting errors efficiently.
When one cannot understand the contents of speech uttered by the other party, one of the simplest ways to correct the error is to ask the speaker to repeat the speech. This is the most reliable method to inform the speaker that the listener did not understand, regardless of the type of the error that has occurred during a speech translation process.
When this method is used, even if the listener has understood some part of the speech, the speaker will be asked to repeat the contents of the entire speech. Thus, the level of efficiency is low. In addition, it is not possible to inform the speaker of the reason why the speech was not translated correctly. Thus, even if the speaker repeats the speech, the same error may be repeated. As a result, there is a risk that the conversation may end in failure.
To cope with this problem, a technique has been proposed in which the listener is asked to select the portion of a translation result that he/she could not understand. Another proposed technique presents options for the reasons why the listener did not understand the translation result, so that the listener can select a reason from the options.
According to these techniques, the listener is able to point out only the part that he/she could not understand, instead of the entire speech, so the speaker is able to correct the error by speaking only the part that has been pointed out. Thus, it is possible to keep the conversation going efficiently. In addition, because the listener is allowed to select the reason why he/she could not understand, within a range of possible predictions, it is possible to reduce the possibility of repeating the same error.
However, there is a wide range of reasons why a translation result cannot be understood, and the listener is able to point out only a small portion of that range. More specifically, the reasons why a translation result cannot be understood may be broadly classified into a group of reasons originating in the speaker or the listener and a group of reasons originating in errors in the underlying techniques. Examples of the former group include a situation where the contents of the speech have been correctly translated but the listener has inadequate knowledge to understand them, and a situation where the speech itself contains an error. Examples of the latter group include reasons caused by errors in the technical fields that are involved in the speech translation technique, such as speech recognition and machine translation.
As for the latter group, the reasons related to machine translation can be further classified into errors in the interpretation of words having multiple meanings and errors in syntax analysis. The reasons related to speech recognition can be further classified into linguistic errors, such as unknown words, and acoustic errors caused by manners of speaking (e.g., the rate of speech and the sound volume of the voice) and by the usage environment (e.g., the presence of noise).
Of these various causes of errors, it is difficult for the listener to point out, for example, a problem in the manner of speaking of the speaker, because the listener does not understand the speaker's language. Accordingly, the listener is able to point out only a small portion of the wide range of causes of errors, such as a lack of knowledge on the part of the listener himself/herself or errors in the interpretation of words having multiple meanings. In particular, when the problem is an acoustic error in the speech recognition process, it is difficult for the speaker as well to notice the error, so there is a high risk that the same error will be repeated.
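The classification of error causes described above, and the distinction between causes the listener can and cannot point out, can be sketched as a simple data structure. The enumeration names below are purely illustrative assumptions introduced for this sketch; they do not appear in any of the techniques discussed.

```python
from enum import Enum, auto

class ErrorCause(Enum):
    """Illustrative taxonomy (hypothetical names) of reasons why a
    translation result may not be understood."""
    # Reasons originating in the speaker or the listener
    LISTENER_KNOWLEDGE = auto()        # listener lacks knowledge to understand
    SPEECH_CONTENT_ERROR = auto()      # the speech itself contains an error
    # Errors related to machine translation
    WORD_SENSE = auto()                # misinterpreted word with multiple meanings
    SYNTAX_ANALYSIS = auto()           # error in syntax analysis
    # Errors related to speech recognition
    LINGUISTIC_UNKNOWN_WORD = auto()   # linguistic error such as an unknown word
    ACOUSTIC_SPEAKING_MANNER = auto()  # rate of speech, sound volume, etc.
    ACOUSTIC_ENVIRONMENT = auto()      # surrounding noise

# Causes the listener can plausibly point out himself/herself; acoustic
# causes are excluded because the listener cannot detect them.
LISTENER_OBSERVABLE = {
    ErrorCause.LISTENER_KNOWLEDGE,
    ErrorCause.WORD_SENSE,
}
```

Under this sketch, an acoustic cause such as `ACoustic_ENVIRONMENT`-type noise falls outside `LISTENER_OBSERVABLE`, which mirrors the point that such errors are hard for either party to notice.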
To cope with this situation, a technique has been proposed in which a cause affecting the level of performance in speech recognition (e.g., the sound volume of the speech or the surrounding environment) is detected, and the detected cause is presented to the speaker as feedback (for example, see JP-A 2003-330491 (KOKAI)). As disclosed in JP-A 2003-330491 (KOKAI), in the example of a conversation between a machine (as represented by a robot) and a person, the following conditions are satisfied: the conversation takes place on unequal terms between the machine and the person, the speaker speaking to the machine is usually only one person, and the speaker is also the user of the machine. Thus, feedback given to the speaker works effectively both in the short term and in the long term.
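A feedback mechanism of the kind described above could be sketched as follows. This is a minimal illustration only: the threshold values, function names, and feedback messages are assumptions made for this sketch and are not taken from JP-A 2003-330491 (KOKAI).

```python
import math

# Illustrative thresholds (hypothetical values, not from the disclosure).
MIN_VOICE_RMS = 0.05   # below this, the voice is too quiet
MAX_VOICE_RMS = 0.8    # above this, the voice is too loud
MAX_NOISE_RMS = 0.1    # above this, the environment is too noisy

def rms(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def acoustic_feedback(voice_samples, noise_samples):
    """Detect acoustic causes that may degrade speech recognition
    and return feedback messages for the speaker."""
    messages = []
    voice_level = rms(voice_samples)
    if voice_level < MIN_VOICE_RMS:
        messages.append("Please speak louder.")
    elif voice_level > MAX_VOICE_RMS:
        messages.append("Please speak more quietly.")
    if rms(noise_samples) > MAX_NOISE_RMS:
        messages.append("The surroundings are noisy; please move somewhere quieter.")
    return messages
```

For example, a very quiet utterance in a silent environment would yield only the "Please speak louder." message, whereas a normal-volume utterance in a noisy environment would yield only the noise warning.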
However, the method disclosed in JP-A 2003-330491 (KOKAI), unlike a technique designed to support conversations among people, is not prepared to accommodate a situation where the device is operated by a plurality of users, including a person who does not own the device. Thus, because the feedback is returned to the speaker in a uniform manner, there are some situations where it is not possible to correct errors properly.
For example, when a person who is not the owner of the speech processing apparatus is the speaker, even if feedback instructing that the settings of the apparatus should be changed is returned to the speaker, the speaker cannot address the problem, because he/she is not familiar with the operation of the apparatus.
This kind of problem arises because, when the speech recognition technique is used to support communication among people, in other words, when a person communicates with another person via a machine, the people having the conversation are basically on equal terms, and the premise that the speaker is also the owner of the device no longer holds.