1. Field of the Invention
The present invention relates to a speech dialogue system for realizing an interaction between a computer based system and a human speaker by utilizing various input and output techniques such as speech recognition and speech synthesis.
2. Description of the Background Art
In recent years, it has become possible to realize so-called human-computer interaction in various forms by inputting, outputting, and processing multi-media such as characters, speech, graphics, and images.
In particular, in conjunction with a significant improvement in the capacities of computers and memory devices, various applications of workstations and personal computers which can handle the multi-media have been developed. However, such a conventional workstation or personal computer is only capable of handling the various media separately and does not realize any coordination among the media employed.
Meanwhile, it has become popular to use linguistic data of characters instead of the numerical data ordinarily used in a conventional computer.
As for the visual data, the capacity to handle the monochromatic image data ordinarily used in a conventional computer has been expanded to deal with color images, animated images, three dimensional graphic images, and dynamic images.
As for the audio data, in addition to the conventionally used technique for handling speech signal levels, progress has been made in developing various other techniques such as speech recognition and speech synthesis, but these techniques are still too unstable to realize any practical applications except in some very limited fields.
Thus, for various types of data to be used in a computer based system such as character data, text data, speech data, and graphic data, there is a trend to progress from the conventional input and output (recording and reproduction) functions to understanding and generation functions. In other words, there is progress toward the construction of a dialogue system utilizing the understanding and generation functions for various media such as speech and graphics, for the purpose of realizing more natural and pleasant human-computer interaction by dealing with the content, structure, and meaning expressed in the media rather than with the superficial manipulation of the media.
As for speech recognition, development has proceeded from isolated word recognition toward continuous word recognition and continuous speech recognition, primarily in specific task oriented environments with a view to practical implementation. In such a practical application, it is more important for the speech dialogue system to understand the content of the speech than to recognize the individual words, and there has been progress in speech understanding systems utilizing the specialized knowledge of the application field on the basis of a keyword spotting technique.
On the other hand, as for speech synthesis, development has proceeded from a simple text-to-speech system toward a speech synthesis system suitable for a speech dialogue system, in which greater weight is given to the intonation.
However, the understanding and the generation of media such as speech are not as simple as the ordinary input and output of data, so that errors or loss of information at the time of conversion among the media are inevitable. Namely, speech understanding is a type of processing which extracts the content of the speech and the intention of the human speaker from speech pattern data of enormous size, so that speech recognition errors or ambiguities unavoidably arise in the process of compressing the data.
Consequently, it is necessary for the speech dialogue system to actively control the dialogue with the human speaker so that it progresses as naturally and efficiently as possible, by issuing appropriate questions and confirmations from the system side, so as to make up for the incompleteness of the speech recognition due to the unavoidable recognition errors or ambiguities.
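For illustration only, the idea of issuing questions and confirmations from the system side can be sketched as follows. This is a minimal sketch and not a method described in this background: it assumes that the recognizer attaches a confidence score to each hypothesis, and the names `Hypothesis` and `choose_system_action`, the two threshold values, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One recognition result: the understood content and a score in [0, 1]."""
    content: str
    confidence: float


# Hypothetical thresholds; no values are given in the source text.
ACCEPT_THRESHOLD = 0.8
CONFIRM_THRESHOLD = 0.4


def choose_system_action(hypotheses):
    """Return (action, prompt) for the best-scoring hypothesis.

    "accept"  : the score is high enough to proceed without asking.
    "confirm" : the score is uncertain, so the system asks a confirmation.
    "reask"   : nothing usable was recognized, so the system asks again.
    """
    if not hypotheses:
        return ("reask", "Could you please repeat that?")
    best = max(hypotheses, key=lambda h: h.confidence)
    if best.confidence >= ACCEPT_THRESHOLD:
        return ("accept", f"{best.content}, certainly.")
    if best.confidence >= CONFIRM_THRESHOLD:
        return ("confirm", f"Did you say {best.content}?")
    return ("reask", "Could you please repeat that?")
```

Under these assumptions, a hypothesis scored 0.5 would lead to a confirmation question rather than being silently accepted or rejected, which is one simple way in which the system side can make up for an uncertain recognition result.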
Now, in order to realize a natural and efficient dialogue with a human speaker, it is important for the speech dialogue system to be capable of conveying as much information on the state of the computer as possible to the human speaker. However, in a conventional speech dialogue system, the speech response is usually given by a mechanical voice reading out a response obtained by text composition, without any modulation of the speech tone, so that it has often been difficult for the user to hear the message, and the message has sometimes been quite redundant. In other types of conventional speech dialogue system which do not use a speech response, the response from the system has usually been given only as visual information in terms of text, graphics, images, icons, or numerical data displayed on a display screen, so that the human-computer dialogue has relied heavily upon the visual sense of the user.
As described, in a conventional speech dialogue system, sufficient consideration has not been given to the use of the various media in the response from the system for the purpose of making up for the incompleteness of the speech recognition, and this has been a critical problem in the practical implementation of the speech recognition technique.
In other words, the speech recognition technique is associated with an instability due to the influence of noise and unnecessary utterances by the human speaker, so that it is often difficult to convey the real intention of the human speaker in terms of speech, and consequently the application of the speech recognition technique has been confined to severely limited fields such as the telephone, in which only the speech medium is involved.
Thus, the conventional speech dialogue system has been a simple combination of separately developed techniques related to speech recognition, speech synthesis, and image display, and sufficient consideration from the point of view of the naturalness and comfort of the speech dialogue has been lacking.
More precisely, the conventional speech dialogue system has been associated with two essential problems: a lack of naturalness due to the instability of the speech recognition caused by recognition errors or ambiguities, and a speech synthesis function that is insufficient to convey feeling and intent because of insufficient intonation control and insufficient clarity of the speech utterance.
Moreover, the conventional speech dialogue system has also lacked a sufficient function to generate an appropriate response on the basis of the result of the speech recognition.
Furthermore, there is an expectation for an improvement of the information transmission function by utilizing the image display along with the speech response, but the exact manner of using two dimensional or three dimensional image displays in relation to the instantaneously and continuously varying speech response remains an unsolved problem.
Also, it is important to determine what should be displayed in the speech dialogue system utilizing various other media.