Advances in computer processing techniques have made it possible to realize a speech dialogue system in which a speech synthesis technique for converting input text into read-out speech and a speech recognition technique for recognizing a person's utterances are used to carry on a spoken dialogue between a user and a response system and thereby solve a problem. With the development of communication networks, such a speech dialogue system can also be used through a communication network.
FIG. 1 illustrates a configuration of an example of such a speech dialogue system. The speech dialogue system is a center-type speech dialogue system including the response system arranged on a data center 2 (hereinafter, called “center 2”) on a communication network 1.
When a user 4 speaks to an input apparatus, such as a microphone, included in a terminal 3, the terminal 3 converts the speech to speech data and transmits the speech data to the center 2 through the communication network 1. The center 2 uses the speech recognition technique to recognize the content of the speech from the received speech data and performs dialogue control to create an answer according to the content of the speech. The center 2 then uses the speech synthesis technique to convert the answer to speech data. Subsequently, the terminal 3 downloads the speech data and display data from the center 2 through the communication network 1 and sequentially reproduces them. In this way, the user 4 can use the speech dialogue system as if the user 4 were talking with another person. A speech control menu 6 for displaying the answer, inputting speech, rewinding the speech, terminating the speech, or fast-forwarding the speech, as illustrated in a screen display 5, can be displayed on the terminal 3 to provide speech-based functions comparable to those of a Web browser or the like.
The center-type speech dialogue system can be used from portable terminals used by many people, such as smartphones, and has the advantage that highly accurate speech recognition and high-quality speech synthesis are possible using the extensive hardware resources of the center 2. The center-type speech dialogue system also has the advantage that information on the communication network, such as external services and Web information, can be used so that the center 2 can draw on real-time information when creating an answer.
If the center 2 creates an answer in a format of a so-called scenario describing a procedure of screen display and speech reproduction, the terminal 3 can not only reproduce the speech data, but can also display text and images.
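As an illustration of the scenario format described above, the following is a minimal sketch, assuming a scenario is an ordered list of steps that the terminal 3 executes in sequence. The `Step` type, the action names, and `run_scenario` are hypothetical names introduced only for this example; the actual scenario format is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    # Hypothetical action names: "display_text", "display_image", "play_speech".
    action: str
    payload: str


def run_scenario(steps: List[Step]) -> List[str]:
    """Walk the scenario in order and return a log of what the terminal would do."""
    log = []
    for step in steps:
        if step.action == "play_speech":
            log.append(f"play {step.payload}")
        elif step.action in ("display_text", "display_image"):
            log.append(f"show {step.payload}")
        else:
            raise ValueError(f"unknown action: {step.action}")
    return log


# Example scenario: show a text answer, then play the matching speech data.
scenario = [
    Step("display_text", "Tomorrow will be sunny."),
    Step("play_speech", "answer_001.wav"),
]
print(run_scenario(scenario))
```

Because the steps are ordered, the terminal can interleave screen display and speech reproduction simply by processing the list from the beginning.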
The speech dialogue system can be used to provide various services, such as information on nearby restaurants and tourist information, as well as for listening to the latest news or weather forecasts.
In relation to the speech synthesis technique, there is a known technique in which synthesized speech can be output without pauses even when reproduction of the synthesized speech starts before the speech synthesis process for the entire sentence has finished. In the technique, an input sentence is divided into divided sentences each consisting of one or a plurality of synthesis units, and the output of the synthesized speech is scheduled based on the responsiveness of the generation process that produces sound waveform data for each divided sentence and on the responsiveness of the formation process that combines the sound waveform data into synthesized speech.
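The scheduling idea can be illustrated with a small calculation. The following is a minimal sketch, assuming that the time at which each divided sentence's waveform becomes ready and its playback duration are both known or estimated in advance; it is not the scheduling method of the known technique itself. The function name `earliest_gapless_start` and its parameters are hypothetical.

```python
from typing import Sequence


def earliest_gapless_start(ready_times: Sequence[float],
                           durations: Sequence[float]) -> float:
    """Return the earliest playback start time (seconds) with no mid-speech pause.

    ready_times[i] is when segment i's waveform is available (assumed estimate),
    durations[i] is how long segment i takes to play back.
    """
    if len(ready_times) != len(durations):
        raise ValueError("ready_times and durations must have the same length")
    start = 0.0
    elapsed = 0.0  # total playback time of the segments before the current one
    for ready, duration in zip(ready_times, durations):
        # Segment i begins playing at start + elapsed, so it must be ready by then.
        start = max(start, ready - elapsed)
        elapsed += duration
    return start


# Three segments: the third is slow to synthesize, so playback is delayed to 2.0 s.
print(earliest_gapless_start([1.0, 2.0, 6.0], [2.0, 2.0, 1.0]))  # 2.0
```

With these example numbers, starting at 2.0 s means the third segment is needed exactly at 6.0 s, which is when it becomes ready, so playback begins well before the whole sentence has been synthesized.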
There is also a known technique in which prepared speech data of redundant words is output when speech synthesis data generated from an input conversational sentence has not been input for a certain time, so that silent periods in the conversation appear shorter and the stress on the other party of the conversation is reduced.
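The timeout behavior described above can be sketched as follows. This is a minimal illustration, assuming synthesized speech data arrives on a queue and that a prepared redundant word (filler) clip is available; `next_audio`, `synth_queue`, and `filler` are hypothetical names, and the use of Python's standard `queue.Queue` is a choice made for this example only.

```python
import queue


def next_audio(synth_queue: "queue.Queue[bytes]", filler: bytes,
               timeout_s: float) -> bytes:
    """Return the next synthesized clip, or the filler clip if none arrives in time."""
    try:
        # Block for at most timeout_s seconds waiting for synthesized speech data.
        return synth_queue.get(timeout=timeout_s)
    except queue.Empty:
        # Nothing arrived within the allowed time: fall back to the prepared clip.
        return filler


q: "queue.Queue[bytes]" = queue.Queue()
print(next_audio(q, b"<filler clip>", 0.05))   # queue is empty, so the filler is returned
q.put(b"<synthesized clip>")
print(next_audio(q, b"<filler clip>", 0.05))   # synthesized data is available
```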
In a speech dialogue process, there is a known technique of preventing conflict between a plurality of speech input and output processes. In the technique, a second speech process including speech output, which is executed according to a low-priority service scenario, is executed if its estimated required time is shorter than the estimated free time remaining until the timing of a first speech process executed according to a high-priority service scenario.
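The decision rule above reduces to a single comparison. The following is a minimal sketch, assuming both estimates are already available in seconds; the function and parameter names are hypothetical.

```python
def can_run_low_priority(estimated_duration_s: float,
                         now_s: float,
                         next_high_priority_start_s: float) -> bool:
    """Decide whether the low-priority (second) speech process may run now."""
    # Estimated free time until the high-priority (first) speech process.
    free_time_s = next_high_priority_start_s - now_s
    # Run only if the second process is expected to finish strictly within the free time.
    return estimated_duration_s < free_time_s


print(can_run_low_priority(3.0, now_s=0.0, next_high_priority_start_s=5.0))  # True
print(can_run_low_priority(5.0, now_s=0.0, next_high_priority_start_s=5.0))  # False
```

The comparison is strict because the text says "shorter than"; a process estimated to take exactly the free time is not started in this sketch.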
In the speech dialogue system, there is a known technique of quickly and accurately managing the order of dialogue between a user and an agent. In the technique, dialogue information analyzed from speech uttered by the user is used to generate first dialogue order information, and expression information analyzed from face images of the user is used to generate second dialogue order information. The first and second dialogue order information, state information of the system, the presence or absence of speech input by the user, and the no-response time of the user are used to determine the final order of dialogue.
In a speech content distribution system for distributing content for outputting speech to a terminal apparatus, there is a known technique of reducing the time before the terminal apparatus that has received the content outputs the speech. In the technique, a content distribution apparatus replaces a readout character string, that is, a character string to be read out as speech, described in content data with a phonetic symbol string, which is data for identifying the output speech. The terminal apparatus extracts the phonetic symbol string from the content data received from the content distribution apparatus and outputs the speech based on the extracted phonetic symbol string.
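The replacement step on the distribution side can be sketched as follows. This is a toy illustration only, assuming content data is a dictionary with a `readout` field and that phonetic symbols come from a small lookup table; the field names, `TOY_LEXICON`, and the ARPAbet-like symbols are all assumptions made for this example and are not taken from the known technique.

```python
from typing import Dict

# Hypothetical toy lexicon mapping words to ARPAbet-like phonetic symbols.
TOY_LEXICON: Dict[str, str] = {
    "hello": "HH AH L OW",
    "world": "W ER L D",
}


def to_phonetic(readout: str) -> str:
    """Convert a readout character string to a phonetic symbol string."""
    # Raises KeyError for words missing from the toy lexicon.
    return " | ".join(TOY_LEXICON[word] for word in readout.lower().split())


def replace_readout(content: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the content data with 'readout' replaced by 'phonetic'."""
    out = dict(content)
    out["phonetic"] = to_phonetic(out.pop("readout"))
    return out


content = {"title": "Greeting", "readout": "Hello world"}
print(replace_readout(content))
```

Doing this conversion before distribution means the terminal apparatus only has to extract the phonetic symbol string and produce speech from it, rather than analyzing the text itself.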