1. Field of the Invention
The present invention relates to a multimodal service that provides a user with information and services through a network using a plurality of different modalities, and more particularly to a method that can easily interlink multiple modalities that differ from one another.
2. Description of the Related Art
When providing a user with information and services through a network, it is conceivable to configure the system so that information is exchanged interactively: content sent from the server side is displayed on a monitor, such as a CRT or LCD monitor, that is provided to a terminal on the user side, and information that the user inputs with a keyboard, a mouse, or another input interface provided to that terminal is received on the server side. Such a visual interface, which makes use of the display of images, can display an information list on the monitor, which is advantageous in that it is easy to recognize, acquire, and select needed information from the list. Nonetheless, it is disadvantageous in that data input using the keyboard and input operations using a pointing device, such as the mouse, or another input interface are complicated, and users who are not used to such operations require a great deal of time to perform them.
In addition, it is also possible to provide a voice interface that is configured so that the terminal on the user side outputs content sent from the server side as voice and receives input of the user's voice. A typical example of such a voice interface is a telephone terminal. A voice interface is advantageous in that the dialogue advances by voice, so manual input is not needed and operation is simple; however, it is disadvantageous in that the output sent from the server side is also a time series output of voice, which makes it impossible to display a list of information or to easily recognize, acquire, or select needed information.
To make it possible for anyone to receive information and services simply and rapidly, it is preferable to simultaneously use the plurality of different modalities (interfaces) discussed above and to take advantage of their respective merits.
Such a system that makes it possible to provide information and services by synchronizing multiple, different modalities has been proposed in, for example, Patent Document 1 (Japanese Published Unexamined Patent Application No. 2005-148807). The system in Patent Document 1 is configured so that selection definition information, which corresponds to a content generation file for each modality, is prepared in advance, and the content generation files to be applied in accordance with the combination of the modalities to be synchronized are selected and output, thereby making it possible to provide information and services by synchronizing multiple modalities.
Normally, such a system is configured so that the session information for each modality of the plurality of modalities is managed individually, each of the modalities is individually authenticated, and the modalities are then associated based on, for example, information about the user who is using the terminal. Accordingly, there is a problem in that, when each individual modality starts a session, it cannot be linked to other modalities.
To solve such a problem, it is conceivable that when the user originates a call using one modality, an identifier number of the terminal is registered on the server side, and when the same user connects to the server side using another modality, the registered terminal identifier number is sent and the server side thereby recognizes that the two modalities belong to the same session, which makes it possible to interlink the modalities. The following explains how the modalities were linked in the past by taking as an example a case wherein a voice interface is linked to a visual interface.
(1) Authentication Using a Call Originator Number
The example shown in FIG. 15 is configured so that the user side comprises a voice terminal M1, which constitutes a voice interface, and a display terminal M2, which constitutes a visual interface. The voice terminal M1 is connected to a voice dialogue server S1 and comprises a voice output unit that uses voice to output content sent from the voice dialogue server S1, an input unit that receives input using the voice input of the user or DTMF (dual-tone multi frequency), and a data sending and receiving unit that sends and receives the voice content and the user's input to and from the voice dialogue server S1. The display terminal M2 is connected to a voice and visual dialogue server S2 and comprises: a display unit that displays content that includes, for example, image data and text data sent from the voice and visual dialogue server S2; an input unit that receives the input of user data; and a data sending and receiving unit that sends and receives content and input data received from the user to and from the voice and visual dialogue server S2.
Based on a dialogue scenario that is managed by the voice and visual dialogue server S2, the voice dialogue server S1 acquires corresponding content, sends it as voice content to the voice terminal M1, interprets the user input sent from the voice terminal M1, and sends the interpreted input to the voice and visual dialogue server S2 as input data.
The voice and visual dialogue server S2 manages the dialogue scenario with the user side about the service to be provided and, in accordance with requests from the voice terminal M1 and the display terminal M2, sends the corresponding content, advances the dialogue scenario in accordance with the input data from the voice terminal M1 and the display terminal M2, and manages the correspondence between the voice terminal M1 and the display terminal M2.
When the user originates a call using the voice terminal M1, the voice dialogue server S1 generates a call originator identifier (caller ID) based on the call originator number of the voice terminal M1 and sends the call originator identifier (caller ID) to the voice and visual dialogue server S2 for registration. The voice dialogue server S1 sends voice guidance to the voice terminal M1 to prompt the user to start up the display terminal M2 and connect the display terminal M2 to the voice and visual dialogue server S2.
If the user starts up the display terminal M2 and establishes a connection between the display terminal M2 and the voice and visual dialogue server S2, then the voice and visual dialogue server S2 sends content to the display terminal M2 that prompts the user to input the call originator number (call originator number of the voice terminal M1), searches for the corresponding call originator identifier (caller ID) based on the call originator number input by the user, generates a user identifier (user ID) for the display terminal M2, associates the generated user identifier (user ID) and the call originator identifier (caller ID), and registers that association. Simultaneously, the voice and visual dialogue server S2 sends the generated user identifier (user ID) to the display terminal M2.
Subsequently, the synchronization of the content sent to the voice terminal M1 and the display terminal M2 in accordance with the dialogue scenario makes it possible to provide information and services via multiple modalities that are linked.
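The association procedure of method (1) can be illustrated with a minimal sketch. It is not taken from any actual implementation: the class name CallerNumberLinker, its method names, and the identifier formats are all hypothetical, and a single in-memory object stands in for the voice and visual dialogue server S2.

```python
import itertools


class CallerNumberLinker:
    """Hypothetical in-memory stand-in for the voice and visual dialogue server S2."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._caller_id_by_number = {}  # call originator number -> caller ID
        self._caller_id_by_user = {}    # user ID -> caller ID

    def register_call(self, caller_number):
        """Invoked on behalf of S1 when the voice terminal M1 originates a call."""
        caller_id = f"caller-{next(self._ids)}"
        self._caller_id_by_number[caller_number] = caller_id
        return caller_id

    def connect_display(self, entered_number):
        """Invoked when the user types the call originator number at M2.

        Returns a new user ID on success, or None if no call with that
        number has been registered (for example, after a typing error).
        """
        caller_id = self._caller_id_by_number.get(entered_number)
        if caller_id is None:
            return None
        user_id = f"user-{next(self._ids)}"
        self._caller_id_by_user[user_id] = caller_id
        return user_id

    def linked_caller(self, user_id):
        """Return the caller ID associated with a user ID, if any."""
        return self._caller_id_by_user.get(user_id)


if __name__ == "__main__":
    s2 = CallerNumberLinker()
    caller_id = s2.register_call("0312345678")
    user_id = s2.connect_display("0312345678")
    print(user_id, "is linked to", s2.linked_caller(user_id))
```

In this sketch the association depends entirely on the number that the user types at the display terminal, so a single mistyped digit leaves the two modalities unlinked.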
(2) Authentication Using a One-time Password
Similar to authentication that uses a call originator number as discussed above, the example shown in FIG. 16 is configured so that the user side comprises a display terminal M3, which constitutes a visual interface, and a voice terminal M4, which constitutes a voice interface. The display terminal M3 is connected to the voice and visual dialogue server S3 and comprises: a display unit that displays content that includes, for example, image data and text data sent from the voice and visual dialogue server S3; an input unit that receives the input of user data; and a data communication unit that sends and receives content and user input data to and from the voice and visual dialogue server S3. In addition, the voice terminal M4 is connected to a voice dialogue server S4 and comprises: a voice output unit that uses voice to output content sent from the voice dialogue server S4; an input unit that receives the input of either a user's voice or DTMF (dual-tone multi frequency); and a data sending and receiving unit that sends and receives voice content and user input to and from the voice dialogue server S4.
Based on a dialogue scenario managed by the voice and visual dialogue server S3, the voice dialogue server S4 acquires corresponding content, sends it as voice content to the voice terminal M4, and sends the user input received from the voice terminal M4 to the voice and visual dialogue server S3.
The voice and visual dialogue server S3 manages the dialogue scenario with the user side regarding the service to be provided, sends corresponding content in accordance with requests from the display terminal M3 and the voice terminal M4, advances the dialogue scenario in accordance with data input from the display terminal M3 and the voice terminal M4, and manages the correspondence between the display terminal M3 and the voice terminal M4.
When the user uses the display terminal M3 to connect to the voice and visual dialogue server S3, the voice and visual dialogue server S3 generates a one-time password (receipt number) that corresponds to that session, generates a user identifier (user ID) for the corresponding display terminal M3, registers the correspondence therebetween, and then sends that correspondence to the display terminal M3.
Next, if the user originates a call using the voice terminal M4, then the voice dialogue server S4 generates a call originator identifier (caller ID) based on the call originator number of the voice terminal M4 and sends the call originator identifier (caller ID) to the voice and visual dialogue server S3 for registration. The voice dialogue server S4 sends voice guidance to the voice terminal M4 that prompts the user to input the one-time password. If the voice terminal M4 is provided with buttons as in, for example, a telephone terminal, then it is possible to adopt a configuration wherein the one-time password can be received by DTMF (dual-tone multi frequency). The voice dialogue server S4 sends the one-time password input at the voice terminal M4 to the voice and visual dialogue server S3. At the voice and visual dialogue server S3, if the one-time password generated for the session with the display terminal M3 and the one-time password sent from the voice dialogue server S4 match, then the corresponding user identifier (user ID) and call originator identifier (caller ID) are associated and registered.
Subsequently, synchronizing the content sent to the display terminal M3 and the voice terminal M4 in accordance with the dialogue scenario makes it possible to provide information and services with multiple linked modalities.
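The association procedure of method (2) can be sketched in the same manner. Again, this is only an illustration under stated assumptions: the class name OneTimePasswordLinker, its method names, the six-digit password length, and the identifier formats are hypothetical, and one in-memory object stands in for the voice and visual dialogue server S3.

```python
import secrets


class OneTimePasswordLinker:
    """Hypothetical in-memory stand-in for the voice and visual dialogue server S3."""

    def __init__(self):
        self._user_by_password = {}  # pending one-time password -> user ID
        self._caller_id_by_user = {}  # user ID -> caller ID

    def open_display_session(self):
        """Invoked when the display terminal M3 connects.

        Returns (user ID, one-time password); both are sent back to M3.
        """
        user_id = "user-" + secrets.token_hex(4)
        password = f"{secrets.randbelow(10**6):06d}"  # six digits (assumed length)
        while password in self._user_by_password:     # avoid reusing a pending value
            password = f"{secrets.randbelow(10**6):06d}"
        self._user_by_password[password] = user_id
        return user_id, password

    def submit_password(self, caller_id, entered_password):
        """Invoked on behalf of S4 with the digits keyed in at the voice terminal M4.

        Returns True and records the association if the password matches a
        pending session; a password is consumed on its first successful use.
        """
        user_id = self._user_by_password.pop(entered_password, None)
        if user_id is None:
            return False
        self._caller_id_by_user[user_id] = caller_id
        return True

    def linked_caller(self, user_id):
        """Return the caller ID associated with a user ID, if any."""
        return self._caller_id_by_user.get(user_id)


if __name__ == "__main__":
    s3 = OneTimePasswordLinker()
    user_id, password = s3.open_display_session()
    linked = s3.submit_password("caller-1", password)
    print(user_id, "linked:", linked)
```

As with method (1), the sketch makes the dependence on manual input visible: the association succeeds only if the user keys in the exact password shown on the display terminal.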
With the two methods discussed above, it is necessary to manually input, for example, the call originator number and the one-time password, which is troublesome and also carries the risk of input error.
In addition, if multiple modalities are associated by performing “(1) Authentication Using a Call Originator Number,” then it is necessary to input the call originator number at the display terminal M2. With the example discussed above, it is assumed that the display terminal M2 is provided with an input unit that uses a keyboard or a pointing device such as a mouse; however, if a modality is not provided with such an input device, then there is a problem in that it is not possible to authenticate the modality as one that is being handled by the same user.
For example, consider a system, such as a maintenance and management system or a business management system, in which the object of management is on-site state information acquired from image data taken by an on-site video camera or from any of a variety of sensors, and in which the content to be sent is selected on the server side using such image data, state information, and the like. In such a system, it is necessary to consider including in the plurality of modalities, in addition to the abovementioned display and voice interfaces, other interfaces such as a video camera for acquiring video data and a sensor interface for acquiring a variety of on-site state information. In many cases, such a video camera, sensor interface, or the like does not comprise an input device for inputting data, and there is consequently a risk that it will not be possible to associate these interfaces with the other modalities.
Likewise, if multiple modalities are associated by performing “(2) Authentication Using a One-time Password,” then it is necessary to input the one-time password at the voice terminal M4; however, with a modality that is not provided with a means capable of manual input, such as an input button, it is impossible to input the one-time password using DTMF, which makes it impossible to authenticate the modality as one that is being handled by the same user. In addition, even if the voice dialogue server S4 side is provided with a voice recognition function, it is necessary to collect a voice sample from the user in advance and to perform voice recognition based thereon. There are therefore problems in that it is difficult to apply voice recognition to the process of authenticating the terminal at the time that a connection is being made, and in that it is of course impossible to employ voice recognition in a modality that is not provided with a voice input/output function.
It is an object of the present invention to provide a method and a system that can easily associate multiple, different modalities and that provide information and services simply and rapidly by simultaneously using multiple, different modalities.