1. Field of the Invention
The present invention relates to a sound signal recognition system for executing a recognition process of a sound signal inputted by a user, and a dialog control system using the sound signal recognition system. In particular, the present invention relates to a sound signal recognition system capable of recognizing an input sound signal correctly in each of the following cases: (1) the input sound signal contains only a voice signal of a user; (2) the input sound signal contains only a dual-tone multi-frequency (DTMF) signal that is inputted as a sound signal from a touch-tone telephone system (push phone telephone system); and (3) the input sound signal is a sound signal in which both a voice signal section and a DTMF signal section are mixed. The present invention also relates to a dialog control system for controlling a dialog flow with a user on the basis of a recognition result of the sound signal recognition system.
2. Description of the Related Art
As a human interface with a computer, speech input by a user's voice has become important. In a conventional speech recognition system, a voice signal of a user is subjected to speech recognition, and the recognized data is passed to a computer as input data from the user. For example, such systems have come into use for voice operation of personal computer applications and for voice input of text data.
Furthermore, sound signal input using a DTMF signal is also widely used, for example in telephone voice guidance systems. A user is connected to a computer from a touch-tone telephone via a telephone line. For example, the user listens to audio guidance provided by the computer as speech data over the telephone line, then selects and presses number buttons of the touch-tone telephone in accordance with the guidance to input data to the computer. The sound signal generated by pressing the number buttons of the touch-tone telephone is referred to as a DTMF signal. The conventional DTMF signal recognition system recognizes the DTMF sound signal and passes the recognized data to the computer as input data from the user.
More specifically, a DTMF signal is generated by pressing a button of the touch-tone telephone system and is a composite of two fundamental frequencies. FIG. 17 is a diagram showing one example of a DTMF frequency table. In this example, 16 symbols in total are allocated: the numbers “0” to “9”, the letters “A” to “D”, and the marks “#” and “*”. For example, the two fundamental frequencies 697 Hz and 1209 Hz are allocated to the number “1”, and when the number button “1” of a touch-tone telephone is pressed, a composite sound signal in which the 697 Hz component is merged with the 1209 Hz component is generated. This composite sound signal is the DTMF signal corresponding to the number “1”.
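The two-frequency composition described above can be illustrated with a minimal Python sketch. FIG. 17 is not reproduced in this section, so the table below assumes the standard DTMF keypad assignment, which agrees with the 697 Hz and 1209 Hz pair given for “1”. The names DTMF_FREQS and synthesize_dtmf, the 8000 Hz sampling rate, and the 0.1 s duration are hypothetical choices made only for illustration.

```python
import math

# Standard DTMF assignment (assumed to match FIG. 17, which is not shown here).
LOW_HZ = (697, 770, 852, 941)        # one low-group frequency per keypad row
HIGH_HZ = (1209, 1336, 1477, 1633)   # one high-group frequency per keypad column
KEYPAD = ("123A", "456B", "789C", "*0#D")

# Map each of the 16 symbols to its (low, high) frequency pair.
DTMF_FREQS = {
    key: (LOW_HZ[r], HIGH_HZ[c])
    for r, row in enumerate(KEYPAD)
    for c, key in enumerate(row)
}


def synthesize_dtmf(key, duration_s=0.1, sample_rate=8000):
    """Return samples of the composite two-tone signal for one key.

    The sampling rate and duration are illustrative assumptions.
    """
    low, high = DTMF_FREQS[key]
    n = round(duration_s * sample_rate)
    return [
        0.5 * math.sin(2 * math.pi * low * t / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * t / sample_rate)
        for t in range(n)
    ]


if __name__ == "__main__":
    print(DTMF_FREQS["1"])            # frequency pair allocated to "1"
    print(len(synthesize_dtmf("1")))  # number of generated samples
```

Each sine component is scaled by 0.5 so that the summed signal stays within the range -1.0 to 1.0.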
In general, when a recognition process of a voice signal is compared with that of a DTMF signal, the latter has advantages such as a higher recognition rate and a smaller processing load; however, a DTMF signal can express only a small set of data values. Therefore, in order to input complicated data (for example, the name of a user) that cannot be handled with DTMF signals alone, input by a DTMF signal and speech input by a user's voice may be switched depending upon the application.
In the conventional telephone audio response system, when sound signal input by a DTMF signal is used together with speech input by a user's voice, the two input systems must be switched. It is therefore not possible to execute a recognition process on a sound signal in which both a DTMF signal section and a voice signal section are mixed.
FIG. 18 is a simplified diagram showing a conventional exemplary configuration of a telephone audio response system in which input by a DTMF signal is used together with input by a voice signal of a user.
In FIG. 18, 500 denotes a sound signal input part, 510 denotes a switching part, 520 denotes a voice signal recognizing part, and 530 denotes a DTMF signal recognizing part.
The sound signal input part 500 receives a sound signal inputted from outside. For example, the sound signal input part 500 receives a sound signal inputted by a user via a telephone line.
The switching part 510 switches the transmission destination of the sound signal inputted from the sound signal input part 500 so as to pass the sound signal either to the voice signal recognizing part 520 or to the DTMF signal recognizing part 530. The switching is controlled, for example, by switching the transmission destination to the other recognizing part when a specific DTMF signal, such as one representing a command for switching the input mode, is detected in the sound signal inputted via the sound signal input part 500.
The voice signal recognizing part 520 executes voice recognition of an input voice signal.
The DTMF signal recognizing part 530 executes recognition of an input DTMF signal.
As described above, according to the conventional configuration, the voice signal recognizing part 520 and the DTMF signal recognizing part 530 are provided independently of each other and execute a recognition process independently. In other words, the recognition process is performed using the DTMF signal recognizing part 530 in an input mode by a DTMF signal and using the voice signal recognizing part 520 in an input mode by a voice.
There is also a conventional configuration in which the voice signal recognizing part 520 and the DTMF signal recognizing part 530 are formed as one unit. In this configuration, the switching part 510 is included inside the unit, and a recognition process is conducted using only one of the voice signal recognizing part 520 and the DTMF signal recognizing part 530 at a time while switching between them. Thus, this configuration is essentially the same as that shown in FIG. 18. According to the aforementioned conventional configurations, recognizing the sound signal yields either a recognition result of the voice signal or a recognition result of the DTMF signal, but not both.
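The mode-switched routing of FIG. 18 can be sketched as follows. This is only an illustrative model, not the implementation of any actual system: the class SwitchingPart, the stub recognizers, the representation of a signal section as a (kind, payload) pair, and the use of “#” as the mode-switching command are all assumptions introduced here.

```python
def recognize_voice(section):
    """Stub for the voice signal recognizing part 520 (hypothetical)."""
    kind, payload = section
    return payload if kind == "voice" else None  # None: recognition failure


def recognize_dtmf(section):
    """Stub for the DTMF signal recognizing part 530 (hypothetical)."""
    kind, payload = section
    return payload if kind == "dtmf" else None


class SwitchingPart:
    """Hypothetical model of switching part 510: one recognizer at a time."""

    def __init__(self, switch_key="#", mode="dtmf"):
        self.switch_key = switch_key  # assumed mode-switching command
        self.mode = mode

    def route(self, section):
        # A specific DTMF signal toggles the input mode and yields no data.
        if section == ("dtmf", self.switch_key):
            self.mode = "voice" if self.mode == "dtmf" else "dtmf"
            return None
        recognizer = recognize_dtmf if self.mode == "dtmf" else recognize_voice
        return recognizer(section)


if __name__ == "__main__":
    mixed = [("voice", "the registration number is"),
             ("dtmf", "1"), ("dtmf", "2"), ("dtmf", "3"), ("dtmf", "4")]
    part = SwitchingPart()
    print([part.route(s) for s in mixed])  # voice section is lost in DTMF mode
    part.route(("dtmf", "#"))              # switch to voice mode
    print([part.route(s) for s in mixed])  # DTMF sections are lost in voice mode
```

In either mode, the sections that do not match the active recognizer come back unrecognized, so a mixed input is never recovered in full.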
Therefore, the conventional telephone audio response system has the following problems.
First, the user needs to switch between input by a voice signal and input by a DTMF signal, and this switching operation places an additional burden on the user. Furthermore, there are cases where the user is not sure in which mode input is to be done and becomes confused.
Second, when the telephone audio response system receives a sound signal in an input mode other than the one it expects, the recognition rate drops, and in some cases recognition becomes impossible. For example, in the case where the telephone audio response system expects to perform sound signal recognition using the DTMF signal recognizing part 530, when the user conducts input by voice, this voice signal cannot be recognized by the DTMF signal recognizing part 530.
Third, since the conventional system cannot recognize a sound signal in which a voice signal section and a DTMF signal section are mixed, it lacks convenience for the user. For example, when the data “the registration number is 1234” is to be inputted as a sound signal, it would be convenient to input a sound signal in which a voice signal section is mixed with a DTMF signal section as follows: the beginning part, “the registration number is”, is inputted by voice, and the numeric part “1234” is then inputted by pressing buttons of the touch-tone telephone, which generates DTMF signals indicating “1”, “2”, “3”, and “4”. The conventional telephone audio response system cannot accept such mixed input.
Fourth, the design of the telephone audio response system is complicated, which increases the man-hours required and raises the cost. In other words, the conventional telephone audio response system requires guidance for correctly leading the user to the expected input mode, so that the algorithm of the dialog flow becomes complicated, and the more complicated design process increases the cost.