1. Field of the Invention
The present invention relates to a speech recognition system and method for recognizing a speech signal, a speech synthesis system and method for synthesizing a speech signal in accordance with the speech recognition, and a program product for use therein.
2. Description of the Related Art
A conventional speech-detecting device adopts a speech recognition technique that recognizes and processes a speech signal by analyzing the frequencies contained in a vocalized sound signal. This speech recognition technique is implemented using a spectral envelope or the like.
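For illustration only, the frequency analysis mentioned above can be sketched as follows. This is not the method of any particular conventional device: it computes a plain DFT magnitude spectrum and then approximates a spectral envelope by a moving average over the log magnitudes, which is a crude stand-in for techniques such as linear prediction or cepstral smoothing. The function names, the frame length, and the smoothing width are all assumptions chosen for the example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def smooth_envelope(mags, half_width=2):
    """Moving average over log magnitudes (hypothetical, simplified envelope)."""
    # A small offset avoids log(0) for empty bins.
    logs = [math.log(m + 1e-12) for m in mags]
    env = []
    for k in range(len(logs)):
        lo = max(0, k - half_width)
        hi = min(len(logs), k + half_width + 1)
        env.append(sum(logs[lo:hi]) / (hi - lo))
    return env


if __name__ == "__main__":
    # Assumed test frame: 64 samples of a sinusoid centred on DFT bin 4.
    frame = [math.sin(2 * math.pi * 4 * i / 64) for i in range(64)]
    mags = magnitude_spectrum(frame)
    env = smooth_envelope(mags)
    print("strongest bin:", mags.index(max(mags)))
    print("envelope length:", len(env))
```

A practical recognizer would use a fast Fourier transform and a more principled envelope estimate; the sketch only shows what "analyzing the frequencies" of a frame amounts to.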
However, the conventional speech-detecting device cannot detect a speech signal unless a vocalized sound signal is input to it. Further, the sound must be vocalized at a certain volume in order to obtain a good speech-detecting result with this speech recognition technique.
Therefore, the conventional speech-detecting device cannot be used where silence is required, for example in an office, a library, or a public institution, where a speaker may inconvenience the people nearby. The conventional speech-detecting device also suffers from cross-talk, and its speech-detecting performance is reduced in a high-noise environment.
On the other hand, techniques for acquiring a speech signal from information other than the sound signal have conventionally been researched. Such techniques make it possible to acquire a speech signal without a vocalized sound signal, so that the above problems can be solved.
As a method for recognizing a speech signal based on visual information of the lips, image processing of image information input from a video camera is known.
Further, research has been conducted on a technique for recognizing the type of a vocalized vowel by processing an electromyographic (hereinafter, EMG) signal that occurs together with the motion of muscles around (adjacent to) the mouth. This research is disclosed in the technical literature “Noboru Sugie et al., ‘A Speech Prosthesis Employing a Speech Synthesizer: Vowel Discrimination from Perioral Muscle Activities and Vowel Production,’ IEEE Transactions on Biomedical Engineering, Vol. 32, No. 7, pp. 485-490,” which shows a technique for discriminating the five vowels “a, i, u, e, o” by passing the EMG signal through a band-pass filter and counting the number of times the filtered EMG signal crosses a threshold.
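The band-pass filtering and threshold-crossing count described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in the cited literature: the filter is a generic second-order band-pass section in the widely used audio-EQ cookbook form, and the sampling rate, centre frequency, Q value, and threshold are hypothetical values chosen for the example. The mapping from crossing counts to the five vowels is not reproduced here because the source does not give it.

```python
import math


def bandpass_biquad(samples, fs, f_center, q=1.0):
    """Second-order band-pass filter with unity gain at f_center (assumed design)."""
    w0 = 2.0 * math.pi * f_center / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0  # previous two inputs and outputs
    for x in samples:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def count_threshold_crossings(samples, threshold):
    """Count upward crossings: previous sample below, current at or above threshold."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if prev < threshold <= cur:
            count += 1
    return count


if __name__ == "__main__":
    fs = 1000.0  # assumed sampling rate in Hz
    # Synthetic stand-in for one second of an EMG channel: a 100 Hz component.
    emg = [math.sin(2 * math.pi * 100.0 * n / fs) for n in range(1000)]
    filtered = bandpass_biquad(emg, fs, f_center=100.0)
    print("crossings:", count_threshold_crossings(filtered, threshold=0.5))
```

In a scheme of this kind, the crossing count per analysis window would serve as a simple measure of muscle activity for each electrode channel, and the counts from several channels would then be compared to decide the vowel.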
A method for detecting the vowels and consonants of a speaker by processing the EMG signal with a neural network is also known. Further, a multi-modal interface that utilizes information input from a plurality of input channels, rather than from a single input channel, has been proposed and implemented.
On the other hand, a conventional speech synthesis system stores data characterizing the speech signal of a speaker, and synthesizes a speech signal using that data when the speaker vocalizes.
However, the conventional speech detecting method that acquires a speech signal from information other than a sound signal has a lower recognition success rate than the speech detecting method that acquires the speech signal from the sound signal. In particular, it is hard to recognize consonants vocalized by the motion of muscles inside the mouth.
Further, because the conventional speech synthesis system synthesizes the speech signal based on data characterizing the speech signal of a speaker, the synthesized speech signal sounds mechanical, its expression is unnatural, and the emotions of the speaker cannot be expressed appropriately.