1. Field of the Invention
The present invention relates to a speech input and detection technique that is not affected by noise occurring in a noisy environment or in a situation where many people speak simultaneously. The invention also relates to a speech detection apparatus that outputs speech information, detected from movements of a human articulator, to information equipment such as a computer or a word processor.
The invention relates to a technique of enabling detection of speech information in both cases of voiced speech and voiceless speech by mimicry. Therefore, the technique of the invention can be utilized not only in offices or the like where silence is required and the use of related speech input techniques is not suitable, but also for input of a content that the user does not want to be heard by other people. As such, the invention greatly increases the range of use of speech detection apparatus. Further, the invention can be utilized for a speech detection apparatus for providing barrier-free equipment that enables deaf people, people having difficulty in hearing, and aged people to communicate information smoothly.
2. Description of the Related Art
The target of a speech detection apparatus (machine) is to enable the user's speech to be input correctly and quickly in any environment. An ordinary speech detection apparatus employs a speech recognition technique of recognizing and processing speech information by analyzing frequencies of a voice as a sound. To this end, the cepstrum analysis method or the like is utilized, which enables separation and extraction of a spectrum envelope or a spectrum fine structure of a voice. However, this speech recognition technique has a disadvantage in principle: it cannot detect speech information unless it receives sound information generated by vocalization. That is, such a speech detection apparatus cannot be used in offices, libraries, etc. where silence is required, because during speech input the voice of a speaker is annoying to nearby people. This type of speech detection apparatus is also not suitable for input of content that the user does not want to be heard by nearby people. Further, the user tends to feel psychologically reluctant to murmur alone to a machine. This tendency is stronger when other people are around the user. These disadvantages limit the range of use of speech recognition apparatus and are major factors obstructing the spread of speech input apparatus. Another obstructive factor is that continuing to speak is an unexpectedly heavy physical burden. It is considered that continuing voice input for hours, in the way a keyboard is used, will make the user's voice hoarse and hurt the vocal cords.
On the other hand, studies of acquiring speech information from information other than sound information have been made conventionally. The vocal organs directly relating to vocalization of a human are the lungs 901 as an air flow mechanism, the larynx 902 as a vocalization mechanism, the oral cavity 903 and the nasal cavity 904 that assume the mouth/nasal cavity function, and the lips 905 that assume the articulation function, though the classification method varies from one technical book to another. FIG. 9 shows the arrangement of those organs (the lungs 901 are not shown). Studies of acquiring speech information from visual information of the lips 905 among these vocal organs have been made to provide techniques for people handicapped in hearing. It was pointed out that the speech recognition accuracy can be improved by adding visual information of movements of the lips 905 of a speaker to a speech recognition technique (C. Bregler, H. Hild, S. Manke, and A. Waibel, "Improving Connected Letter Recognition by Lipreading", Proc. IEEE ICASSP, pp. 557-560, 1993).
Among speech recognition techniques using visual information of the lips, a technique with image processing that uses an image that is input from a video camera is employed most frequently. For example, in Japanese Unexamined Patent Publication No. Hei. 6-43897, as shown in FIG. 10, it was attempted to observe movements of the lips by capturing images of 10 reflective markers M0 to M9 themselves that were attached to the lips 905 of a speaker and a portion around them, detecting two-dimensional movements of the markers M0 to M9, and determining five lip feature vector components 801-805. In Japanese Unexamined Patent Publication No. Sho. 52-112205, it was intended to improve the accuracy of speech recognition by reading the positions of black markers attached to the lips and a portion around them from scanning lines of a video camera. This publication does not have any specific disclosure as to a marker extraction method. Two-dimensional image preprocessing and a feature extraction technique are needed to discriminate the markers from density differences that are caused by shades formed by the nose and the lips, a mustache, skin color differences, a mole, a scratch or abrasion, etc.
To solve this problem, Japanese Unexamined Patent Publication No. Sho. 60-3793 proposed a lip information analyzing apparatus in which four high-luminance markers such as light-emitting diodes are attached to the lips to facilitate the marker position detection, movements of the markers themselves are imaged by a video camera, and pattern recognition is performed on a voltage waveform that is obtained by a position sensor called a high-speed multi-point X-Y tracker. However, even with this technique, when it is attempted to detect speech in a bright room, means is needed to prevent noise that is caused by high-luminance reflection light components coming from the glasses, a gold tooth, etc. of a speaker. Although preprocessing and a feature extraction technique for a two-dimensional image that is input from a television camera are needed for this purpose, the publication No. Sho. 60-3793 has no disclosure as to such a technique.
Several methods have been proposed in which features of a vocal organ are extracted by capturing an image of the lips and a portion around them directly without using markers and performing image processing on the image. For example, in Japanese Unexamined Patent Publication No. Hei. 6-12483, an image of the lips and a portion around them is captured by a camera and vocalized words are estimated by a back propagation method from an outline image obtained by image processing. Japanese Unexamined Patent Publication No. Sho. 62-239231 proposed a technique of using a lip opening area and a lip aspect ratio to simplify lip image information. Japanese Unexamined Patent Publication No. Hei. 3-40177 discloses a speech recognition apparatus retaining, as a database, correlation between vocalized sounds and lip movements to perform recognition for indefinite speakers. Japanese Unexamined Patent Publication No. Hei. 9-325793 proposed to lower the load on a speech recognition computer by decreasing the number of candidate words based on speech-period mouth shape information that is obtained from an image of the mouth of a speaker. However, since these related methods utilize positional information obtained from a two-dimensional image of the lips and a portion around them, a speaker is required to open and close his lips clearly for correct input of image information. It is difficult to detect movements of the lips and a portion around them in speech with a small degree of lip opening/closure and no voice output (hereinafter referred to as "voiceless speech") and in speech with a small voice, let alone in speech with almost no lip movements as in the case of ventriloquism. Further, the above-cited references do not refer to any speech detection technique that improves the recognition rate by utilizing speech modes, such as a voiceless speech mode, with attention paid to the differences between an ordinary speech mode and other modes.
The "speech mode" indicating a speech state will be described in detail in the "Summary of the Invention" section.
Several methods have been proposed that do not use a video camera, such as a technique of extracting speech information from a myoelectric potential waveform of the lips and a portion around them. For example, Japanese Unexamined Patent Publication No. Hei. 6-12483 discloses an apparatus that utilizes binary information of a myoelectric potential waveform to provide means that replaces image processing. Kurita, et al. invented a model for calculating a lip shape based on a myoelectric signal ("Physiological Model for Realizing an Articulation Operation of the Lips", The Journal of the Acoustical Society of Japan, Vol. 50, No. 6, pp. 465-473, 1994). However, the speech information extraction using myoelectric potentials has a problem that a heavy load is imposed on a speaker because electrodes having measurement cords need to be attached to the lips and a portion around them.
Several inventions have been made in which tongue movements associated with speech of a speaker are detected by mounting an artificial palate to obtain a palatograph signal and a detection result is used in a speech detection apparatus. For example, Japanese Unexamined Patent Publication No. Sho. 55-121499 proposed means for converting presence/absence of contacts between the tongue and transmission electrodes that are incorporated in an artificial palate to an electrical signal. Japanese Unexamined Patent Publication No. Sho. 57-60440 devised a method of improving the touch of the tongue by decreasing the number of electrodes incorporated in an artificial palate. Japanese Unexamined Patent Publication No. Hei. 4-257900 made it possible to deal with indefinite speakers by causing a palatograph photodetection signal to pass through a neural network.
An apparatus that does not utilize tongue movements was proposed in Japanese Unexamined Patent Publication No. Sho. 64-62123 in which vibration of the soft palate is observed by bringing the tip portion of a bush rod into contact with the soft palate. Further, a study was made as to the relationship between the articulator shape and speech by mounting a plurality of metal pellets on a vocal organ, in particular the tongue in the oral cavity, and using an X-ray micro-beam instrument that measures the positions of the metal pellets (Takeshi Token, Kiyoshi Honda, and Yoichi Higashikura, "3-D Observation of Tongue Articulatory Movement for Chinese Vowels", Technical Report of IEICE, SP97-11, 1997-06). A similar study was made to investigate the relationship between the articulatory movement locus and speech by mounting magnetic sensors on a vocal organ in the oral cavity and using a magnetic sensor system that measures the position of the magnetic sensors (Tsuyoshi Okadome, Tokihiko Kaburagi, Shin Suzuki, and Masahiko Honda, "From Text to Articulatory Movement," Acoustical Society of Japan 1998 Spring Research Presentation Conference, Presentation no. 3-7-10, March 1998). However, these techniques have problems that natural vocalization action may be obstructed and a heavy load is imposed on a speaker because devices need to be attached to an inside part of a human body. These references do not refer to any speech detection technique either that utilizes, to improve the recognition rate, speech modes such as a voiceless speech mode paying attention to differences between an ordinary speech mode and other ones.
U.S. Pat. No. 3,192,321 proposed, as a technique for detecting speech information more easily than the above techniques, a speech recognition system that is a combination of a speech recognition technique and a technique of directly applying a light beam to the lips and an integument portion around them and detecting speech based on the state of diffused reflection light coming from the skin and the way the lips interrupt the light beam. Japanese Unexamined Patent Publication No. Hei. 7-306692 proposed a similar technique in which speech information of a speaker is detected by applying a light beam to the lips and a portion around them, detecting diffused reflection light coming from the surface of the integument with a photodetector, and measuring an intensity variation of the diffused reflection light. However, neither reflection plates such as markers nor specular reflection plates are attached to the lips and a portion around them. Since the relationship between the intensity of reflection light and positions and movements of the lips is not necessarily clear, a neural network is used for a recognition process. As described in the specification, being low in speech detection accuracy, this technique is for roughly categorizing phonemes as an auxiliary means of a speech recognition technique. Japanese Unexamined Patent Publication No. Hei. 8-187368 discloses, as an example of use of this technique, a game that involves limited situations and in which conversations are expected to occur. Japanese Unexamined Patent Publication No. Hei. 10-11089 proposed a technique of detecting speech by measuring the blood amount in the lips and a portion around them by a similar method in which the detector is limited to an infrared detecting device. 
These techniques are effective only for speech with large movements of the lips and a portion around them, and are difficult to apply to input of voiceless or small voice speech in which the degree of opening/closure of the lips is small. The specifications do not refer to speech modes such as a voiceless speech mode.
As for the above-described related techniques that are intended to detect speech from the shape of an articulator, methods and apparatus for correlating speech and a certain kind of signal that is obtained from the articulator are described in detail. However, the above-cited references do not refer to, in a specific manner, voiceless speech nor relationships between speech and signals associated with different speech modes. Further, there is no related reference that clearly shows problems that are caused by speech mode differences and countermeasures. Although there exists a related reference that refers to speech without voice output (Japanese Unexamined Patent Publication No. Hei. 6-12483), it does not describe the handling of speech modes that are most important for improvement of the recognition rate.
Problems to be solved by the speech input technique of the invention are as follows. These problems cannot be solved by the related speech recognition techniques in terms of the principle and have not been dealt with in a specific manner by related techniques that are intended to detect speech from shape information of an articulator.
(1) A speech detection apparatus cannot be used in offices, libraries, etc. where silence is required, because during speech input a voice of a speaker is annoying to nearby people.
(2) Related techniques are not suitable for input of a content that a speaker does not want to be heard by nearby people.
(3) There is psychological reluctance to speaking alone to a machine.
(4) A speaker who continues to speak with voice output has a physical load.
To solve the above problems, it is necessary to enable speech detection in a voiceless speech mode with entirely no voice output as well as in a speech mode with voice output (hereinafter referred to as a voiced speech mode). If this becomes possible, the problems (1) to (3) are solved because no voice is output to the environment in the voiceless speech mode in which there is almost no respiratory air flow and the vocal cords do not vibrate. Further, improvement is made of the problem (4) because voiceless speech requires only small degrees of mouth opening and closure and does not cause vibration of the vocal cords, reducing the physical load accordingly. Speech modes used in the invention are classified in FIG. 3.
It has been described above that the related techniques do not deal, in a specific manner, with voiceless speech or with speech modes in general. Naturally, as for related speech input techniques, studies have not been made of speech modes of voiceless speech, a whisper, and a small voice. On the other hand, in techniques of detecting speech from the shape of an articulator, it has become clear through experiments that the speech mode is an extremely important concept. In particular, it has turned out that, even for speech of the same phoneme or syllable, a signal obtained from the shape of an articulator varies with the speech mode (a voiceless speech mode, a small voice speech mode, an ordinary speech mode, or a loud voice speech mode), and the recognition rate of phonemes and syllables may greatly decrease unless sufficient care is taken of the speech mode. An object of the present invention is to solve the problem of reduction in recognition rate caused by speech mode differences, which has not been addressed by the related techniques, and particularly to increase the recognition rate of voiceless speech, which has not been discussed seriously in speech input techniques. To this end, the invention employs the following features.
To increase the rate of speech recognition based on input shape information of an articulator,
(1) at least one standard pattern is given to each speech mode;
(2) there is provided means for inputting, to a speech detection apparatus, information of a speech mode of a speech input attempt; and
(3) a standard pattern corresponding to input speech mode information is selected and then input speech is detected by executing a recognition process.
The above-mentioned problems can be solved if the speech modes include a voiceless speech mode. Naturally it is necessary to accept input of speech with voice output, and a speech recognition apparatus is required to switch among standard patterns in accordance with the speech mode.
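Features (1) to (3) above can be illustrated with a minimal Python sketch. It assumes that articulator shape data has already been reduced to a short feature vector and that recognition is a nearest-pattern match by Euclidean distance; neither assumption comes from the specification. The names STANDARD_PATTERNS and detect_speech and all numeric values are hypothetical.

```python
import math

# Hypothetical standard patterns: one set per speech mode, each mapping a
# syllable label to a feature vector. The values are invented for
# illustration and are not measured data.
STANDARD_PATTERNS = {
    "voiceless": {"a": [0.2, 0.1], "i": [0.1, 0.3]},
    "voiced": {"a": [0.8, 0.4], "i": [0.3, 0.9]},
}


def detect_speech(input_data, speech_mode, patterns=STANDARD_PATTERNS):
    """Select the standard patterns for the input speech mode, then return
    the label of the pattern closest to the input data."""
    if speech_mode not in patterns:
        raise ValueError(f"no standard pattern for speech mode {speech_mode!r}")
    candidates = patterns[speech_mode]
    # Assumed recognition process: nearest pattern by Euclidean distance.
    return min(
        candidates,
        key=lambda label: math.dist(input_data, candidates[label]),
    )
```

In this sketch the speech mode is supplied by the caller, which corresponds to feature (2); a real recognition process would likely compare time series rather than single vectors.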
The invention will be described below in more detail.
To solve the above-mentioned problems, the invention provides a speech detection apparatus comprising an articulator shape input section 101 (refer to FIG. 1 for reference numerals, hereinafter the same) for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section 102 for allowing input of a speech mode of the speaker; and a speech detection section 103 for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with one kind of standard pattern that is prepared in advance, wherein a speech detection process is executed when the speech mode that is input through the speech mode input section 102 coincides with a speech mode of the one kind of standard pattern.
In this configuration, speech detection is performed only in a prescribed speech mode and hence speech detection is performed in such a manner as to be suitable for the situation. In particular, if setting is so made that detection is performed only in a voiceless speech mode, the speech detection apparatus is advantageous for use in offices and in terms of the load that is imposed on a user.
Speech detection that is most suitable for each situation can be performed by preparing plural kinds of standard patterns and switching the detection mode in accordance with the speech mode. In this case, the plural kinds of standard patterns may include standard patterns of a voiceless speech mode, a voiced speech mode, and an unvoiced speech mode. Alternatively, the plural kinds of standard patterns may include standard patterns of a voiceless speech mode and a voiced speech mode.
The speech mode may be determined based on the volume and the noise level of speech of a speaker. In this case, the noise level measurement time may be set at a short period t0 or the noise level may be an average noise level over a long period. Or the noise level may be determined by combining the above two methods.
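As a rough illustration of the two noise measurements and their combination, the following Python sketch averages the most recent samples (standing in for the short period t0) and all samples (standing in for the long period), and then blends the two. The text above does not state the rule that maps volume and noise level to a speech mode, so the rule in classify_speech_mode, the weighting, and every threshold are hypothetical assumptions.

```python
from statistics import fmean


def noise_level(samples_db, t0_samples, weight_short=0.5):
    """Blend a short-period noise level with a long-period average.

    samples_db: noise levels in dB, oldest first.
    t0_samples: number of most recent samples that represent the period t0.
    weight_short: assumed blending weight (not from the specification).
    """
    if not samples_db or t0_samples < 1:
        raise ValueError("need at least one sample and t0_samples >= 1")
    short_level = fmean(samples_db[-t0_samples:])
    long_level = fmean(samples_db)
    return weight_short * short_level + (1.0 - weight_short) * long_level


def classify_speech_mode(volume_db, noise_db,
                         margin_db=3.0, small_db=50.0, loud_db=70.0):
    """Assumed rule: a volume that does not rise above the noise floor by
    margin_db is treated as voiceless; otherwise fixed thresholds apply."""
    if volume_db <= noise_db + margin_db:
        return "voiceless"
    if volume_db < small_db:
        return "small voice"
    if volume_db < loud_db:
        return "ordinary"
    return "loud voice"
```

Setting weight_short to 1.0 or 0.0 reduces the blend to the short-period or the long-period measurement alone.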
Where plural kinds of standard patterns corresponding to a plurality of speech modes, respectively, are prepared, speech detection may be performed by selecting two or more kinds of speech modes and using two or more kinds of standard patterns corresponding to the selected speech modes.
In this case, one kind of speech mode may be selected based on a noise level measured in a short period t0 and another kind of speech mode may be selected based on an average noise level that is measured in a long period. (There may occur a case that one kind of speech mode is selected in a duplicated manner.)
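The selection of two speech modes can be sketched as follows. One mode is chosen from the short-period noise level and another from the long-period average, duplicates are merged, and the input is matched against the standard patterns of every selected mode. The table NOISE_TO_MODE, the function names, and the distance-based matching are hypothetical; the specification does not give the mapping from noise level to speech mode in this section.

```python
import math
from statistics import fmean

# Hypothetical table: (upper noise bound in dB, speech mode assumed below it).
NOISE_TO_MODE = [
    (50.0, "small voice"),
    (65.0, "ordinary"),
    (math.inf, "loud voice"),
]


def mode_for_noise(noise_db, table=NOISE_TO_MODE):
    """Return the first mode whose upper bound exceeds the noise level."""
    for upper_bound, mode in table:
        if noise_db < upper_bound:
            return mode
    return table[-1][1]


def select_modes(noise_samples_db, t0_samples):
    """Pick one mode from the short period t0 and one from the long-period
    average; a mode selected twice is kept only once."""
    short_level = fmean(noise_samples_db[-t0_samples:])
    long_level = fmean(noise_samples_db)
    modes = [mode_for_noise(short_level), mode_for_noise(long_level)]
    return list(dict.fromkeys(modes))  # remove duplicates, keep order


def detect_with_modes(input_data, modes, patterns):
    """Match the input against the standard patterns of all selected modes
    and return the best (mode, label) pair."""
    best = None
    for mode in modes:
        for label, pattern in patterns[mode].items():
            distance = math.dist(input_data, pattern)
            if best is None or distance < best[0]:
                best = (distance, mode, label)
    if best is None:
        raise ValueError("no standard patterns for the selected modes")
    return best[1], best[2]
```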
There may be used standard patterns of a plurality of voiced speech modes that are featured in loudness, pitch, or length of voice.
The function to be performed in connection with input speech data is switched in accordance with the speech mode that is input through the speech mode input section. For example, the functions corresponding to the respective speech modes are a function of allowing input of coded text information, a function of giving an instruction relating to a particular operation, and a function of stopping input. Further, switching may be made automatically, in accordance with the speech mode, among plural kinds of application software.
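The switching of functions by speech mode can be pictured as a simple dispatch table. The Python sketch below is only an illustration: which speech mode triggers which function is an assumption (the text lists the functions but does not fix the assignment), and MODE_TO_FUNCTION and make_mode_dispatcher are hypothetical names.

```python
# Hypothetical assignment of functions to speech modes.
MODE_TO_FUNCTION = {
    "voiceless": "input_text",   # input of coded text information
    "ordinary": "command",       # instruction for a particular operation
    "loud voice": "stop_input",  # stop accepting input
}


def make_mode_dispatcher(mode_to_function):
    """Return a dispatch function that routes each detected word to the
    function assigned to the current speech mode."""
    state = {"text": [], "commands": [], "accepting": True}

    def dispatch(speech_mode, detected_word):
        if speech_mode not in mode_to_function:
            raise ValueError(f"unknown speech mode {speech_mode!r}")
        if not state["accepting"]:
            return state  # input has been stopped; ignore further speech
        function = mode_to_function[speech_mode]
        if function == "input_text":
            state["text"].append(detected_word)
        elif function == "command":
            state["commands"].append(detected_word)
        elif function == "stop_input":
            state["accepting"] = False
        else:
            raise ValueError(f"unknown function {function!r}")
        return state

    return dispatch
```

Switching among plural kinds of application software could follow the same pattern, with each table entry naming an application instead of a function.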
According to another aspect of the invention, to solve the above-mentioned problems, there is provided a speech detection apparatus comprising an articulator shape input section 101 for generating input data by measuring a movement of an articulator that occurs when a speaker makes speech from at least part of the articulator and an integument around the articulator; a speech mode input section 102 for allowing input of a speech mode of the speaker; and a speech detection section 103 for detecting the speech by comparing the input data generated by the articulator shape input section based on the speech of the speaker with a standard pattern for voiceless speech that is prepared in advance.
With this configuration, speech detection can be performed without the speaker emitting a noise and without imposing an undue load on the speaker. In particular, speech can be detected with high accuracy in the voiceless speech mode because the shape variation of an articulator that is caused by speech is restricted and hence the deviation of the shape is small.
The manner of measuring features of speech is not limited to the case of using movements of an articulator and they can be measured in other various manners.