This invention relates to a technology used in a field wherein voice information is coded and input to an information machine such as a computer or a word processor, and is particularly appropriate for detecting voice information in a noisy environment or in a conference, etc., where many people talk at the same time. The technology can also be used as a voice input apparatus for providing barrier-free machines enabling smooth information transmission to deaf and speech-impaired persons, hard-of-hearing persons, and aged people.
The voice input apparatus of a machine aims at enabling the user's voice to be input precisely, and moreover at high speed, in any environment. Hitherto, apparatuses for analyzing voice frequency, thereby recognizing and processing speech, have been proposed. However, such speech recognition techniques suffer from degradation of the recognition rate in an environment where noise occurs. To prevent this problem, it is desirable to obtain utterance information from information other than voice. The human vocal organs involved directly in producing voice are the lungs 901 of the air stream mechanism, the larynx 902 of the voice producing mechanism, the oral cavity 903 and nasal cavity 904 taking charge of the oro-nasal process, and the lips 905 and tongue 906 governing the articulation process, as shown in FIG. 9, although the classification varies from one technical document to another. Research on obtaining utterance information from visual information of the lips 905 has been conducted as a technology for hearing-handicapped persons. Further, it has been pointed out that speech recognition accuracy is enhanced by adding visual information of the motion of the lips 905 of the speaker to voice information (C. Bregler, H. Hild, S. Manke and A. Waibel, "Improving connected letter recognition by lipreading," Proc. IEEE ICASSP, pp. 557-560, 1993, etc.).
An image processing technique using images input through a video camera is the most general speech recognition technique based on visual information of the lips. For example, in the Unexamined Japanese Patent Application Publication No. Hei 6-43897, images of ten diffuse reflective markers M0, M1, M2, M3, M4, M5, M6, M7, M8, and M9 attached to the lips 905 of a speaker and the surroundings of the lips are input to a video camera, two-dimensional motion of the markers is detected, five lip feature vector components 101, 102, 103, 104, and 105 are found, and lip motion is observed (FIG. 10). In the Unexamined Japanese Patent Application Publication No. Sho 52-112205, positions of black markers put on the lips and their periphery are read from video camera scanning lines for improving speech recognition accuracy. Although no specific description of a marker extraction method is given, the technique requires a two-dimensional image preprocessing and feature extraction technique for discriminating the markers from density differences caused by shadows produced by the nose and lips, mustaches, beards, whiskers, skin color differences, moles, scars, etc. To solve this problem, in the Unexamined Japanese Patent Application Publication No. Sho 60-3793, a lip information analysis apparatus is proposed which is accomplished by putting four high-brightness markers such as light-emitting diodes on the lips for facilitating marker position detection, photographing the motion of the markers with a video camera, and executing pattern recognition of voltage waveforms provided by a position sensor called a high-speed multipoint X-Y tracker. However, to detect voice in a light room, the technique also requires means for preventing noise from high-brightness reflected light components produced by the spectacles, gold teeth, etc., of a speaker.
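As a rough sketch of the kind of computation involved, the five feature vector components can be thought of as distances between marker pairs measured on the two-dimensional image. The pairing and coordinates below are a hypothetical illustration only; the actual correspondence between markers and components 101-105 is defined in the cited publication.

```python
import math

def distance(p, q):
    """Euclidean distance between two 2-D marker positions (pixels)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def lip_features(markers):
    """Compute five distance-based lip feature components from ten
    2-D marker positions M0..M9 (hypothetical pairing)."""
    m = markers
    return [
        distance(m[0], m[1]),  # vertical lip opening (assumed component 101)
        distance(m[2], m[3]),  # inner vertical distance (assumed 102)
        distance(m[4], m[5]),  # lip width (assumed 103)
        distance(m[6], m[7]),  # outer vertical distance (assumed 104)
        distance(m[8], m[9]),  # outer width (assumed 105)
    ]

# Example: ten marker coordinates in pixels (illustrative values)
markers = [(50, 10), (50, 40), (50, 15), (50, 35),
           (20, 25), (80, 25), (45, 5), (45, 45),
           (10, 25), (90, 25)]
print(lip_features(markers))  # -> [30.0, 20.0, 60.0, 40.0, 80.0]
```

Tracking how these distances change frame by frame is what "observing lip motion" amounts to in such marker-based schemes.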
Thus, the technique requires a preprocessing and feature extraction technique for two-dimensional images input through a television camera, but this is not covered in the Unexamined Japanese Patent Application Publication No. Sho 60-3793. Several apparatuses for inputting the lips and their surroundings directly into a video camera without using markers and performing image processing for feature extraction of the vocal organs have also been proposed. For example, in the Unexamined Japanese Patent Application Publication No. Hei 6-12483, an image of the lips and their surroundings is input into a camera and processed to produce a contour image, and a vocalized word is estimated by a back-propagation method from the contour image. Proposed in the Unexamined Japanese Patent Application Publication No. Sho 62-239231 is a technique for using the lip opening area and the lip aspect ratio for simplifying lip image information. Designed in the Unexamined Japanese Patent Application Publication No. Hei 3-40177 is a speech recognition apparatus which has the correlation between utterance sound and lip motion as a database for recognizing unspecified speakers. However, the conventional methods handle only position information provided from two-dimensional images of the lips and their periphery and are insufficient to determine phonemes having delicate lip angle change information and skin contraction information. The conventional two-dimensional image processing methods must handle large amounts of information to extract markers and features, and thus are not appropriate for speeding up.
Several methods without using a video camera have been proposed; techniques of extracting utterance information from an electromyogram (EMG) of the surroundings of the lips have been proposed. For example, in the Unexamined Japanese Patent Application Publication No. Hei 6-12483, an apparatus using binarization information of an EMG waveform is designed as an alternative means to image processing. In Kurita et al., "A Physiological Model for the Synthesis of Lip Articulation" (The Journal of the Acoustical Society of Japan, Vol. 50, No. 6 (1994), pp. 465-473), a model for calculating a lip shape from an EMG signal is designed. However, utterance information extraction based on the EMG involves the problem of a large load on the speaker because electrodes with measurement cords must be put on the surroundings of the lips of the speaker. Several techniques of attaching an artificial palate for obtaining a palatographic signal, thereby detecting the tongue motion accompanying the voice production of a speaker for use as a voice input apparatus, have also been invented. For example, in the Unexamined Japanese Patent Application Publication No. Sho 55-121499, means for converting the presence or absence of contact between a transmission electrode attached to an artificial palate and the tongue into an electric signal is proposed. In the Unexamined Japanese Patent Application Publication No. Sho 57-160440, the number of electrodes attached to an artificial palate is decreased for improving tongue touch. In the Unexamined Japanese Patent Application Publication No. Hei 4-257900, a palatographic light reception signal is passed through a neural network, whereby unspecified speakers can be covered. In addition to the use of tongue motion, a device that brings a rod tip into contact with the soft palate, thereby observing vibration of the soft palate, is proposed in the Unexamined Japanese Patent Application Publication No. Sho 64-62123.
However, such a device needs to be attached to the inside of a human body; thus there is a possibility that natural speech action may be disturbed, and the load on the speaker is also large. It is desirable for an utterance state detection apparatus or device to eliminate the need for contacting the human body as much as possible.
A position detection method according to the prior technology of putting markers is shown by taking the Unexamined Japanese Patent Application Publication No. Hei 6-43897 as an example (FIG. 10). In the prior technology, images of the markers M0, M1, . . . , M9 are input from the front, where the features of the lips 905 and their periphery can be best grasped. Thus, the up-and-down (101, 102, 104) and side-to-side (103, 105) motion of the markers accompanying utterance can be detected in two dimensions, but the back-and-forth motion of the markers M0, M1, . . . , M9 accompanying utterance cannot be captured (David G. Stork, Greg Wolff, Earl Levine, "Neural network lipreading apparatus for improved speech recognition," Proc. IJCNN, IEEE, Vol. II, 1992). To detect frontal and back-and-forth motion in three dimensions at the same time, in the prior technology, several television cameras need to be provided for stereoscopically measuring the positions of vocal organs such as the lips. Such technologies are introduced as real-time three-dimensional coordinate output technologies at optical measuring instrument exhibitions, etc., by a number of manufacturers. The measurement sampling rate is 60 Hz, and to enable high speed the markers are upsized (about 20 mm in diameter) and made spherical for facilitating marker extraction processing, so that the marker images show the same round shape independently of the shooting position. Further, the markers are colored in striking colors so that they can be easily extracted. However, such large markers cover most of the lips and lip periphery and thus are not appropriate for detecting delicate motion of the lips and lip periphery with high accuracy.
To improve this defect, if the markers are downsized and made like thin sheets so as not to disturb utterance, the two-dimensional image processing to detect the markers and extract the feature amounts of the vocal organs takes time, and it becomes difficult to detect positions in real time, as described with the Unexamined Japanese Patent Application Publication No. Hei 6-43897. Three-dimensional measurement, which uses two or more cameras at the same time, has the disadvantages of complicated image processing, high equipment costs, and a large size.
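For reference, the stereoscopic measurement described above amounts to triangulating each marker from two camera views. The following is a minimal sketch for an idealized rectified parallel stereo pair; the focal length and baseline values are illustrative assumptions, not figures from the cited publications.

```python
def triangulate(x_left, x_right, y, focal_px, baseline_mm):
    """Recover the 3-D position of a marker from a rectified parallel
    stereo pair. x_left/x_right are the horizontal image coordinates
    (pixels) of the same marker in the two cameras; their difference
    (the disparity) encodes depth."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("marker must lie in front of both cameras")
    z = focal_px * baseline_mm / disparity  # depth (mm)
    x = x_left * z / focal_px               # lateral position (mm)
    y_mm = y * z / focal_px                 # vertical position (mm)
    return (x, y_mm, z)

# Example: assumed 500 px focal length and 60 mm camera baseline
print(triangulate(x_left=120, x_right=100, y=40,
                  focal_px=500, baseline_mm=60))  # -> (360.0, 120.0, 1500.0)
```

The sketch also shows why this approach is costly in practice: the same marker must first be identified in both images, which is exactly the two-dimensional extraction step that dominates processing time.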
The Unexamined Japanese Patent Application Publication No. Hei 7-306692 is disclosed as a technology seemingly similar to the technology proposed in this invention to solve the problems. In the technology proposed in the Unexamined Japanese Patent Application Publication No. Hei 7-306692, the lips and their periphery are irradiated with a ray of light, diffused reflected light from the skin surface is detected at a light receptor, and the strength change of the diffused reflected light is measured, thereby detecting voice information of the speaker. However, diffuse reflection plates such as markers, or the specular reflection plates of this invention, are not put on the lips or their periphery. The relationship between the reflected light strength and the position and motion of the lips is not necessarily clear, and a neural network is used for recognition processing. This technology is explained, as described in the specification, as a technique having low voice detection accuracy that roughly classifies phonemes into categories as an auxiliary means of voice recognition technology. Games with limited situations and expected conversation are shown as one application example in the Unexamined Japanese Patent Application Publication No. Hei 8-187368. In contrast, this invention provides a technology of putting specular reflection plates on skin portions of the vocal organs and their periphery for specifying measurement points and finding the position and angle change of the specific portions accurately by geometrical optics using specular reflection; the invention is entirely different from the Unexamined Japanese Patent Application Publication No. Hei 7-306692.
The problems to be solved by this invention are to lessen the load on the user, improve the voice detection rate as compared with the prior technology, and enable voice detection in real time. The conventional voice detection technology using an image is to input a two-dimensional image of the lip periphery through a television camera, etc., and extract the features at the pronunciation time, thereby detecting voice. Specifically, preprocessing, feature extraction, and classification description are executed for an input image of the lip periphery, and optimum matching with a standard pattern is executed for detecting voice. The preprocessing techniques are classified into noise removal, density conversion, distortion correction, normalization, etc., and the feature extraction techniques are classified into line extraction, area extraction, texture extraction, etc. In line extraction, differential operation and second-order differential operation of an input image are performed for clarifying the contour of the input image, and binarization processing is performed. If the line thus extracted contains a defective point, a curve fitting technique is used to correct the defective point. For area extraction, a density histogram, color image color differences, etc., are used. The periodic fine structure feature of an image provided by two-dimensional Fourier transformation is used to extract the texture of the image. As the classification description technique, feature vectors capable of classifying voices are defined for extracted areas and extracted lines, and the voice best matching a standard pattern statistically in the feature space formed by the feature vectors is selected. Also, a classification description technique focusing attention on the structural phase of the feature pattern and executing syntactic pattern recognition has been proposed. In recent years, a method of applying a neural network to structure determination and phoneme detection has been proposed.
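A minimal illustration of the line-extraction step described above, i.e., a second-order differential (Laplacian) operation followed by binarization; the kernel and threshold are illustrative assumptions, not values from any of the cited specifications.

```python
def laplacian(img):
    """Second-order differential (4-neighbour Laplacian) of a grayscale
    image given as a list of rows; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = (img[i - 1][j] + img[i + 1][j] +
                         img[i][j - 1] + img[i][j + 1] - 4 * img[i][j])
    return out

def binarize(img, threshold):
    """Binarization: 1 where the absolute response exceeds the threshold."""
    return [[1 if abs(v) > threshold else 0 for v in row] for row in img]

# A dark horizontal band (e.g. the lip line) in a bright 5x5 patch
patch = [[200] * 5, [200] * 5, [30] * 5, [200] * 5, [200] * 5]
contour = binarize(laplacian(patch), threshold=100)
for row in contour:
    print(row)
```

The interior rows around the dark band come out as 1s, tracing the contour; this is the step whose cost over a full camera frame makes real-time operation difficult.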
The techniques are extremely intricate as described above; particularly the preprocessing and feature extraction take time in two-dimensional image processing and are improper for voice detection in real time. When utterance is given in a small voice, etc., with small lip opening and closing amounts, the movement amounts of markers put on the lips and their periphery are small, and their positions cannot be detected with good accuracy.
On the other hand, the direct measurement technology of the state and positions of the utterance organs is high in measurement accuracy of the target part, but the load on the user is extremely large; moreover, even if the state of a specific articulation organ is measured with high accuracy, voice produced by the total motion of the articulation organs cannot be detected.