The present invention relates to a system for performing interpretation between an aurally normal person and an aurally handicapped person. Particularly, the invention is concerned with a sign language generation apparatus for converting a composition inputted by the aurally normal person into a sign language and displaying it in the form of an animation of CG (computer graphics), and a sign language interpretation apparatus capable of performing display of results of bidirectional interpretation between the aurally normal person and aurally handicapped person and confirmation of the results.
Sign language is a language for aurally handicapped persons, who communicate information to a partner by means such as facial expressions and the position, direction, moving direction, and moving speed of the hands. It is a system different from the spoken language used by aurally normal persons, which has developed with voice playing the leading part. Accordingly, when an aurally handicapped person converses with an aurally normal person, a conversation carried out in sign language is easier and faster than communication by writing or lipspeech, both of which rely on a spoken language belonging to the system of voice language. Therefore, a conversion system between spoken language and sign language, that is, an automatic sign language interpretation apparatus, has been desired.
The sign language interpretation apparatus consists of a sign language recognition apparatus for recognizing sign language inputted by the aurally handicapped person and converting it into a spoken language, and a sign language generation apparatus for converting a spoken language inputted by the aurally normal person into sign language and generating it in the form of an image.
Conventional techniques concerning the sign language generation apparatus will be described first. As methods of generating sign language, a method of linking together images photographed word by word and displaying the linked images, and a method using animation, have hitherto been available. An example of the former is "A Sign Language Generation System Based on Optical Disc" by Kawai, Tamura and Okazaki, Television Magazine, 14 (1990) (Literature 1), and an example of the latter is "Basic Study on Japanese Sign Language Expression Based on Animation" by Terauchi, Nagashima (Yu), Mihara, Nagashima (Hide) and Yamato, Information Processing Society of Japan, Human Interface, 41-7 (1992) (Literature 2). The sign language image generated in Literature 2 is a two-dimensional line drawing, and animation is generated by selecting image patterns from a series of motions, registering them, and automatically interpolating between the registered image patterns. The joint positions are inputted from a keyboard for every key frame with reference to a sign language operation code.
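The automatic interpolation between registered image patterns described for Literature 2 can be illustrated with a minimal sketch. Literature 2's actual interpolation method is not given here, so the sketch assumes simple linear interpolation of two-dimensional joint positions; the joint names, the coordinates, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Pose = Dict[str, Point]  # joint name -> (x, y) position in the line drawing


def interpolate_pose(start: Pose, end: Pose, t: float) -> Pose:
    """Linearly blend two key-frame poses; t=0 gives start, t=1 gives end."""
    return {
        joint: (
            start[joint][0] + (end[joint][0] - start[joint][0]) * t,
            start[joint][1] + (end[joint][1] - start[joint][1]) * t,
        )
        for joint in start
    }


def in_between_frames(start: Pose, end: Pose, count: int) -> List[Pose]:
    """Return `count` intermediate poses strictly between two key frames."""
    return [interpolate_pose(start, end, i / (count + 1)) for i in range(1, count + 1)]


# Hypothetical key frames (assumed values, not taken from Literature 2).
key_a = {"wrist": (0.0, 0.0), "elbow": (10.0, 0.0)}
key_b = {"wrist": (4.0, 8.0), "elbow": (10.0, 4.0)}
frames = in_between_frames(key_a, key_b, 3)
```

With three in-between frames, the middle frame lies halfway between the two key frames for every joint. Only the key frames need to be registered, which is why the choice of key frames and the effort of entering their joint positions matter in such a scheme.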
In order for a conversation in sign language to be carried out smoothly, the animation must be displayed fast enough that the rhythm of the conversation is not disrupted. Also, in order for the shape of the hand to be recognized correctly, the image needs to be three-dimensional. Further, since the hand is complicated in structure and small in size, and the meaning of a sign differs depending on the angles of the finger joints, a sign is recognized more easily when the angles of the hand and fingers are displayed with some exaggeration than when they are displayed realistically.
In a synthesis method that uses, as they are, images obtained by photographing actual sign language, as in Literature 1, the image is three-dimensional but the amount of information is large. A sign language CG dictionary in which sign language words are registered must store 2000 or more sign language words, each of which has about 60 color images; therefore, when the amount of information to be stored per sign language word is large, a large memory capacity is needed. Also, it is difficult to prepare images of the same person under the same conditions, which poses a significant problem for dictionary maintenance. Further, it is very difficult to perform natural interpolation for the elbow and arm between displayed word images.
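The scale of the storage requirement can be estimated with a short calculation. The word count and the number of images per word are the figures given above; the image resolution and color depth are not stated in this specification and are assumed here purely for illustration, as is the function name.

```python
def dictionary_storage_bytes(words: int, images_per_word: int,
                             width: int, height: int, bytes_per_pixel: int) -> int:
    """Uncompressed storage needed for a dictionary of photographed word images."""
    return words * images_per_word * width * height * bytes_per_pixel


# 2000 words and 60 images per word come from the text above.
# The 320x240 resolution and 3 bytes per pixel are assumptions for illustration.
total_images = 2000 * 60
total = dictionary_storage_bytes(2000, 60, 320, 240, 3)
```

Under these assumed image parameters the dictionary holds 120,000 images and occupies 27,648,000,000 bytes (about 27.6 GB) uncompressed. The actual figure depends on the resolution, color depth, and any compression used.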
On the other hand, in the conventional technique using CG of Literature 2, the CG is a two-dimensional line drawing; therefore, depth cannot be perceived and the detailed shape of the fingers is difficult to recognize. Also, since the elbow position is inputted from the keyboard for every key frame, the operation is complicated, and there is the further problem of deciding which portions of a sign language word should be selected as key frames.
In addition, a conventional technique using a glove type sensor for recognition of sign language has been proposed in, for example, the sign language translation apparatus of JP-A-4-134515 (Literature 3). However, it is difficult to use the data of the glove type sensor, as it is, for registering the sign language word patterns used to generate and display sign language. This is because the fingers of the human body model wearing the glove type sensor cannot be extended or bent sufficiently, and therefore some data processing is needed before the data from the glove type sensor can be used for generation and display of sign language. Conceivably, this is due to a habit of the person performing sign language while wearing the glove type sensor, or to accuracy limitations of the glove type sensor. Also, the naturalness of the sign language animation is sometimes impaired by fine vibration (noise) contained in the coordinate position data. Further, when a set of standard glove type sensors is used, information can be obtained only for the portion from the wrist to the fingertips, and hence the elbow position and the joint angles at the shoulder and elbow must be determined through calculation.
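As an illustration of the kind of data processing that could reduce the fine vibration (noise) in glove sensor data, the following minimal sketch applies a centered moving average to a series of joint angle samples. The specification does not state which processing is used, so the moving average is an assumed, commonly used technique; the sample values and the function name are hypothetical.

```python
from typing import List


def moving_average(samples: List[float], window: int = 3) -> List[float]:
    """Smooth a 1-D series with a centered moving average.

    Near the ends the window is truncated so the output keeps the input length.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical finger joint angle readings (degrees) with small jitter.
raw_angles = [30.0, 33.0, 30.0, 33.0, 30.0]
smooth_angles = moving_average(raw_angles, window=3)
```

A wider window suppresses more jitter but also blurs quick, intentional finger movements, so the window size would have to be chosen against the speed of the signing motion.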
Next, a conventional technique concerning the whole of a sign language interpretation apparatus will be described.
Available as conventional techniques concerning the sign language interpretation apparatus are a technique for translating sign language into a corresponding voice language and displaying the result, and a technique for translating voice language into sign language and displaying the result. In the translation of sign language into voice language, sign language words contained in the sign language are recognized through pattern retrieval and a neural network, and the corresponding word names in voice language are determined and displayed. Further, a technique is available in which a series of recognized words is shaped into the form of a sentence and delivered. In the translation of voice language into sign language, as mentioned in Literatures 1 and 2 above, an inputted sentence in voice language is decomposed into a series of words, and sign language CG or a sign language video image corresponding to each word is displayed. (In the present specification, "voice language" does not mean only language actually uttered in voice but is used in the wider meaning of expressions used in spoken language. It may be inputted using a separate medium such as a keyboard. It may be considered to be substantially synonymous with "spoken language".)
On the other hand, for the voice recognition apparatus and the machine translation apparatus, which involve techniques related to the sign language interpretation apparatus, a technique has been proposed in which results of recognition and results of translation are displayed, and an input person confirms and modifies them. In this technique, a method is principally adopted wherein a plurality of candidates are determined and are enumerated and displayed to the input person so that the input person can select the correct candidate from among them. As for the display of candidates, in voice recognition, for example, a word name or sentence corresponding to the inputted voice is displayed, and in Japanese-English translation, a Japanese word or an English word or sentence corresponding to a Japanese sentence is displayed.
The above conventional techniques concerning the sign language interpretation apparatus do not address methods for display, confirmation and modification of the results of interpretation, which are important to a sign language interpretation apparatus serving as a communication support.
Suppose that the methods for display, confirmation and modification used in the aforementioned voice recognition apparatus and machine translation apparatus are applied as they are. If, for example, sign language is merely recognized and translated into voice language and the results of translation are simply displayed, then, because some aurally handicapped persons have difficulty understanding voice language, the input person cannot confirm whether the inputted sign language has been recognized correctly and communicated to the partner, nor make a correction when it has been communicated erroneously. Likewise, in the translation of voice language into sign language, the aurally normal person cannot confirm whether the translation conveys the correct meaning if only the sign language resulting from the translation is presented.