A mechanical apparatus that uses electrical or magnetic action to perform movements resembling those of a human being, as a living organism, is termed a robot. In this country, robots came into use toward the end of the 1960s. Most of them were industrial robots, such as manipulators or transfer robots, intended to automate production operations in plants and so dispense with human labor.
Recently, development has been proceeding on utility robots that support human life as partners of human beings, that is, robots that support human activities in various situations of everyday life, such as in the living environment. In distinction from industrial robots, utility robots have the faculty of learning how to adapt themselves to human users of differing personalities and to the varying aspects of different situations in the human living environment. For example, legged mobile robots are already being put to practical use, such as pet-type robots that simulate the physical mechanism or movements of four-legged animals, such as dogs or cats, and humanoid robots designed after the bodily mechanism and behavior of human beings, who stand erect and walk on two legs. These legged mobile robots closely resemble animals or human beings in appearance, can behave much as animals or human beings do, in contradistinction to industrial robots, and can also perform various entertainment-oriented actions. They are therefore sometimes called entertainment robots.
Some legged mobile robots have small-sized cameras, equivalent to eyes, and sound-collecting microphones, equivalent to ears. Such robots execute image processing on the acquired images to recognize the surrounding environment input as image information, or recognize “language” from the surrounding sound that is input.
In particular, the technique of recognizing speech acquired from outside and converting it into letters or characters, or of recognizing the speech and making a reply, is being used not only in legged mobile robots but also in various electronic equipment, such as personal computers.
Conventional speech recognition techniques use a dictionary for speech recognition (referred to below as a dictionary for recognition), in which the pronunciation of a given word is stored in association with its notation. This technique has the deficiency that a word not registered in the dictionary for recognition cannot be recognized. Moreover, to recognize the pronunciation of a word sequence, such as a sentence, the words registered in the dictionary for recognition must be combined together. That is, if a sentence contains any word not yet registered in the dictionary for recognition, the sentence is misrecognized or cannot be recognized at all.
Take the example of “Kitashinagawa” (the name of a station, uttered as kitashinagawa). If “Kitashinagawa” has not been registered in the dictionary for recognition, neither the pronunciation of “Kitashinagawa” itself nor speech containing it, such as the word sequence “Where is Kitashinagawa?” (uttered as kitashinagawawa, dokodesuka), can be recognized, or the “Kitashinagawa” portion is misrecognized. Thus, for words not registered in the dictionary for recognition to be recognized, the unregistered words need to be additionally registered.
In the dictionary for recognition held by the speech recognition device to enable speech recognition, a “word symbol” for a given word, serving as an identifier that distinguishes the word from other words, is associated with a “PLU sequence” representing the pronunciation information for that word. A PLU (phoneme-like unit) is an acoustic or phonetic unit. Any uttered speech can be expressed as a combination of PLUs (a PLU sequence).
Therefore, to register a word in the dictionary for recognition, it is only necessary to add a word symbol and an associated PLU sequence. It is noted, however, that for a word symbol and an associated PLU sequence to be added in this manner, the notation of “Kitashinagawa” or its reading “kitashinagawa” must be input directly using suitable inputting means, such as a keyboard.
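The association between a word symbol and its PLU sequence can be pictured with a minimal Python sketch. The function names, the in-memory mapping, and the particular PLU inventory used to spell “kitashinagawa” are all assumptions made for illustration; the source does not specify how the dictionary is stored.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory dictionary for recognition:
# word symbol -> PLU sequence (the PLU inventory below is assumed).
RecognitionDictionary = Dict[str, List[str]]


def register_word(dictionary: RecognitionDictionary,
                  word_symbol: str,
                  plu_sequence: List[str]) -> None:
    """Add a word symbol together with its associated PLU sequence."""
    dictionary[word_symbol] = list(plu_sequence)


def find_word(dictionary: RecognitionDictionary,
              plu_sequence: List[str]) -> Optional[str]:
    """Return the word symbol whose PLU sequence matches exactly, if any."""
    for word_symbol, registered in dictionary.items():
        if registered == list(plu_sequence):
            return word_symbol
    return None


dictionary: RecognitionDictionary = {}
kitashinagawa = ["k", "i", "t", "a", "sh", "i", "n", "a", "g", "a", "w", "a"]

# Before registration the word is unknown to the recognizer.
before = find_word(dictionary, kitashinagawa)

# Registration presupposes that the reading was typed in, e.g. on a keyboard.
register_word(dictionary, "kitashinagawa", kitashinagawa)
after = find_word(dictionary, kitashinagawa)
```

An exact match is used here only to keep the sketch short; an actual recognizer scores acoustic hypotheses against the registered PLU sequences rather than comparing them for equality.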
Thus, for an apparatus without such inputting means, for example a robot apparatus that has no keyboard, there is a method in which the pronunciation of a word acquired as speech is acoustically recognized to obtain a PLU sequence for the unknown word. In this case, recognition is performed using a garbage model. The garbage model, which is applied only to Japanese, is a model in which speech is represented as a combination of “phonemes”, the basic units of pronunciation, or as a combination of kana (the Japanese syllabary), the basic units of word reading.
In a conventional speech recognition device, a garbage model is applied to the input speech to obtain a result of recognition, a word symbol is assigned to that result, and the two are registered in association with each other as a new word in the dictionary for recognition.
It should be noted that “phoneme” and “PLU” are approximately synonymous, and that a “PLU sequence” represents the pronunciation of a word formed by concatenating a plurality of PLUs.
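The registration flow with a garbage model can be sketched in Python as follows. This is a deliberately simplified toy under stated assumptions: the per-frame phoneme scores, the greedy frame-by-frame decoding, and the automatically generated symbols of the form `OOV001` are all hypothetical, and a real garbage model decodes against acoustic models rather than ready-made score tables.

```python
from typing import Dict, List

# Hypothetical input: one score table per speech frame, phoneme -> score.
FrameScores = List[Dict[str, float]]


def decode_with_phoneme_loop(frame_scores: FrameScores) -> List[str]:
    """Pick the best-scoring phoneme in each frame with no word constraints,
    merging consecutive repeats into one PLU (toy stand-in for a garbage model)."""
    plu_sequence: List[str] = []
    for scores in frame_scores:
        best = max(scores, key=scores.get)
        if not plu_sequence or plu_sequence[-1] != best:
            plu_sequence.append(best)
    return plu_sequence


def register_unknown_word(dictionary: Dict[str, List[str]],
                          frame_scores: FrameScores) -> str:
    """Assign a fresh word symbol to the decoded PLU sequence and register both."""
    word_symbol = f"OOV{len(dictionary) + 1:03d}"  # assumed naming scheme
    dictionary[word_symbol] = decode_with_phoneme_loop(frame_scores)
    return word_symbol


clean_frames: FrameScores = [
    {"k": 0.9, "h": 0.1},
    {"k": 0.6, "h": 0.4},
    {"i": 0.8, "a": 0.2},
    {"t": 0.7, "s": 0.3},
    {"a": 0.9, "i": 0.1},
]
# Same utterance, but noise makes /h/ outscore /k/ in the first frame.
noisy_frames: FrameScores = [{"k": 0.4, "h": 0.6}] + clean_frames[1:]

dictionary: Dict[str, List[str]] = {}
clean_symbol = register_unknown_word(dictionary, clean_frames)
noisy_symbol = register_unknown_word(dictionary, noisy_frames)
```

Because nothing constrains the decoded sequence to known words, a single noisy frame changes the registered pronunciation, and the same utterance ends up stored under two different PLU sequences.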
The conventional speech recognition technique applying the garbage model has the deficiency that recognition accuracy tends to be lowered by subtle differences in pronunciation from user to user, even when the same word is uttered; by the weakness of particular phonemes, such as /s/ at the beginning of a word, which are inherently liable to be misrecognized; by changes in phonemes caused by surrounding noise; or by failure to detect the speech domains.
In particular, if the speech recognition device is applied to a robot apparatus, the sound-collecting microphone on the speech recognition device side is, in the majority of cases, located away from the user, so that mistaken recognition is likely to occur frequently.
When “Kitashinagawa” (the name of a station, uttered as kitashinagawa) is to be recognized, the result of recognition tends to be, for example, a PLU sequence such as “hitotsunanoga” or “itasunaga:”, which is similar to but not the same as “Kitashinagawa” in pronunciation. If speech recognition is performed using a dictionary for recognition in which words have been registered by such a method, not only is the recognition accuracy lowered, but display errors due to mistaken recognition are also produced. That is, since incorrect PLU sequences are assigned to newly registered words, the accuracy with which these words are recognized is lowered.
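How far such a misrecognized sequence drifts from the intended one can be illustrated with an edit distance over PLUs. The sketch below is only an illustration and not part of the conventional technique: the choice of Levenshtein distance and the segmentation of both words into PLUs are assumptions.

```python
from typing import List, Sequence


def plu_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two PLU sequences
    (unit cost for insertion, deletion and substitution)."""
    previous = list(range(len(b) + 1))
    for i, plu_a in enumerate(a, start=1):
        current = [i]
        for j, plu_b in enumerate(b, start=1):
            cost = 0 if plu_a == plu_b else 1
            current.append(min(previous[j] + 1,          # delete plu_a
                               current[j - 1] + 1,       # insert plu_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Assumed PLU segmentations of the intended word and of one misrecognized result.
intended: List[str] = ["k", "i", "t", "a", "sh", "i", "n", "a", "g", "a", "w", "a"]
recognized: List[str] = ["h", "i", "t", "o", "ts", "u", "n", "a", "n", "o", "g", "a"]

drift = plu_edit_distance(intended, recognized)
```

A nonzero `drift` means that the pronunciation stored for the new word differs from the one the user actually intended, which is why later utterances of the same word match poorly.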
When a user other than the user who registered a word utters that word, there have been occasions where speech containing the word “Kitashinagawa” cannot be recognized, even though “Kitashinagawa” has been registered in the dictionary for recognition, because of differences in accent from user to user.
Moreover, if the results of speech recognition are converted into letters or characters for display, there are occasions where mistaken letters or characters are displayed, because no display information has been assigned to the newly registered word. If, after registering “Kitashinagawa” by speech, the user utters “I want to go to Kitashinagawa” (uttered as kitashinagawa ni yukitai) to the speech recognition device, the display may read, for example, “I want to go to hitotsunanoga” (hitotsunanoga ni yukitai), even though the pronunciation “Kitashinagawa” has been correctly recognized by the speech recognition device. There is also the inconvenience that, when the speech recognition device repeats the PLU sequence of the recognition result by speech synthesis, only the portion corresponding to the newly registered word is uttered with unnatural junctions within the PLU sequence as a whole.
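The display problem can be sketched in Python as a lexicon entry that lacks a notation field. The `Entry` class, the fallback to the raw PLU string, and the symbol `OOV001` are hypothetical and serve only to illustrate why the registered PLU sequence ends up on screen.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    plu_sequence: List[str]
    notation: Optional[str] = None  # display form; absent for speech-registered words


def display_form(entry: Entry) -> str:
    """Use the notation when there is one, otherwise fall back to the PLU string."""
    if entry.notation is not None:
        return entry.notation
    return "".join(entry.plu_sequence)


def render(word_symbols: List[str], lexicon: Dict[str, Entry]) -> str:
    """Convert a recognized sequence of word symbols into text for display."""
    return " ".join(display_form(lexicon[symbol]) for symbol in word_symbols)


lexicon: Dict[str, Entry] = {
    # Registered by speech: incorrect PLU sequence, and no notation at all.
    "OOV001": Entry(["h", "i", "t", "o", "ts", "u", "n", "a", "n", "o", "g", "a"]),
    "ni": Entry(["n", "i"], "ni"),
    "yukitai": Entry(["y", "u", "k", "i", "t", "a", "i"], "yukitai"),
}

shown = render(["OOV001", "ni", "yukitai"], lexicon)

# If a notation were available, the same recognition result would display correctly.
lexicon["OOV001"].notation = "Kitashinagawa"
shown_with_notation = render(["OOV001", "ni", "yukitai"], lexicon)
```

The recognizer picks the right word symbol in both cases; only the missing notation causes the incorrect PLU string to be displayed.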
Additionally, if a new word is registered by the garbage model, it is not possible to register information concerning the attributes of the registered word, such as its part of speech or meaning. For example, if “Kitashinagawa” has been registered, it is not possible to register whether the word is a noun or a place name. As a result, if a grammatical rule for particular expressions, such as “<word indicative of place name> + wa + doko + desu + ka” (Where is the <word indicative of place name>?), is pre-recorded in, for example, the grammar for dialog or a language model for recognition, the rule cannot be applied to the newly registered word. Although word attributes can be input by speech at the time of registration, the user is then required to be aware of the word attributes. Moreover, it is troublesome for the user to input not only the word but also its attributes for registration.
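Why such a rule fails for a word registered without attributes can be sketched as follows. The rule encoding, the attribute name `place_name`, the example word `tokyo`, and the symbol `OOV001` are assumptions for illustration and do not come from the source.

```python
from typing import Dict, List, Optional

# Assumed encoding of the rule "<word indicative of place name> + wa + doko + desu + ka".
PLACE_QUESTION_RULE: List[str] = ["<place_name>", "wa", "doko", "desu", "ka"]


def matches_rule(words: List[str],
                 rule: List[str],
                 attributes: Dict[str, Optional[str]]) -> bool:
    """Check a word sequence against a rule whose <...> slots name a word attribute."""
    if len(words) != len(rule):
        return False
    for word, slot in zip(words, rule):
        if slot.startswith("<") and slot.endswith(">"):
            # Attribute slot: the word must carry exactly this attribute.
            if attributes.get(word) != slot[1:-1]:
                return False
        elif word != slot:
            # Literal slot: the word must be identical.
            return False
    return True


attributes: Dict[str, Optional[str]] = {
    "tokyo": "place_name",  # word entered together with its attribute
    "OOV001": None,         # word registered by speech: no attribute available
}

known_ok = matches_rule(["tokyo", "wa", "doko", "desu", "ka"],
                        PLACE_QUESTION_RULE, attributes)
new_word_ok = matches_rule(["OOV001", "wa", "doko", "desu", "ka"],
                           PLACE_QUESTION_RULE, attributes)
```

The word registered by speech is in the dictionary but fills no attribute slot, so the grammar for dialog cannot use it in the place-name question.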