This application is based upon and claims the benefit of priority of Japanese Patent Applications No. Hei-11-20349, filed on Jan. 28, 1999, and No. Hei-11-210819, filed on Jul. 26, 1999, the contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to an apparatus for determining appropriate series of words by selecting them from among candidate series of words delivered from a device that recognizes words inputted from outside in a form such as spoken voice.
2. Description of Related Art
Voice recognition systems are known in which words included in spoken voice are compared with words contained in an installed dictionary, and the words that closely match the dictionary are output in the form of a series of words. This kind of system has been used as an apparatus for controlling various devices by means of machine-recognized spoken voice. For example, a user of an on-board navigation system gives a command to the system by voice, and the system automatically searches for and displays desired information based on the recognized command.
Conventional voice recognition systems generally use one of two methods: a continuous word recognition method or a word spotting method. In the continuous word recognition method, each word contained in the user's voice, e.g., “Okazaki, a restaurant X, ramen,” can be recognized by the machine, but a certain amount of mis-recognition cannot be avoided. The conventional navigation system, therefore, generates plural candidate results of recognition and reads back one of them to the user for confirmation. The user rejects the read-back words if they differ from what he or she spoke. Then, the system presents another candidate and asks for confirmation. This process is repeated until the correct words are presented to the user. It is time consuming, and it may take a long time before a correct recognition result is finally shown to the user.
On the other hand, the word spotting method has a certain advantage. For example, informally spoken words, such as “Well . . . , I wanna eat ramen at X (name of a restaurant) in Okazaki,” can be analyzed, and keywords such as “Okazaki,” “X,” “ramen” and “eat” can be picked up. For this reason, the word spotting method has recently been attracting attention in the voice recognition field. However, this method generates a large number of candidate series of words in the form of a so-called lattice, that is, a group of words accompanied by time-related information and probability information, and it is rare that only a small number of meaningful candidates are presented. The number of words that can be recognized in the word spotting method at present is about 100, but it is expected to increase to more than 1000 in the near future. As a result, the number of candidate series of words generated from the lattice will become much larger. Therefore, the word spotting method has the same problem as the continuous word recognition method, and the problem resulting from too many candidates may be even more serious in the word spotting method.
The problem mentioned above exists not only in voice recognition but also in written character recognition and in image recognition. Input data in any form are compared with the data contained in a memory, and the data that closely match the data in the memory are selected as candidate results. If there are too many candidates, including inappropriate ones, it takes a long time to finally reach a correct result. Moreover, it is difficult to return the conventional system to a normal operating mode when a serious error occurs in the system for various reasons such as input noise or changes in circumstances. For example, if a user inadvertently speaks the name of a station even though he or she intends to input the name of a place, the system enters a mode for selecting a station name and no longer reacts to newly input place names. It is preferable, on the one hand, to reduce the number of candidate recognition results by consulting a dictionary, but there is a possibility, on the other hand, that the system does not return to a desired mode once it has entered another mode. If the system enters an undesired mode and the user does not know how to escape from it, he or she is left in serious trouble.
The present invention has been made in view of the above-mentioned problems, and an object of the present invention is to provide an improved apparatus for selecting and determining appropriate series of words. Another object of the present invention is to provide a system that can easily return to a desired mode even if the system has entered an undesired mode.
A system, or an apparatus, according to the present invention recognizes and determines an appropriate series of words based on the user's voice inputted to the system. A user's utterance including a series of words is fed to a voice recognition device, and then plural candidate series of words are generated in the system. A few appropriate series of words are selected from the plural candidates based on verification as to whether the candidates are statistically appropriate as a natural language. In other words, plural candidate series of words are filtered through the system so that only a few (e.g., three or fewer) appropriate ones are shown to the user for his/her final confirmation.
The appropriateness of a series of words is evaluated based on various factors including grammar, meaning, common sense, the user's personal information, sentence structure, likelihood values attached to each word and to the series of words, situations surrounding the user, and so on. Among those factors, evaluation based on scores given in sentence structure tables plays an important role. All possible orders of words included in a series of words are listed in the sentence structure tables, and an evaluation score is given to each order of words. Series of words having a score higher than a predetermined level are selected as appropriate ones.
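The table-based evaluation described above can be illustrated with a minimal sketch. It is not the implementation of the invention: the category names, the score values, the `WORD_CATEGORY` lookup, and the function names are all assumptions introduced for illustration, and only the sentence structure factor is modeled.

```python
# Hypothetical sentence structure table: each order of word categories is
# paired with an evaluation score. The categories and scores are made up.
SENTENCE_STRUCTURE_TABLE = {
    ("place", "restaurant", "food"): 10,
    ("place", "food", "restaurant"): 7,
    ("restaurant", "place", "food"): 6,
    ("food", "place", "restaurant"): 3,
    ("food", "restaurant", "place"): 2,
    ("restaurant", "food", "place"): 2,
}

# Hypothetical mapping from recognized words to categories.
WORD_CATEGORY = {"Okazaki": "place", "X": "restaurant", "ramen": "food"}


def structure_score(series):
    """Look up the score for the order of categories in a series of words."""
    order = tuple(WORD_CATEGORY.get(word, "unknown") for word in series)
    return SENTENCE_STRUCTURE_TABLE.get(order, 0)  # unlisted orders score 0


def select_appropriate(candidates, threshold, max_results=3):
    """Keep candidates scoring above the threshold, best first."""
    scored = [(structure_score(c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [series for _, series in kept[:max_results]]


candidates = [
    ["Okazaki", "X", "ramen"],
    ["ramen", "X", "Okazaki"],
    ["X", "Okazaki", "ramen"],
]
selected = select_appropriate(candidates, threshold=5)
```

With the illustrative scores above, the second candidate falls below the threshold and is dropped, so only two series of words would be shown to the user.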
To determine the appropriate series of words, the present invention also provides various processes. One is progressive searching, in which a few candidate words corresponding to the first word inputted are generated in the system by referring to a recognition dictionary. Then, the user of the system selects a proper word from the candidates, and the system dynamically restructures the recognition dictionary so that it includes only words relating to the selected word. This process is repeated until the whole series of words inputted has been recognized and determined. It is also possible to show the user the candidate words only when the next word is not fed within a predetermined period of time.
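The progressive searching loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the disclosed implementation: the `RELATED` network and the dictionary contents are invented, the standard-library `difflib.get_close_matches` stands in for the voice recognition device, and the user's selection is simulated by taking the first candidate.

```python
import difflib

# Hypothetical full recognition dictionary and word-relation network.
FULL_DICTIONARY = {"Okazaki", "Nagoya", "X", "Y", "station", "ramen", "gyoza", "sushi"}
RELATED = {
    "Okazaki": {"X", "Y", "station"},
    "X": {"ramen", "gyoza"},
    "Y": {"sushi"},
}


def restructure(selected_word, full_dictionary):
    """Return a dictionary limited to words related to the selected word."""
    return RELATED.get(selected_word, set()) & full_dictionary


def recognize(heard, active_dictionary, max_candidates=3):
    """Stand-in recognizer: the closest string matches in the active dictionary."""
    return difflib.get_close_matches(
        heard, sorted(active_dictionary), n=max_candidates, cutoff=0.5
    )


def progressive_search(inputs, choose=lambda candidates: candidates[0]):
    """Determine words one by one, narrowing the dictionary after each choice."""
    determined = []
    active = set(FULL_DICTIONARY)
    for heard in inputs:
        candidates = recognize(heard, active)
        if not candidates:
            break  # nothing in the restructured dictionary matches
        word = choose(candidates)  # simulated user selection
        determined.append(word)
        active = restructure(word, FULL_DICTIONARY)
    return determined


result = progressive_search(["Okazaki", "X", "ramen"])
```

Because the dictionary shrinks after each selection, later words are matched against far fewer entries than the full dictionary holds, which is the point of the restructuring step.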
Another is multi-stage processing, in which a dialogue topic or a user's request is first determined, with reference to the recognition dictionary, from the series of words inputted. Then, the recognition dictionary is restructured so that it includes only words relating to the determined dialogue topic. The restructured dictionary is used for generating candidate words corresponding to a word included in the series of words. In restructuring the recognition dictionary, various factors are taken into consideration. Those factors include networks among words and dialogue topics, continuity of a dialogue context, situations surrounding the user, and so on.
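A minimal sketch of the two stages follows. The topic names, keyword sets, and vocabularies are hypothetical, and the topic is chosen by a simple keyword count, which is an assumption made here for illustration; the factors listed above (word networks, context continuity, user situation) are not modeled.

```python
# Hypothetical keywords that signal each dialogue topic.
TOPIC_KEYWORDS = {
    "dining": {"eat", "ramen", "restaurant"},
    "navigation": {"go", "route", "station"},
}

# Hypothetical vocabulary kept in the dictionary for each topic.
TOPIC_VOCABULARY = {
    "dining": {"Okazaki", "X", "ramen", "eat"},
    "navigation": {"Okazaki", "station", "route", "go"},
}


def determine_topic(words):
    """Stage 1: pick the topic whose keywords appear most often, if any."""
    counts = {
        topic: len(keywords & set(words))
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def restructure_dictionary(topic, full_dictionary):
    """Stage 2: limit the dictionary to words relating to the topic."""
    if topic is None:
        return set(full_dictionary)  # no topic found, keep everything
    return TOPIC_VOCABULARY[topic] & set(full_dictionary)


def multi_stage(words, full_dictionary):
    """Return the topic and the words that survive the restructured dictionary."""
    topic = determine_topic(words)
    active = restructure_dictionary(topic, full_dictionary)
    return topic, [word for word in words if word in active]


full = {"Okazaki", "X", "ramen", "eat", "station", "route", "go"}
topic, kept = multi_stage(["Okazaki", "X", "ramen", "eat", "station"], full)
```

In this example the stray word "station" is discarded once the dining topic has been determined, since it is absent from the restructured dictionary.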
Although recognition errors are minimized in the processes of the present invention, it is also important to handle errors properly if they occur. When an erroneous recognition result is shown to the user, he/she responds to the system by uttering negating words such as “It's wrong.” Then, the system provides some alternatives, such as entering a help mode, asking the user a question, showing multiple choices, or initializing the system. Thus, the situation where the user is trapped in trouble caused by mis-recognition is avoided.
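One way to picture this error handling is as a small dispatcher that reacts to negating words. The sketch below is an assumption-laden illustration: the text lists the alternatives but does not say when each is used, so the escalation order, the set of negating phrases, and all names here are hypothetical.

```python
# Hypothetical negating phrases, compared after lowercasing and trimming.
NEGATING_PHRASES = {"it's wrong", "no", "that's not right"}

# Assumed escalation order; the source only lists these as alternatives.
RECOVERY_STEPS = [
    "show_multiple_choices",
    "inquire_user",
    "enter_help_mode",
    "initialize_system",
]


def is_negation(utterance):
    """Return True if the utterance is one of the known negating phrases."""
    return utterance.strip().lower().rstrip(".!") in NEGATING_PHRASES


class ErrorRecovery:
    """Tracks consecutive rejections and picks the next recovery step."""

    def __init__(self):
        self.rejections = 0

    def handle(self, utterance):
        if not is_negation(utterance):
            self.rejections = 0  # normal input ends the rejection streak
            return "continue"
        index = min(self.rejections, len(RECOVERY_STEPS) - 1)
        step = RECOVERY_STEPS[index]
        self.rejections += 1
        if step == "initialize_system":
            self.rejections = 0  # start over after a full reset
        return step


recovery = ErrorRecovery()
first = recovery.handle("It's wrong.")
second = recovery.handle("It's wrong.")
```

Whatever order is chosen, the last step resets the system, so repeated rejections always lead back to a known state rather than leaving the user stuck in an undesired mode.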
The present invention is also applicable to systems other than the voice recognition system. For example, a series of hand-written words, or picture images such as sign language converted into a series of words, can also be processed in the apparatus of the present invention.
According to the present invention, only a few appropriate series of words are selected from among many candidates based on proper screening processes including restructuring the recognition dictionary. Therefore, recognition errors are minimized, and time required to determine a correct series of words is shortened.