1. Field of the Invention
The present invention relates to an interactive device that performs a response action corresponding to the contents of utterance by a user.
2. Description of the Related Art
Interactive devices are utilized in communication robots and the like capable of communicating with users and are required to recognize user's input voice that is continuous without clear breaks between sentences. Accordingly, the interactive devices employ a continuous voice recognition system using statistical voice recognition processes. One-path search and multi-path search are known as search algorithms in such statistical voice recognition processes.
The one-path search is a method that searches for word candidates only once in the input voice, as shown in FIG. 16A. Since the one-path search performs a continuous search throughout the duration of the user's utterance from the starting end to the terminal end, it has the advantage of relatively high recognition accuracy. The one-path search, however, has difficulty using complicated models (acoustic models and language models), which require processing of an increasing amount of data as the vocabulary becomes large.
For this reason, the multi-path search, which performs a plurality of searches on the contents of the user's utterance, has been widely used, as shown in FIG. 16B. In the multi-path search, first, a first-path search is performed in a direction from the starting end to the terminal end of the utterance duration using simple (coarse) models, and then a second-path search is performed in a direction from the terminal end to the starting end of the utterance duration using complicated (sophisticated) models. The multi-path search is advantageous in that multi-path search software is easy to implement in the device, because the overall volume of computation is reduced by using a simple model in the first-path search and switching to a sophisticated model only in the second-path search.
On the other hand, the multi-path search shown in FIG. 16B has a problem that a voice recognition result cannot be output until the second-path search is completed up to the terminal end of the utterance duration. That is, the multi-path search shown in FIG. 16B does not allow a timely sequential output of recognition results halfway through the utterance duration, so a key phrase for determining a response action is unavailable until a recognition result is output at the terminal end of the utterance duration, even if the key phrase has already appeared halfway through the utterance duration. For this reason, there is a need to make a quick decision on voice recognition results halfway through the utterance duration according to certain criteria, so that the voice recognition results can be output sequentially while the voice is being input by the user.
To solve this problem, continuous recognition techniques have been proposed that divide an utterance duration into sections of a predetermined length, allow quick decision of recognition results with respect to these sections, and sequentially output the recognition results thus obtained, as shown in FIG. 16C (for example, Japanese Unexamined Patent Publication No. 6-259090 (see FIG. 1) (hereinafter referred to as Patent Document 1); O. Segawa, K. Takeda and F. Itakura: Continuous Utterance Recognition without End-point Detection, Voice Language Information Processing, 34-18, pp. 101-106, December 2000 (hereinafter referred to as Nonpatent Document 1); T. Imai, H. Tanaka, A. Ando and H. Isono: Progressive Early Decision of Utterance Recognition Results by Comparing Most Likely Word Sequences, The Journal of the Institute of Electronics, Information and Communication Engineers (J. IEICE), D-II, Vol. J84-D-II, No. 9, pp. 1942-1949, September 2001 (hereinafter referred to as Nonpatent Document 2)). Such continuous recognition techniques are utilized mainly in automation of, for example, phonetic transcription of utterances and preparation of subtitles using a voice recognition system.
Patent Document 1 proposes a voice interactive system that recognizes input voice in an utterance duration, extracts a sequence of semantic expressions from the input voice, divides the sequence of semantic expressions into units of meaning and performs processing of each unit of meaning. Nonpatent Document 1 proposes a technique of setting a frame interval, for which quick decision is made, to 1.5-3 seconds and searching, in a first-path search, a last word in and around each frame interval to thereby prevent a decrease in a recognition rate resulting from a short utterance duration. Nonpatent Document 2 proposes a continuous voice recognition technique of searching a last word that enables quick decision by comparing most likely word sequences for every interval of 300 msec in a one-path search, thereby reducing an average delay time in word decision to 512 msec.
However, the techniques of Patent Document 1 and Nonpatent Documents 1 and 2, which use a result of the first-path search (hereinafter referred to as the first path where appropriate) for specifying intervals at which user's utterance is divided into frames for quick decision, have a problem that a word division error, if any, in a search result of the first path affects a search result in the second path search (hereinafter referred to as the second path where appropriate), resulting in a decreased recognition rate.
Further, continuous recognition through voice interaction requires faster responses than that for phonetic transcription of utterances and preparation of subtitles, and thus requires quick decision to be made at shorter intervals. In the techniques of Patent Document 1 and Nonpatent Documents 1 and 2, if the response speed is increased by shortening each of the intervals by which user's utterance is divided into frames for quick decision, the length of each of voice recognition sections is reduced, making it difficult to search for word boundaries and thereby decreasing the recognition rate.
The present invention has been made in view of the above problems, and it is an object of the present invention to provide an interactive device which allows quick decision of utterance recognition results and sequential output of the utterance recognition results and which diminishes a decrease in the recognition rate even if user's utterance is divided by a short interval into frames for quick decision.
In order to solve the above problems, the present invention provides an interactive device that recognizes input voice of a user and thereby contents of utterance of the user and performs a predetermined response action corresponding to the recognized contents, the interactive device comprising:
a recognition section setting means that sets a recognition starting point to an utterance starting end frame serving as a starting end of the user's utterance in the input voice and sets a recognition terminal point to a frame which is a predetermined length of time ahead of the recognition starting point to thereby set a recognition section throughout which voice recognition is performed,
a voice recognition means that performs voice recognition for the recognition section,
a response action determining means that, if a recognition result by the voice recognition means includes a key phrase, determines a response action associated with the key phrase, and
a response action executing means that executes the response action determined by the response action determining means,
the recognition section setting means repeatedly updating the frame set as the recognition terminal point to a frame which is the predetermined length of time ahead of the recognition terminal point, to thereby set a plurality of recognition sections having different recognition terminal points, and
the voice recognition means performing voice recognition on each of the plurality of recognition sections having different recognition terminal points.
In the interactive device as described above, the recognition section setting means divides the user's utterance duration at the recognition terminal points into predetermined lengths of time to set a plurality of recognition sections having different lengths. The voice recognition means performs voice recognition with respect to each of the recognition sections. This allows quick decision of a voice recognition result at every recognition terminal point. That is, a recognition result (a partial recognition result) can be output for each of the plurality of recognition sections.
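The setting of recognition sections described above can be illustrated with a minimal sketch. It assumes that frames are indexed by integers and that the utterance terminal end frame is already known (a real device would detect it while the voice is being input); the function name `recognition_sections` and the parameter names are hypothetical and do not appear in the source.

```python
from typing import Iterator, Tuple


def recognition_sections(
    start_frame: int, end_frame: int, step_frames: int
) -> Iterator[Tuple[int, int]]:
    """Yield (recognition starting point, recognition terminal point) pairs.

    The starting point stays fixed at the utterance starting end frame.
    The terminal point is advanced by `step_frames` on every update, and
    the final section ends at the utterance terminal end frame.
    All names here are illustrative assumptions.
    """
    if step_frames <= 0:
        raise ValueError("step_frames must be positive")
    if end_frame <= start_frame:
        raise ValueError("end_frame must lie after start_frame")
    terminal = start_frame + step_frames
    while terminal < end_frame:
        yield (start_frame, terminal)
        terminal += step_frames
    # Last section: clamp the terminal point to the end of the utterance.
    yield (start_frame, end_frame)


# Example: a 50-frame utterance divided every 20 frames.
sections = list(recognition_sections(0, 50, 20))
```

Each yielded pair corresponds to one recognition section for which a partial recognition result could be output.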
Preferably, the interactive device according to the present invention has a construction that the recognition section setting means comprises:
a recognition starting point setting unit that detects the utterance starting end frame and sets the recognition starting point at the detected utterance starting end frame,
a recognition terminal point setting unit that sets the recognition terminal point at a frame which is the predetermined length of time ahead of the recognition starting point set by the recognition starting point setting unit; and
a recognition terminal point updating unit that updates repeatedly the recognition terminal point set by the recognition terminal point setting unit to a frame which is the predetermined length of time ahead of the recognition terminal point,
the recognition terminal point updating unit detects an utterance terminal end frame serving as a terminal end of the user's utterance in the input voice and updates the recognition terminal point to the detected utterance terminal end frame, said recognition terminal point being either one of the recognition terminal point set by the recognition terminal point setting unit and the recognition terminal point updated by the recognition terminal point updating unit,
the voice recognition means comprises:
a first-path searching unit that searches word candidates in the user's utterance in a direction from the utterance starting end frame to the utterance terminal end frame, and
a second-path search unit that searches the word candidates in each of the plurality of recognition sections having different recognition terminal points in a direction from the recognition terminal point to the recognition starting point according to a search result produced by the first-path searching unit, and
the response action determining means determines, when a search result produced by the second-path search unit includes the key phrase, the response action corresponding to the key phrase.
In the interactive device as described above, the recognition terminal point updating unit repeatedly updates a recognition terminal point to a frame which is a predetermined length of time ahead of the recognition terminal point, to thereby set a plurality of recognition sections of different lengths. The first-path searching unit performs a search throughout the user's entire utterance duration, and the second-path searching unit performs a search with respect to each of the plurality of recognition sections, achieving voice recognition improved both in speed and accuracy.
Further, preferably, the interactive device according to the present invention has a construction that the recognition section setting means comprises a recognition starting point updating unit that, when the search result by the second-path search unit includes a break in the user's utterance, updates the recognition starting point set by the recognition starting point setting unit to a frame located at a top of the break in the user's utterance, and
the second-path search unit searches the word candidates with respect to each of the plurality of recognition sections having different recognition starting points and different recognition terminal points.
In the interactive device as described above, if a break in the user's utterance duration, such as a short pause, a filler or the like, is detected by the second-path searching unit, the recognition starting point updating unit updates the recognition starting point to a frame located at the top of the break in the utterance duration. Thus, in the interactive device, even if the recognition terminal point updating unit updates the recognition terminal point repeatedly to prolong the recognition section stepwise, the recognition starting point updating unit is able to prevent the recognition section from becoming too long. Consequently, the interactive device is advantageous in that it prevents an excessive prolongation of each recognition section to be reversely searched by the second-path search, which reduces the time taken by the second-path search and maintains a proper response speed.
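The updating of the recognition starting point can be sketched as follows. The sketch assumes that the second-path search result is available as a list of word candidates with frame positions, and that short pauses and fillers are marked with the hypothetical tokens `<sp>` and `<filler>`. Choosing the most recent break is also an assumption; the source states only that the starting point is moved to the top of the break.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str   # recognized word, or a break token
    start: int  # first frame of the word
    end: int    # frame just after the word


# Hypothetical markers for a short pause and a filler.
BREAK_TOKENS = {"<sp>", "<filler>"}


def update_start_point(current_start: int, hypothesis: List[Word]) -> int:
    """Return the recognition starting point after inspecting a search result.

    If the result contains a break, the starting point moves to the first
    frame (the "top") of the most recent break; otherwise it is unchanged.
    """
    for word in reversed(hypothesis):
        if word.text in BREAK_TOKENS:
            return word.start
    return current_start


result = [Word("hello", 0, 30), Word("<sp>", 30, 45), Word("bring", 45, 70)]
new_start = update_start_point(0, result)
```

With the starting point moved forward in this way, later recognition sections cover only the frames from the break onward, so the reverse search does not grow with the whole utterance.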
Still further, preferably, the interactive device according to the present invention has a construction that the key phrase included in the search result by the second-path search unit is made up of a plurality of words.
In the interactive device as described above, the response action determining means determines a response action according to whether or not a search result by the second-path searching unit includes a key phrase made up of a plurality of words. Thus, when continuous voice recognition is performed with respect to each of the short lengths of time (for example, 200 msec) obtained by dividing the user's utterance duration, the interactive device can determine a response action more accurately, because the determination is not based on a single word candidate and a single erroneous word candidate in the search results of the second-path searching unit therefore does not affect the determination of the response action.
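A minimal sketch of matching a key phrase made up of a plurality of words is given below. It assumes that the words of a key phrase must appear contiguously and in order in the search result, which the source does not specify. The table `KEY_PHRASE_ACTIONS` and the action names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical table relating multi-word key phrases to response actions.
KEY_PHRASE_ACTIONS: Dict[Tuple[str, ...], str] = {
    ("bring", "water"): "fetch_water",
    ("good", "morning"): "greet",
}


def contains_key_phrase(words: Sequence[str], key_phrase: Sequence[str]) -> bool:
    """Return True if all words of the key phrase occur contiguously, in order."""
    target = list(key_phrase)
    n = len(target)
    if n == 0:
        return False
    word_list = list(words)
    return any(word_list[i:i + n] == target for i in range(len(word_list) - n + 1))


def find_action(words: Sequence[str]) -> Optional[str]:
    """Return the response action of the first key phrase found, if any."""
    for phrase, action in KEY_PHRASE_ACTIONS.items():
        if contains_key_phrase(words, phrase):
            return action
    return None
```

Under these assumptions, a result containing only the single word "water" selects no action, so one misrecognized word cannot trigger a response by itself.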
Yet further, preferably, the interactive device according to the present invention has a construction that the second-path searching unit calculates a word reliability factor indicative of a degree of plausibility of the searched word candidate, and
the response action determining means determines, when the search result by the second-path searching unit includes the predetermined key phrase and when the word candidates corresponding to the key phrase have word reliability factors each above a predetermined value, the response action corresponding to the key phrase.
In the interactive device as described above, the response action determining means determines a response action only when a search result by the second-path searching unit includes a key phrase and word candidates corresponding to the key phrase have word reliability factors above a predetermined threshold value. Thus, the interactive device determines a response action more accurately and more precisely than conventionally.
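The check on word reliability factors can be sketched as follows. The sketch assumes that each word candidate carries a reliability factor between 0 and 1 and that the key phrase words appear contiguously; the function name, the value range, and the threshold values are illustrative assumptions, and the source does not state how the reliability factor is calculated.

```python
from typing import Sequence, Tuple


def key_phrase_is_reliable(
    candidates: Sequence[Tuple[str, float]],
    key_phrase: Sequence[str],
    threshold: float,
) -> bool:
    """Return True if the key phrase occurs and every matching word
    candidate has a reliability factor above the threshold."""
    target = list(key_phrase)
    n = len(target)
    if n == 0:
        return False
    for i in range(len(candidates) - n + 1):
        window = candidates[i:i + n]
        if [word for word, _ in window] == target:
            if all(score > threshold for _, score in window):
                return True
    return False


# Word candidates paired with hypothetical reliability factors.
search_result = [("please", 0.9), ("bring", 0.8), ("water", 0.4)]
accepted = key_phrase_is_reliable(search_result, ["bring", "water"], 0.6)
```

In this example the key phrase is present, but "water" falls below the threshold of 0.6, so no response action would be determined.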
Yet further, preferably, the interactive device according to the present invention has a construction that it further comprises:
a response action storing means that stores, in relation with each other, the key phrase, the response action corresponding to the key phrase, and a response action category serving as a category of the response action, and
a response action history storing means that stores a history of response actions already determined by the response action determining means,
wherein, when the search result by the second-path search unit includes the key phrase, the response action determining means judges, by referring to the response action storing means and the response action history storing means, whether or not a response action category of a response action determined currently by the response action determining means and a response action category of a response action determined previously by the response action determining means are the same, and determines, when both categories are the same, the response action corresponding to the key phrase.
In the interactive device configured as described above, the response action determining means determines a response action only when the currently determined response action belongs to the same category as the previously determined response action. Consequently, the interactive device is able to prevent determination of a response action that is based on a wrong search result produced by an error in a search by the second-path searching unit and that is totally unrelated to the previously determined response action.
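The category check against the response action history can be sketched as follows. The table contents, the action names, and the category names are hypothetical. Accepting the first action when the history is empty is an assumption, since the source does not describe that case.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical response action store: key phrase -> (action, category).
ACTION_TABLE: Dict[Tuple[str, ...], Tuple[str, str]] = {
    ("bring", "water"): ("fetch_water", "fetch"),
    ("bring", "tea"): ("fetch_tea", "fetch"),
    ("good", "morning"): ("greet", "greeting"),
}


def determine_action(
    key_phrase: Tuple[str, ...], history: List[Tuple[str, str]]
) -> Optional[str]:
    """Determine a response action only if its category matches the
    category of the previously determined action.

    `history` holds (action, category) pairs in order of determination
    and is extended when an action is accepted.
    """
    entry = ACTION_TABLE.get(key_phrase)
    if entry is None:
        return None
    action, category = entry
    # Assumption: with no previous action, the first one is accepted.
    if history and history[-1][1] != category:
        return None
    history.append((action, category))
    return action


history: List[Tuple[str, str]] = []
first = determine_action(("bring", "water"), history)
second = determine_action(("good", "morning"), history)
```

Here the second key phrase belongs to a different category from the first, so it is treated as a likely recognition error and no action is determined for it.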
Yet further, preferably, the interactive device according to the present invention has a construction that, when a response action determined according to a last search result by the second-path search unit and a response action determined according to a previous search result by the second-path search unit are different, the response action executing means executes the response action determined according to the last search result.
In the interactive device configured as described above, the response action executing means executes the response action according to the final second-path search result by the second-path searching unit. Thus, the interactive device is able to prevent a wrong response action from being executed even if the response action according to the second-path search result by the second-path searching unit is produced by an error.
Yet further, preferably, the interactive device according to the present invention has a construction that, when a last search is performed by the second-path search unit after the start of an execution of a response action determined by the response action determining means and when a result of the last search and a result of a previous search corresponding to the response action currently being executed are different, the response action executing means cancels the response action currently being executed and executes a predetermined response action for correcting the response action currently being executed and then executes a response action determined by the response action determining means according to the last search result by the second-path search unit.
In the interactive device configured as described above, when the second-path searching unit produces an erroneous second-path search result halfway through the user's utterance duration and a response action has already been determined and executed according to the erroneous result, the response action can be corrected, and the final second-path search result obtained by the second-path searching unit at the terminal end of the utterance duration can be adopted to execute a response action.
According to the interactive device of the present invention, the user's utterance duration is divided by a predetermined length of time into a plurality of recognition sections, and continuous voice recognition is performed with respect to each of the plurality of recognition sections. This ensures a proper response speed required for continuous recognition of the plurality of recognition sections, while preventing an excessive prolongation of each recognition section and thus preventing a reduction in the recognition rate.
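The cancel-and-correct behavior can be sketched as a function that returns the ordered steps to carry out once the last search result is available. As a simplification, the sketch compares the determined response actions rather than the underlying search results. The action names and the default correcting action `apologize` are hypothetical.

```python
from typing import List, Optional


def steps_on_last_result(
    executing: Optional[str], last_action: str, correction: str = "apologize"
) -> List[str]:
    """Return the steps to carry out after the last second-path search.

    executing   -- action already being executed from an earlier partial
                   result, or None if nothing has started yet
    last_action -- action determined from the last search result
    correction  -- predetermined action that corrects a wrong response
    """
    if executing is None:
        # Nothing started yet: simply execute the last determined action.
        return [last_action]
    if executing == last_action:
        # The earlier partial result was consistent: keep going.
        return []
    # The earlier result was wrong: cancel, correct, then execute.
    return ["cancel:" + executing, correction, last_action]


steps = steps_on_last_result("fetch_tea", "fetch_water")
```

The returned list keeps the order stated in the text: the action in progress is cancelled first, the correcting action follows, and the action based on the last search result is executed last.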