In recent years, techniques that accept spoken human language as voice input and perform an operation by using the recognition result have attracted attention. These techniques are used as voice interfaces in mobile phones, car navigation systems, and the like. The basic method is to define in advance a correspondence between the voice recognition results that the system expects and operations, and to perform an operation when a voice recognition result matches an expected one. Because this method lets the user perform an operation directly by speaking, it works effectively as a shortcut function compared with a conventional manual operation. On the other hand, because the user must utter the words for which the system is waiting in order to perform an operation, the number of words which the user must memorize increases as the number of functions handled by the system increases. A further problem is that, in general, few users read the instruction manual thoroughly before using the system. As a result, the user often does not know what to say in order to perform many of the operations, and in practice can operate only a limited set of functions by voice.
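The basic correspondence method described above can be pictured as an exact-match lookup. The following is a minimal sketch only; the phrases, the operation names, and the `dispatch` helper are hypothetical and are not taken from any actual system.

```python
from typing import Optional

# Hypothetical table of expected recognition results and their operations.
COMMANDS = {
    "go home": "set_destination_home",
    "radio on": "turn_radio_on",
}


def dispatch(recognition_result: str) -> Optional[str]:
    """Return the operation for an expected phrase, or None if it is not expected."""
    return COMMANDS.get(recognition_result.strip().lower())


expected = dispatch("Go home")         # phrase the system is waiting for
unexpected = dispatch("take me home")  # same wish, but not a registered phrase
```

In this sketch the second call returns `None`, which illustrates why the user has to memorize the registered phrases.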
As a solution to this problem, a method has been disclosed in which, instead of connecting a voice recognition result directly with an operation, the user's intention is understood from the user's utterance and an operation is then performed. In one example implementation, a correspondence between uttered example sentences collected in advance and operations is defined (referred to as learned data from here on), the operations which the user desires (referred to as intentions from here on) are modeled from the user's words by using a statistical learning method, and an intention is estimated for a user input by using this model (referred to as statistical intention estimation from here on). In the concrete process of statistical intention estimation, the terms used for learning are first extracted from the uttered example sentences of the learned data. Then, each term set and its correct intention are taken as input learned data, the weight between each term and the correct intention is learned according to a statistical learning algorithm, and a model is output.
The terms used for learning are typically words and word strings extracted from data acquired by carrying out a morphological analysis on the uttered example sentences. For example, from the uttered example sentence “OOeki ni ikitai” (“Drive to OO station”), the following morphological analysis result is acquired: “OOeki (proper noun, facility) / ni (particle) / iki (verb, continuative form) / tai (auxiliary verb)”. From this result, single-morpheme terms such as “$facility$, iku” are extracted (a facility expressed by a proper noun is converted into the special symbol $facility$, and a verb is converted into its base form), together with two-contiguous-morpheme terms such as “$facility$_ni, ni_iki, iki_tai”.
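The term extraction in this example can be sketched as follows. This is an illustration under stated assumptions: the `(surface, base form, part of speech)` tuple layout and the `extract_terms` function are hypothetical, the morphological analysis result is hand-written rather than produced by an analyzer, and the rule that only facilities and verbs become single terms is inferred from the one example in the text.

```python
from typing import List, Tuple

# Hypothetical morpheme record: (surface form, base form, part of speech).
Morpheme = Tuple[str, str, str]

FACILITY_POS = "proper noun, facility"


def symbol(morpheme: Morpheme) -> str:
    """Surface form, with facility proper nouns replaced by $facility$."""
    surface, _base, pos = morpheme
    return "$facility$" if pos == FACILITY_POS else surface


def extract_terms(morphemes: List[Morpheme]) -> List[str]:
    """Return single terms followed by two-contiguous-morpheme terms."""
    single_terms = []
    for _surface, base, pos in morphemes:
        if pos == FACILITY_POS:
            single_terms.append("$facility$")
        elif pos.startswith("verb"):
            single_terms.append(base)  # verbs are reduced to their base form
    # Assumption: every adjacent pair of morphemes yields one joined term.
    pair_terms = [f"{symbol(a)}_{symbol(b)}" for a, b in zip(morphemes, morphemes[1:])]
    return single_terms + pair_terms


# Hand-written analysis of "OOeki ni ikitai", following the text.
analysis = [
    ("OOeki", "OOeki", "proper noun, facility"),
    ("ni", "ni", "particle"),
    ("iki", "iku", "verb, continuative form"),
    ("tai", "tai", "auxiliary verb"),
]
terms = extract_terms(analysis)
# terms == ["$facility$", "iku", "$facility$_ni", "ni_iki", "iki_tai"]
```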
As a result, for the terms “$facility$, iku, $facility$_ni, ni_iki, iki_tai”, a correct intention expressed as “destination_setting[destination=$facility$]” (the main intention is a destination setting, and the destination to be set is $facility$) is assigned, and a model is generated on the basis of learned data which consist of term sequences generated from a large volume of utterance data and their correct intentions. A machine learning algorithm is used to generate the model. The algorithm learns the weight between each input term and each correct intention in such a way that the correct intention is output for as many items of the learned data as possible. Therefore, a model is acquired which has a high possibility of outputting the correct intention for a term set acquired from an utterance similar to the learned data. As this machine learning method, for example, a maximum entropy method can be used.
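As an illustration of learning a weight between each term and each intention, the following sketch trains a small maximum entropy (multinomial logistic regression) model by stochastic gradient ascent on the log-likelihood. It is a toy under stated assumptions, not the implementation of any particular system: the function names, the learning rate, the number of epochs, the absence of regularization, and the second training example with its `facility_search` intention are all hypothetical.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Example = Tuple[List[str], str]                # (term set, correct intention)
Weights = DefaultDict[Tuple[str, str], float]  # weight per (term, intention)


def intention_probabilities(weights: Weights, intentions: List[str],
                            terms: List[str]) -> Dict[str, float]:
    """Softmax over the summed term weights of each intention."""
    scores = {y: sum(weights[(t, y)] for t in terms) for y in intentions}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def train(data: List[Example], epochs: int = 200,
          learning_rate: float = 0.5) -> Tuple[Weights, List[str]]:
    """Learn term-intention weights that raise the probability of correct intentions."""
    intentions = sorted({intention for _terms, intention in data})
    weights: Weights = defaultdict(float)
    for _ in range(epochs):
        for terms, correct in data:
            probs = intention_probabilities(weights, intentions, terms)
            for y in intentions:
                # Gradient of the log-likelihood: observed minus expected count.
                gradient = (1.0 if y == correct else 0.0) - probs[y]
                for t in terms:
                    weights[(t, y)] += learning_rate * gradient
    return weights, intentions


# Toy learned data: the first item follows the text, the second is hypothetical.
learned_data = [
    (["$facility$", "iku", "$facility$_ni", "ni_iki", "iki_tai"],
     "destination_setting[destination=$facility$]"),
    (["$facility$", "sagasu", "$facility$_wo", "wo_sagashi", "sagashi_te"],
     "facility_search[facility=$facility$]"),
]
weights, intentions = train(learned_data)
probs = intention_probabilities(weights, intentions, learned_data[0][0])
estimated = max(probs, key=probs.get)
```

After training on this toy data, `estimated` is the destination-setting intention for the first term set. The shared term `$facility$` ends up with little discriminating weight, while terms such as `iku` carry most of the evidence.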
By estimating the intention corresponding to the user's input with a model generated according to such a machine learning algorithm, the user's operation intention can be estimated flexibly even for an input that was not assumed in advance. The intention can therefore be understood appropriately and the operation performed even when the user does not remember the prescribed wording. On the other hand, accepting such free input improves the flexibility of the system and, at the same time, makes it more likely that users will produce an even wider variety of utterances.
The various utterances to be expected fall roughly into the following two groups.
(a) Inputs that use a still wider variety of words for a single operation.
(b) Inputs that request a plurality of operations together as a batch.
In case (a) above, the various utterances can be handled by further increasing the learned data. In contrast, in case (b), because each item of learned data is associated with a single intention from the outset, appropriate intentions cannot be combined when a request includes a plurality of intentions.
To solve this problem, patent reference 1 discloses a speaking intention recognition device that determines an appropriate sequence of intentions for an input including one or more intentions by using a model which has been learned on single intentions. This speaking intention recognition device prepares in advance, as learned data, morpheme strings which serve as separators between intentions in the input morphemes, and estimates the splitting points at which the input can be split in the same manner as the intention estimation described above. It then multiplies the probability of splitting the input at each splitting point by the intention probability of each split element to estimate the maximum-likelihood intention sequence.
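The scoring described for patent reference 1 can be pictured with the following sketch. It is not the implementation of that device. It assumes that candidate splitting points and their probabilities are already given, that an `estimate` function returns the best single intention and its probability for a segment, and that candidate points which are not used contribute no factor to the score. All names and numbers are hypothetical, and the exhaustive enumeration is practical only for a handful of candidate points.

```python
from itertools import combinations
from math import prod
from typing import Callable, Dict, List, Sequence, Tuple

Estimate = Callable[[Sequence[str]], Tuple[str, float]]


def best_intention_sequence(morphemes: Sequence[str],
                            split_prob: Dict[int, float],
                            estimate: Estimate) -> Tuple[List[str], float]:
    """Try every subset of candidate splitting points and keep the best score.

    split_prob maps an index i to the probability of splitting before morphemes[i].
    """
    candidates = sorted(split_prob)
    best_sequence: List[str] = []
    best_score = 0.0
    for count in range(len(candidates) + 1):
        for points in combinations(candidates, count):
            bounds = [0, *points, len(morphemes)]
            # Product of the splitting probabilities at the chosen points.
            score = prod(split_prob[p] for p in points)
            sequence = []
            for start, end in zip(bounds, bounds[1:]):
                intention, probability = estimate(morphemes[start:end])
                score *= probability  # intention probability of each split element
                sequence.append(intention)
            if score > best_score:
                best_sequence, best_score = sequence, score
    return best_sequence, best_score


# Hypothetical single-intention estimator backed by a lookup table.
TOY_MODEL = {
    ("go", "station"): ("destination_setting", 0.9),
    ("play", "radio"): ("radio_play", 0.8),
    ("go", "station", "play", "radio"): ("destination_setting", 0.3),
}


def toy_estimate(segment: Sequence[str]) -> Tuple[str, float]:
    return TOY_MODEL.get(tuple(segment), ("unknown", 0.05))


sequence, score = best_intention_sequence(
    ["go", "station", "play", "radio"], {1: 0.1, 2: 0.7}, toy_estimate)
```

With these toy numbers, splitting before the third morpheme scores 0.7 × 0.9 × 0.8, which beats the unsplit reading at 0.3, so `sequence` holds the two intentions in order.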