The present disclosure relates to an information processing apparatus, an information processing method, and a program. Specifically, the present disclosure relates to an information processing apparatus which performs a speech recognition process and a speech comprehension process estimating the intention of an utterance, and to an information processing method as well as a program.
In recent years, various products and services to which speech recognition is applied have been widely used. Speech recognition is a technique of analyzing speech signals input through a speech input portion such as a microphone and automatically determining a word group corresponding to the input speech signals. By combining the speech recognition technique and various applications, various products and services performing data processing based on the result of the speech recognition are realized.
A basic configuration of the speech recognition process will be described with reference to FIG. 1. A speech 11 input by a user is captured by a microphone 12, and an AD converter 13 samples the analog signals of the speech, thereby generating digital data. The digital data is input to a characteristic extraction portion 14, and through a frequency analysis or the like which is performed at proper time intervals, the data is converted into parameters showing a spectrum or other acoustic characteristics of the speech.
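As a minimal sketch of the frequency analysis performed at fixed time intervals, the following splits digitized samples into frames and computes a magnitude spectrum for each frame. The function names, the frame length, the hop size, and the choice of a plain discrete Fourier transform are all assumptions made for illustration; the disclosure does not specify the analysis used by the characteristic extraction portion 14.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of the discrete Fourier transform for bins 0..N/2 of one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def extract_features(samples, frame_len=8, hop=8):
    """Return one magnitude spectrum per frame, i.e. a time series of features."""
    return [
        magnitude_spectrum(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


# Hypothetical digitized input: a sinusoid completing 2 cycles per 8-sample frame.
samples = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
features = extract_features(samples)
```

Each element of `features` corresponds to one time interval, and the energy of this test signal concentrates in frequency bin 2 of every frame.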
By the process of the characteristic extraction portion 14, a time series of the characteristic amount of the speech is obtained. The characteristic amount group is sent to a matching portion 15. The matching portion 15 matches the respective information of acoustic model data 16, dictionary data 17, and grammar data 18 with the input parameters, and outputs a speech recognition result 19.
Furthermore, in the characteristic extraction portion 14, in addition to the extraction of the characteristic amount group, a speech section is determined. The speech section corresponds to a section from the start time to the end time of an utterance. As a method of detecting the speech section, for example, a method of extracting only a section of an utterance based on the power or the like of a speech signal is used. The matching portion 15 performs a matching process with respect to the characteristic amount group corresponding to the speech section, thereby outputting the speech recognition result 19 for each utterance of the user.
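The power-based detection of a speech section mentioned above can be sketched as follows. This is only an illustrative assumption of how such a detector might work: the frame length, the threshold, and the function names are hypothetical, and a practical detector would typically add smoothing and hangover logic.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def detect_speech_sections(samples, frame_len=4, threshold=0.01):
    """Return (start_sample, end_sample) pairs whose frame power exceeds threshold."""
    sections = []
    start = None
    for i, power in enumerate(frame_power(samples, frame_len)):
        if power > threshold and start is None:
            start = i * frame_len          # utterance begins at this frame
        elif power <= threshold and start is not None:
            sections.append((start, i * frame_len))  # utterance ended before this frame
            start = None
    if start is not None:
        # The signal ended while still inside an utterance.
        n_frames = len(samples) // frame_len
        sections.append((start, n_frames * frame_len))
    return sections


# Hypothetical signal: silence, a burst of speech-like energy, then silence.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
sections = detect_speech_sections(signal)
```

Each returned pair marks the start and end of one utterance, which is the span that would be handed to the matching portion 15.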
The acoustic model data 16 is a model holding the acoustic characteristics of units such as the individual phonemes and syllables used in the language to be handled, for example, Japanese or English. As this model, a Hidden Markov Model (HMM) or the like is used.
The dictionary data 17 is data holding information on the pronunciation of individual words to be recognized. By the data, words are associated with the acoustic model described above, and as a result, a standard acoustic pattern corresponding to individual words included in a dictionary is obtained.
The grammar data 18 is data describing the ways in which the individual words in the dictionary can be catenated to each other. For the grammar data, a description based on a formal grammar or a context-free grammar, a statistical grammar (N-gram) that includes the probability of word catenation, or the like is used.
In the matching portion 15, by using the acoustic model data 16, the dictionary data 17, and the grammar data 18, the most suitable word group for the input characteristic amount group is determined. For example, when the Hidden Markov Model (HMM) is used as the acoustic model data 16, a value which is obtained by accumulating a probability of the emergence of each characteristic amount according to the characteristic amount group is used as an acoustic evaluation value (hereinafter, referred to as an acoustic score). This acoustic score is determined for each word by using the standard pattern described above.
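One way to picture the accumulation of emission probabilities into an acoustic score is the following sketch, which computes the best-path (Viterbi) log probability of an observation sequence under a small discrete HMM. The two-state model, its probabilities, and the symbols "a" and "b" are hypothetical, and the use of Viterbi scoring in the log domain is an assumption; the disclosure does not state which accumulation scheme is used.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_log_score(observations, start_p, trans_p, emit_p):
    """Best-path log probability of a non-empty observation sequence."""
    if not observations:
        raise ValueError("observations must not be empty")
    n_states = len(start_p)
    scores = [
        safe_log(start_p[s]) + safe_log(emit_p[s][observations[0]])
        for s in range(n_states)
    ]
    for obs in observations[1:]:
        # For each state, keep the best predecessor and add the emission term.
        scores = [
            max(scores[p] + safe_log(trans_p[p][s]) for p in range(n_states))
            + safe_log(emit_p[s][obs])
            for s in range(n_states)
        ]
    return max(scores)


# Hypothetical left-to-right word model with two states.
start_p = [1.0, 0.0]
trans_p = [[0.5, 0.5], [0.0, 1.0]]
emit_p = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]

matching = viterbi_log_score(["a", "b"], start_p, trans_p, emit_p)
mismatching = viterbi_log_score(["b", "a"], start_p, trans_p, emit_p)
```

An observation sequence that follows the pattern expected by the word model (`matching`) receives a higher acoustic score than one that does not (`mismatching`).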
For example, when a bigram is used as the grammar data 18, the linguistic probability of each word is converted into a numerical value based on the probability that the word is catenated to the immediately preceding word, and the value is provided as a linguistic evaluation value (hereinafter, referred to as a linguistic score). Thereafter, the acoustic score and the linguistic score are evaluated comprehensively, whereby the most suitable word group for the input speech signal is determined.
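The comprehensive evaluation of the two scores can be sketched as below. The sketch assumes that both scores are log probabilities and that they are combined by a weighted sum; the bigram table, the acoustic values, the `<s>` start marker, the probability floor, and the weight `lm_weight` are all hypothetical and chosen only for illustration.

```python
import math


def linguistic_score(words, bigram_prob, floor=1e-6):
    """Sum of log P(word | previous word); unseen pairs get a small floor value."""
    score = 0.0
    prev = "<s>"  # assumed sentence-start marker
    for word in words:
        score += math.log(bigram_prob.get((prev, word), floor))
        prev = word
    return score


def total_score(words, acoustic_log_scores, bigram_prob, lm_weight=1.0):
    """Weighted sum of the per-word acoustic scores and the linguistic score."""
    return sum(acoustic_log_scores) + lm_weight * linguistic_score(words, bigram_prob)


def best_word_group(hypotheses, bigram_prob, lm_weight=1.0):
    """hypotheses: list of (words, acoustic_log_scores). Return the best words."""
    best = max(hypotheses, key=lambda h: total_score(h[0], h[1], bigram_prob, lm_weight))
    return best[0]


# Hypothetical bigram probabilities and acoustic scores.
bigram_prob = {
    ("<s>", "the"): 0.5,
    ("the", "weather"): 0.2,
    ("the", "whether"): 0.001,
    ("weather", "is"): 0.3,
    ("whether", "is"): 0.01,
    ("is", "nice"): 0.1,
}
hypotheses = [
    (["the", "weather", "is", "nice"], [-1.0, -2.0, -1.0, -1.5]),
    (["the", "whether", "is", "nice"], [-1.0, -1.9, -1.0, -1.5]),
]
result = best_word_group(hypotheses, bigram_prob)
```

Here the second hypothesis has a slightly better acoustic score, but its much lower linguistic score causes the first hypothesis to be selected.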
For example, when a user says “The weather is nice today”, a word group including “The”, “weather”, “is”, “nice”, and “today” is obtained as a recognition result. At this time, an acoustic score and a linguistic score are provided to each word. Furthermore, in the present disclosure, a combination of the dictionary data 17 and the grammar data 18 as described above is referred to as a linguistic model.
When the speech recognition technique is applied to products and services, the following two methods are widely used.
(a) A method of directly associating a recognized word group with the corresponding behavior.
(b) A method of extracting the intention of the user included in the utterance from a recognized word group and associating the intention with the corresponding behavior.
For example, when an utterance “stand up” is given to a robot, a method of causing the robot to stand up in response to the recognized word group “stand up” is the former (a) method, that is, the method of directly associating the words with the corresponding behavior.
On the other hand, a method of estimating the intention (for example, intention of “stand up please”) included in each utterance such as “stand up”, “wake up”, and “get up”, and causing the robot to act in response to the intention is the latter (b) method. That is, this is a method of extracting the user's intention included in the utterance and associating a corresponding behavior with the intention.
In general, since there is a plurality of types of utterance including the same intention, compared to the former (a) method of directly assigning a corresponding behavior to the recognized word group, the latter (b) method of estimating the intention of the utterance and assigning a corresponding behavior to the intention can more easily assign the behavior. In this manner, an apparatus estimating the intention of an utterance from input speech signals is called a speech comprehension apparatus.
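The difference between the two methods can be illustrated with the following sketch. The table contents and the behavior label "STAND" are hypothetical, and the lookup table from word group to intention merely stands in for an actual intention estimator.

```python
# Method (a): each recognized word group is tied directly to a behavior.
DIRECT_BEHAVIOR = {"stand up": "STAND"}

# Method (b): several word groups share one intention, and the behavior
# is attached to the intention rather than to each word group.
UTTERANCE_TO_INTENTION = {
    "stand up": "stand up please",
    "wake up": "stand up please",
    "get up": "stand up please",
}
INTENTION_TO_BEHAVIOR = {"stand up please": "STAND"}


def behavior_direct(word_group):
    """Method (a): returns None unless this exact word group was registered."""
    return DIRECT_BEHAVIOR.get(word_group)


def behavior_via_intention(word_group):
    """Method (b): map the word group to an intention, then to a behavior."""
    intention = UTTERANCE_TO_INTENTION.get(word_group)
    return INTENTION_TO_BEHAVIOR.get(intention)
```

With method (a), supporting “wake up” and “get up” would require registering a behavior for each word group, whereas with method (b) a single behavior assigned to the intention covers all three utterances.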
As a technique in the related art describing a method of estimating the user's intention included in an utterance, there is, for example, Japanese Unexamined Patent Application Publication No. 2006-53203, “SPEECH PROCESSING DEVICE AND METHOD, RECORDING MEDIUM AND PROGRAM”.
Japanese Unexamined Patent Application Publication No. 2006-53203 describes a technique of estimating intention based on input speech signals. (Although intention is referred to as “will” in that publication, the “will” will be referred to as “intention”, which has the same meaning, in the following description as long as this does not cause confusion.) In that publication, a word group configured based on grammar rules and a word dictionary is prepared for each piece of intention information showing an intention, for example, “stand up please”. Acoustic score calculation means showing the acoustic similarity between the word group and the input speech signals, and linguistic score calculation means showing the linguistic similarity, are provided. The intention information showing the intention corresponding to the input speech signals is then selected from a plurality of types of intention information based on the acoustic and linguistic scores calculated for each piece of intention information, whereby the intention is estimated.
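The selection step described above can be sketched as follows. The sketch assumes that an acoustic score and a linguistic score have already been calculated for each piece of intention information and that the two are simply added; the score values and the function name are hypothetical and are not taken from the cited publication.

```python
def select_intention(scores):
    """scores: dict mapping intention -> (acoustic_score, linguistic_score).

    Returns the intention whose summed score is highest.
    """
    if not scores:
        raise ValueError("at least one piece of intention information is required")
    return max(scores, key=lambda intention: sum(scores[intention]))


# Hypothetical log-domain scores for one input utterance.
scores = {
    "stand up please": (-12.0, -3.0),
    "sit down please": (-15.5, -3.2),
    "come here please": (-18.0, -2.9),
}
estimated = select_intention(scores)
```

Because one pair of scores is needed for every piece of intention information, the number of score calculations in such a scheme grows with the number of intentions to be distinguished.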
However, generally, as the total amount of the intention information increases, the accuracy of estimating intention with respect to the input speech decreases, and the calculation amount increases.
As a specific example, if the information processing apparatus that processes information based on speech recognition is a television with recording and playback functions, a user can make a plurality of different requests (intentions) of the television, such as “please change the channel”, “please turn the volume up”, “please record”, “please play”, “please play with fast forward”, and “please play slowly”.
In an apparatus that is likely to receive such a variety of requests, when the acoustic score calculation means, which shows the acoustic similarity between a word group and the speech signals, and the linguistic score calculation means, which shows the linguistic similarity, are applied to select the intention information showing the intention corresponding to the input speech signals from a plurality of types of intention information, the calculation amount necessary for the process increases and the accuracy of intention estimation decreases.