1. Field of the Invention
The present invention relates to a voice processing device, a voice processing method, and a program, and more particularly, to a voice processing device, a voice processing method, and a program which are capable of reliably estimating the correct intention from an input voice.
2. Description of the Related Art
In recent years, a variety of products and services to which voice recognition is applied have been developed. Voice recognition refers to a technique for recognizing a word sequence corresponding to an input voice, using the appearance probability or the like of a feature amount indicating acoustic features.
FIG. 1 is a block diagram illustrating a configuration example of a voice recognition device in the related art using the voice recognition.
A voice recognition device 1 in FIG. 1 includes an input section 21, an AD converting section 22, a feature extraction section 23, a matching section 24, an acoustic model database 25, a dictionary database 26 and a grammar database 27.
A voice based on an utterance of a user is input to the input section 21 which includes a microphone or the like. The input section 21 converts the input voice into a voice signal which is an analog electric signal for output.
The AD converting section 22 converts the analog input voice signal which is output from the input section 21 into a digital input voice signal for output, through sampling and quantization.
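The sampling and quantization performed by the AD converting section 22 can be illustrated with a minimal sketch. The function name, the 16 kHz sampling rate, the 16-bit depth and the 440 Hz test tone are assumptions chosen for illustration; the related-art device does not specify them.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, bits):
    """Sample a continuous signal (a callable t -> amplitude in [-1, 1]) and
    quantize each sample to a signed integer of the given bit depth.
    Hypothetical helper; parameters are illustrative assumptions."""
    max_level = 2 ** (bits - 1) - 1              # e.g. 32767 for 16 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # sampling: discrete time instants
        x = max(-1.0, min(1.0, signal(t)))       # clip to the converter's input range
        samples.append(round(x * max_level))     # quantization: discrete amplitude levels
    return samples

def tone(t):
    """A 440 Hz sine wave standing in for the analog voice signal."""
    return math.sin(2 * math.pi * 440.0 * t)

pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=16000, bits=16)
```

Here `pcm` plays the role of the digital input voice signal that is passed on to the feature extraction section 23.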
The feature extraction section 23 frequency-analyzes the input voice signal which is output from the AD converting section 22 at an appropriate time interval, to thereby extract parameters indicating a spectrum or other acoustic features of the voice. The parameters extracted in this way correspond to a feature amount of the input voice signal. A time sequence of the feature amount of the input voice signal (hereinafter, referred to as a feature amount sequence) is output from the feature extraction section 23.
The feature extraction section 23 extracts the feature amount sequence of the input voice signal in this way, and determines a voice zone of the input voice signal. The voice zone represents a zone ranging from a starting time of the utterance to an ending time thereof.
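As a rough sketch of these two roles, the following example computes one feature per frame and then determines a voice zone. It assumes, purely for illustration, that the feature is the log energy of each frame and that the voice zone is found by a fixed energy threshold; neither the actual feature amount nor the voice zone criterion is specified above, and all names and numbers are hypothetical.

```python
import math

def frame_log_energies(samples, frame_len, hop):
    """Return one log-energy value per frame: a one-dimensional stand-in
    for the feature amount sequence (assumption, not the actual feature)."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1e-10))   # small offset avoids log(0)
    return feats

def voice_zone(feats, threshold):
    """Return (first_frame, last_frame) whose feature exceeds the threshold,
    or None when no frame does. Thresholding is an assumed criterion."""
    active = [i for i, f in enumerate(feats) if f > threshold]
    if not active:
        return None
    return active[0], active[-1]

# Synthetic input: silence, a sinusoid standing in for speech, then silence.
silence = [0.0] * 400
speech = [math.sin(0.3 * n) for n in range(800)]
signal = silence + speech + silence

feats = frame_log_energies(signal, frame_len=200, hop=200)
zone = voice_zone(feats, threshold=-5.0)
```

With this synthetic input, `zone` spans the frames that contain the sinusoid, corresponding to the interval from the starting time to the ending time of the utterance.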
The matching section 24 determines a word sequence which is the most compatible with the feature amount sequence extracted by the feature extraction section 23, and outputs the determined word sequence as a voice recognition result. Hereinafter, the process thus performed by the matching section 24 is referred to as a matching process. The matching section 24 performs the matching process with respect to each voice zone which is determined by the feature extraction section 23, and thereby sequentially outputs the voice recognition results for all the voice zones.
In this respect, when performing the matching process, the matching section 24 uses the acoustic model database 25, the dictionary database 26 and the grammar database 27.
The acoustic model database 25 records therein an acoustic model indicating an acoustic feature for each predetermined unit such as an individual phoneme or a syllable in a language of the voice which is a recognition target. As the acoustic model, for example, an HMM (Hidden Markov Model) can be employed.
The dictionary database 26 records therein a dictionary which describes information (hereinafter, referred to as pronunciation information) about the pronunciation of each word of the voice which is the recognition target. Through this pronunciation information, each word and the acoustic model are related to each other. As a result, an acoustic standard pattern is obtained corresponding to each word which is recorded in the dictionary database 26.
The grammar database 27 records therein a grammar rule which describes how respective words recorded in the dictionary database 26 can be concatenated. As the grammar rule, for example, a regular grammar, a context-free grammar, or an N-gram grammar including a statistical word concatenation probability can be employed.
For example, in a case where the HMM is employed as the acoustic model in the acoustic model database 25, the matching section 24 accumulates the appearance probability of the feature amount according to the feature amount sequence which is extracted by the feature extraction section 23. That is, by accumulating the appearance probability of the feature amount for each word using the above-described standard pattern, an acoustic evaluation value (hereinafter, referred to as an acoustic score) is calculated for each word.
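One way such an accumulation could be carried out is sketched below. The sketch assumes a small HMM with discrete emission symbols and uses the Viterbi (best-path) log-probability as the acoustic score; the description above does not state which accumulation is used, and the model parameters and symbols are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Logarithm that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF

def acoustic_score(observations, initial, transitions, emissions):
    """Viterbi log-probability of the best state path through one word's HMM.
    Appearance probabilities are accumulated as a sum of logarithms."""
    n_states = len(initial)
    score = [log(initial[s]) + log(emissions[s][observations[0]])
             for s in range(n_states)]
    for obs in observations[1:]:
        score = [
            max(score[p] + log(transitions[p][s]) for p in range(n_states))
            + log(emissions[s][obs])
            for s in range(n_states)
        ]
    return max(score)

# Hypothetical two-state left-to-right model standing in for a word's standard pattern.
initial = [1.0, 0.0]
transitions = [[0.5, 0.5],
               [0.0, 1.0]]
emissions = [{"a": 0.9, "b": 0.1},
             {"a": 0.2, "b": 0.8}]

matching = acoustic_score(["a", "a", "b", "b"], initial, transitions, emissions)
mismatching = acoustic_score(["b", "b", "a", "a"], initial, transitions, emissions)
```

A feature amount sequence that fits the word's standard pattern (`matching`) receives a higher acoustic score than one that does not (`mismatching`).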
Further, for example, in a case where a bigram is employed as the grammar rule in the grammar database 27, the matching section 24 calculates the linguistic possibility of each word on the basis of the concatenation probability with respect to the preceding word. This linguistic possibility of each word is quantified as a linguistic evaluation value (hereinafter, referred to as a language score).
The matching section 24 determines a word sequence which is the most compatible with the input voice supplied to the input section 21, on the basis of a final evaluation value (hereinafter, referred to as a total score) which is obtained by comprehensively evaluating the acoustic score and the language score with respect to each word. The determined word sequence is output as a voice recognition result.
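The combination of the two scores can be sketched as follows. The sketch assumes that the language score is the sum of log bigram probabilities and that the total score is a weighted sum of the acoustic score and the language score; the exact evaluation is not specified above. The bigram probabilities, the acoustic scores, the weight `lm_weight` and the competing word "DENKI" are hypothetical values introduced only for illustration.

```python
import math

def language_score(words, bigram_prob, start="<s>", floor=1e-6):
    """Sum of log concatenation probabilities of each word given the preceding word.
    Unseen bigrams receive a small floor probability (an assumption)."""
    total = 0.0
    prev = start
    for word in words:
        total += math.log(bigram_prob.get((prev, word), floor))
        prev = word
    return total

def total_score(acoustic, language, lm_weight=1.0):
    """Assumed combination: weighted sum of the two log-domain scores."""
    return acoustic + lm_weight * language

def best_word_sequence(candidates, bigram_prob, lm_weight=1.0):
    """candidates: list of (word_sequence, acoustic_score) pairs."""
    best = max(
        candidates,
        key=lambda c: total_score(c[1], language_score(c[0], bigram_prob), lm_weight),
    )
    return best[0]

# Hypothetical bigram probabilities.
bigram_prob = {
    ("<s>", "KYO"): 0.1, ("KYO", "WA"): 0.5, ("WA", "II"): 0.1,
    ("II", "TENKI"): 0.2, ("TENKI", "DESUNE"): 0.3,
    ("II", "DENKI"): 0.001, ("DENKI", "DESUNE"): 0.05,
}
# Two acoustically similar hypotheses with hypothetical acoustic scores.
candidates = [
    (["KYO", "WA", "II", "TENKI", "DESUNE"], -42.0),
    (["KYO", "WA", "II", "DENKI", "DESUNE"], -41.5),
]
result = best_word_sequence(candidates, bigram_prob)
```

In this example the second hypothesis has a slightly better acoustic score, but the language score favors the first, so the first is selected as the word sequence with the best total score.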
For example, in a case where a user makes an utterance “KYO-WA-II-TENKI-DESUNE (It is nice weather today)”, a word sequence of “KYO”, “WA”, “II”, “TENKI” and “DESUNE” is output as the voice recognition result. When such a word sequence is determined, as described above, the acoustic score and the language score are given to each word.
In a case where such a voice recognition device is applied to a robot, an operation of the robot should be related to the word sequence which is recognized according to the voice recognition. As techniques for realizing this relation, there are the following first and second techniques.
The first technique is a technique in which a word sequence is recognized according to the voice recognition and a corresponding operation is directly related to the recognized word sequence. For example, in a case where a user makes an utterance “TATTE (Stand up)”, the robot can be controlled so as to perform an operation corresponding to the word sequence “TATTE” which is recognized according to the voice recognition, that is, controlled to stand up.
The second technique is a technique in which a user's intention implied in the utterance is extracted from the word sequence which is recognized according to the voice recognition, and a corresponding operation is related to this intention. According to the second technique, for example, with respect to utterances such as “TATTE (Up)”, “OKITE (Get up)”, “TACHIAGATTE (Stand up)” which are uttered to the robot by a user, the respective utterances are recognized according to the voice recognition. The intention (for example, “TATTE-KUDASAI (Please stand up)” in this case) implied in the respective utterances recognized in this way is then estimated, so that the robot can be controlled so as to perform an operation (for example, a stand up operation in this case) corresponding to the intention.
In general, while one operation corresponds to one intention, a plurality of utterances exist corresponding to one intention. Thus, according to the first technique, since one operation should be related to each word sequence, the same operation should be related individually to each of the plurality of word sequences which correspond to one intention. On the other hand, according to the second technique, one operation only has to be related to the one intention which corresponds to the plurality of word sequences. Accordingly, as the technique for relating the operation to the word sequence which is recognized according to the voice recognition, the second technique is more appropriate than the first technique.
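The difference between the two techniques can be seen in a small sketch using the utterances of the example above. The table names, the function names and the operation label `stand_up` are hypothetical, and the table lookup in the second technique merely stands in for the estimation of the intention.

```python
# First technique: the operation is related directly to every word sequence,
# so the same operation is registered once per word sequence.
OPERATION_BY_WORD_SEQUENCE = {
    "TATTE": "stand_up",
    "OKITE": "stand_up",
    "TACHIAGATTE": "stand_up",
}

# Second technique: word sequences are first mapped to an intention
# (a lookup standing in for intention estimation), and the operation
# is related to the intention only once.
INTENTION_BY_WORD_SEQUENCE = {
    "TATTE": "TATTE-KUDASAI",
    "OKITE": "TATTE-KUDASAI",
    "TACHIAGATTE": "TATTE-KUDASAI",
}
OPERATION_BY_INTENTION = {
    "TATTE-KUDASAI": "stand_up",
}

def operation_first_technique(word_sequence):
    """Return the operation related directly to the word sequence, or None."""
    return OPERATION_BY_WORD_SEQUENCE.get(word_sequence)

def operation_second_technique(word_sequence):
    """Return the operation related to the intention of the word sequence, or None."""
    intention = INTENTION_BY_WORD_SEQUENCE.get(word_sequence)
    return OPERATION_BY_INTENTION.get(intention)
```

Both functions return the same operation for the three utterances, but in the second technique changing the operation for the intention requires editing only one entry.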
In order to realize such a second technique, a device is employed which is configured to estimate the user's intention implied in the utterance from the word sequence recognized according to the voice recognition. Hereinafter, such a device is referred to as a voice understanding device.
In order to estimate the user's intention implied in an utterance, a voice understanding device in the related art determines a word sequence which is compatible with an input voice signal based on the utterance, on the basis of a word dictionary corresponding to intention information indicating one intention and a grammar rule. Such a word sequence is determined with respect to each of plural pieces of intention information. Then, this voice understanding device calculates the similarity between the determined word sequence and the input voice signal with respect to each of the plural pieces of intention information. Specifically, an acoustic score indicating an acoustic similarity and a language score indicating a language similarity are calculated as values indicating the similarity, with respect to each of the plural pieces of intention information. Then, the voice understanding device of the related art estimates, using the two scores, the intention which corresponds to the input voice signal from among the plural pieces of intention information (for example, refer to Japanese Unexamined Patent Application Publication No. 2006-53203).
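The final selection step of such a related-art device might look like the following sketch. It assumes that the two scores are combined as a weighted sum and that the intention with the highest combined value is chosen; the cited publication's actual combination is not described above. The second intention "SUWATTE-KUDASAI" and all score values are hypothetical.

```python
def estimate_intention(candidates, lm_weight=1.0):
    """candidates maps each piece of intention information to a tuple
    (word_sequence, acoustic_score, language_score) determined for it.
    Returns the intention with the highest combined score (assumed rule)."""
    def combined(intention):
        _, acoustic, language = candidates[intention]
        return acoustic + lm_weight * language
    return max(candidates, key=combined)

# Hypothetical per-intention results for one input voice signal.
candidates = {
    "TATTE-KUDASAI": (["TATTE"], -12.0, -1.5),
    "SUWATTE-KUDASAI": (["SUWATTE"], -19.0, -1.2),
}
estimated = estimate_intention(candidates)
```

Because a word sequence is determined for every piece of intention information, one of the intentions is always returned by this selection, whatever the input voice signal is.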