In an aspect, speech recognition technology is used to detect a word registered in a dictionary from an input voice. This technique is known as word spotting. In word spotting, one or more words used in searching are stored in advance in the dictionary, and only registered words are extracted from the input voice. Thus, the word spotting technique may be used in voice information search. However, even the same word may differ in pronunciation, i.e., the waveform of a pronounced word may differ depending on the speaker, or even from one utterance to another by the same speaker. This may cause a recognition error of a kind that does not occur in recognition of written text.
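The word spotting described above may be sketched, purely for illustration, as follows. The dictionary entries, the phoneme strings, the similarity measure, and the threshold value are all assumptions introduced for this sketch; an actual system would score acoustic features (e.g., spectral frames) against acoustic models rather than compare phoneme strings.

```python
from difflib import SequenceMatcher

# Hypothetical dictionary of registered search words, each paired with an
# assumed reference phoneme sequence. Only these registered words can ever
# be extracted from the input voice.
DICTIONARY = {
    "yes": "y eh s",
    "no": "n ow",
}

def spot_words(recognized_phonemes: str, threshold: float = 0.8):
    """Return registered words whose similarity score reaches the threshold."""
    hits = []
    for word, reference in DICTIONARY.items():
        # Stand-in score: similarity between phoneme sequences.
        score = SequenceMatcher(None, reference.split(),
                                recognized_phonemes.split()).ratio()
        if score >= threshold:
            hits.append((word, score))
    return hits

# A registered word is detected; anything not in the dictionary is ignored.
print(spot_words("y eh s"))   # "yes" matches with score 1.0
```

The sketch also shows why pronunciation variation matters: if the input phoneme sequence deviates from the stored reference, the score drops below the threshold and the word is missed.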
In addition to word spotting, speech recognition technology is also used to recognize a voice in a dialogue. For example, a speech recognition method is known which learns an acoustic model and a language model in accordance with the length of a speech period, or a time elapsed since the start or the end of the speech period, thereby enhancing accuracy in recognizing voices whose feature values are difficult to distinguish accurately, as in spoken words in a dialogue. Another example is a state-based dialogue division apparatus configured to divide voice data of a dialogue between two persons into a plurality of pieces depending on states thereof, thereby achieving an improvement in the accuracy of a data mining result.
More specifically, the state-based dialogue division apparatus detects speech periods of respective speakers from voice data and compares the ratio of time between the speech periods of two speakers with at least two threshold values. In accordance with the result of the comparison with the threshold values, the state-based dialogue division apparatus divides the dialogue data into a plurality of pieces according to states such as a state in which one speaker is talking about a certain subject, a state in which another speaker answers, etc.
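The division described above may be sketched, for illustration only, as a comparison of the speech-time ratio against two threshold values. The particular threshold values, state labels, and function shape below are assumptions, not details of the apparatus itself.

```python
# Assumed threshold values for the ratio of speaker A's speech time to the
# total speech time of the two speakers.
T_LOW, T_HIGH = 0.2, 0.8

def classify_segment(time_a: float, time_b: float) -> str:
    """Classify a dialogue segment by comparing the speech-time ratio
    with the two assumed threshold values."""
    total = time_a + time_b
    if total == 0:
        return "silence"
    ratio = time_a / total
    if ratio >= T_HIGH:
        # One speaker dominates, e.g., talking about a certain subject.
        return "A talking about a subject"
    if ratio <= T_LOW:
        return "B talking about a subject"
    # Speech time is balanced, e.g., one speaker answering the other.
    return "exchange (e.g., one speaker answering the other)"

print(classify_segment(9.0, 1.0))  # speaker A dominates
print(classify_segment(2.0, 2.0))  # balanced exchange
```

Note that this classification uses only the lengths of the speech periods; as discussed below, period length alone carries no information about whether an utterance contains a reply word.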
Descriptions of techniques associated with the speech recognition technology may be found, for example, in International Publication Pamphlet No. WO/2008/069308, Japanese Laid-open Patent Publication No. 2010-266522, etc.
However, the techniques described above have a problem that it is difficult to accurately detect a reply uttered by a speaker in response to an utterance of another speaker, as described below.
That is, a reply uttered by a speaker in response to an utterance of another speaker is short, as is the case with, for example, “yes”, “no”, or the like, and includes a smaller amount of information than other utterances. Therefore, even when the speech recognition method or the state-based dialogue division apparatus described above is used, there is a limit on the accuracy in detecting replies. It may be possible to increase the probability of replies being detected by lowering the detection threshold value that is compared with the score calculated for the input voice. However, this may cause noise or other words to be incorrectly recognized as replies, which results in a reduction in the accuracy of detecting replies.
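The trade-off described above may be illustrated as follows. The utterance labels and scores are fabricated for illustration only; they do not come from any actual recognizer.

```python
# Fabricated (label, detection score) pairs for illustration of the
# threshold trade-off described above.
utterances = [
    ("yes (true reply)", 0.62),
    ("no (true reply)", 0.58),
    ("background noise", 0.45),
    ("unrelated word", 0.40),
]

def detect_replies(scored, threshold):
    """Return labels whose detection score reaches the threshold."""
    return [label for label, score in scored if score >= threshold]

print(detect_replies(utterances, 0.6))  # strict threshold: a true reply is missed
print(detect_replies(utterances, 0.4))  # lowered threshold: noise is also admitted
```

Lowering the threshold from 0.6 to 0.4 recovers the missed reply but also admits the noise and the unrelated word, which is precisely the reduction in detection accuracy noted above.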
Another problem with the conventional techniques described above is that when an uttered word has the same pronunciation as a word with a different meaning, the word may be erroneously detected as a reply. That is, for example, when “yes” is uttered, there are two possibilities: a first possibility is that “yes” is uttered by a speaker as a reply in response to a speech of another speaker, and a second possibility is that “yes” is used to call attention rather than to respond to a speech of another speaker, as in the example “Yes, it is now time.” In such a case, it may be difficult to detect whether the utterance is a reply or not.
Furthermore, the speech recognition method described above assumes that there is only one speaker, and no consideration is given as to whether the voice being recognized is part of a dialogue; that is, the method cannot determine whether the voice is of a dialogue or not. On the other hand, in the state-based dialogue division apparatus described above, the state of the dialogue is estimated based on the utterance period length. However, there is no correlation between the utterance period length and whether the content of the utterance includes a word used as a reply, and thus it is difficult to detect only a reply uttered in response to a speech.