In recent years, the range of applications of speech recognition technology has been expanding, and it now includes “dialogue speech recognition” technology, which converts a speech dialogue between persons into text. The “dialogue” or “speech dialogue” referred to herein means person-to-person information exchange by speech, and it is distinct from person-to-machine “dialogic” interaction using speech.
From the viewpoint of basic technology, there is no significant difference between dialogue speech recognition and large vocabulary continuous speech recognition. Specifically, when a speech waveform is input, a speech interval is cut out from it and a speech feature quantity such as a cepstrum is extracted. Conversion from the feature quantity to a phoneme and conversion from the phoneme to a character sequence (word sequence) are then performed simultaneously, and the conversion result with the maximum likelihood is output as text. In general, a set of conversion likelihoods from a feature quantity to a phoneme is called an acoustic model, and a set of conversion likelihoods from a phoneme to a character sequence (word sequence) is called a linguistic model.
The likelihood of the occurrence of a certain word sequence W in response to an input speech signal X is given by the following equation (1).
$$P(W \mid X) = \frac{P(X \mid W)\,P(W)}{P(X)} \qquad \text{Equation (1)}$$
Because speech recognition processing obtains the word sequence W′ with the maximum likelihood for an input speech, it can be represented as the following equation (2), in which P(X) is dropped because it does not depend on W.
$$W' = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\,P(W) \qquad \text{Equation (2)}$$
P(X|W) is given by the acoustic model, and P(W) is given by the linguistic model. Because an enormous amount of calculation is required to obtain the likelihoods of all word sequences W, P(X|W) is generally processed by being divided into units of phonemes. Various approximate calculations are also used for P(W). A representative example is the N-gram language model. When the word sequence W consists of w1, w2, w3, . . . , wk, the probability of occurrence P(W) is given by the following equation (3), and therefore the number of parameters involved in the likelihood calculation increases as the word sequence becomes longer.
$$P(W) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_k \mid w_1, w_2, \ldots, w_{k-1}) \qquad \text{Equation (3)}$$
This is approximated as follows, so that each word refers only to the nearest (N−1) words preceding it.
$$P(W) \approx p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_k \mid w_{k-N+1}, \ldots, w_{k-1})$$
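The N-gram approximation can be illustrated with a minimal sketch for N = 2 (a bigram model). The toy corpus, the `<s>` sentence-start marker used in place of p(w1), the add-alpha smoothing, and all function names are assumptions made for illustration; they do not come from this specification or the cited literature.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count how often each word follows each preceding word.

    "<s>" is a hypothetical sentence-start marker, so p(w1) is
    modelled as p(w1 | <s>).
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_log_prob(words, context_counts, bigram_counts, vocab_size, alpha=1.0):
    """Return log P(W) under the N = 2 approximation.

    Add-alpha smoothing (an assumption) keeps unseen word pairs
    from getting zero probability.
    """
    padded = ["<s>"] + list(words)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

# Toy training data (hypothetical).
corpus = [
    ["the", "count", "is", "two", "strikes"],
    ["the", "count", "is", "full"],
]
vocab = {w for sentence in corpus for w in sentence}
context_counts, bigram_counts = train_bigram(corpus)

seen_order = bigram_log_prob(
    ["the", "count", "is", "full"], context_counts, bigram_counts, len(vocab))
shuffled_order = bigram_log_prob(
    ["full", "is", "count", "the"], context_counts, bigram_counts, len(vocab))
```

Because each factor depends only on the preceding word, the number of parameters is bounded by the number of word pairs rather than growing with the length of the word sequence.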
Generally, a speech recognition program attains higher speed by skipping the calculation of hypotheses with a low likelihood. For example, in Non Patent Literature 1, a beam search algorithm is used. The algorithm excludes from the search those word sequence candidates that are obtained partway through processing an input speech and whose likelihood up to that time point does not satisfy a given threshold. Higher speed can also be attained by reducing the number of word sequences or acoustic hypotheses that serve as calculation targets. For example, when it is known that speech related to politics is input, only word sequences related to politics may be evaluated, and word sequences related to comics may be excluded. A similar effect can be obtained, without completely excluding the latter from calculation, by giving a linguistic model in which their likelihood becomes extremely low. As another example, when it is known that a speaker is male, it is not necessary to obtain the acoustic likelihood for a female voice, and the amount of calculation can be reduced. Such reduction of calculation targets, when done appropriately, contributes not only to an increase in speed but also to an improvement in recognition accuracy. In this specification, appropriately reducing calculation targets is referred to as “placing a condition” in some cases below.
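The pruning idea can be sketched as follows. This is a simplified illustration, not the algorithm of Non Patent Literature 1: the threshold is taken to be relative to the best partial hypothesis (`beam_width`), the per-step scores are context-independent, and all names and numbers are hypothetical.

```python
import math

def beam_prune(hypotheses, beam_width):
    """Keep only hypotheses whose log-likelihood is within
    beam_width of the best partial hypothesis so far."""
    if not hypotheses:
        return []
    best = max(score for _, score in hypotheses)
    return [(words, score) for words, score in hypotheses
            if score >= best - beam_width]

def beam_search(step_scores, beam_width):
    """Extend hypotheses one word at a time, pruning after each step.

    step_scores is a list of dicts mapping a candidate word to its
    log-likelihood at that step (a toy stand-in for the combined
    acoustic and linguistic scores).
    """
    hypotheses = [((), 0.0)]
    for scores in step_scores:
        expanded = [(words + (word,), total + score)
                    for words, total in hypotheses
                    for word, score in scores.items()]
        hypotheses = beam_prune(expanded, beam_width)
    return max(hypotheses, key=lambda h: h[1])

# Hypothetical per-step scores for two acoustically similar words.
step_scores = [
    {"hanshin": math.log(0.6), "sanshin": math.log(0.4)},
    {"wins": math.log(0.7), "out": math.log(0.3)},
]
best_words, best_score = beam_search(step_scores, beam_width=1.0)
pruned = beam_prune([(("a",), -1.0), (("b",), -5.0)], beam_width=2.0)
```

A narrower `beam_width` discards more candidates early and therefore reduces computation, at the risk of discarding a hypothesis that would have scored best once later input was taken into account.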
The key point for improving the accuracy of speech recognition technology is to predict the content of input speech and appropriately place a condition that reflects the same on a speech recognition process. For example, when a speaker is identified, an acoustic model according to the speaker may be used as the condition. When a topic of the content of utterance is identified, recognition accuracy is improved by using a linguistic model according to the topic as the condition. When a plurality of speakers speak, an acoustic model may be switched by detecting a change of speakers in some way. When a plurality of topics are presented in turn during utterance, a linguistic model may be switched according to a change of topics. Examples of such techniques are described in Non Patent Literature 2 and Patent Literature 1.
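As a rough illustration of placing a topic condition, the sketch below rescores the same candidate word sequences with a linguistic model selected by topic. The unigram probabilities, topic names, and function names are invented for this example, and the acoustic score is deliberately left out; it is not the method of Non Patent Literature 2 or Patent Literature 1.

```python
import math

# Hypothetical per-topic unigram probabilities (illustrative values only).
TOPIC_MODELS = {
    "politics": {"election": 0.5, "vote": 0.4, "hero": 0.1},
    "comics": {"election": 0.1, "vote": 0.1, "hero": 0.8},
}

def log_prob(words, model, floor=1e-6):
    """Sum of log unigram probabilities; unknown words get a small floor."""
    return sum(math.log(model.get(word, floor)) for word in words)

def best_hypothesis(candidates, topic):
    """Pick the candidate with the highest likelihood under the
    linguistic model selected for the given topic."""
    model = TOPIC_MODELS[topic]
    return max(candidates, key=lambda words: log_prob(words, model))

candidates = [("election", "vote"), ("hero", "vote")]
politics_choice = best_hypothesis(candidates, "politics")
comics_choice = best_hypothesis(candidates, "comics")
```

The same pair of candidates is ranked differently depending on the selected model, which is the effect that switching the linguistic model by topic is meant to achieve.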
In Non Patent Literature 2, a system that recognizes speech in “baseball live coverage” is described. Because the announcer, who is the speaker, becomes excited or quiet according to the situation of the game, the acoustic features are not constant even for the same speaker, which degrades speech recognition accuracy. Further, acoustically similar words such as “Hanshin” and “Sanshin” (strikeout) are found to be easily confused. In view of this, in the speech recognition system described in Non Patent Literature 2, the baseball coverage is structured using the progress (status) of the game, such as “whether the count is two strikes or not”. The system then predicts the progression of the status and performs speech recognition by appropriately switching, depending on the status, the acoustic model (a usual state model or an excited state model) or the linguistic model (models prepared separately for each strike count).
In Patent Literature 1, a speech dialog system that searches for information through person-to-machine dialogic interaction is described. The system described in Patent Literature 1 prompts a user to input certain information next, and therefore the content of the next utterance can be predicted to a certain degree as long as the user follows the prompt. Using this, the linguistic model is switched according to the question presented to the user.
The techniques for improving speech recognition accuracy described in Non Patent Literature 2, Patent Literature 1 and the like can also be applied to dialogue speech to a certain degree. However, dialogue speech has features not found in the speech targeted by the speech recognition techniques exemplified above.
A first feature of the dialogue speech is that there is a possibility that a plurality of speakers speak at the same time. Because general speech recognition technology is developed on the assumption of a single speaker, such speech cannot be recognized as it is.
For example, in the case of TV program speech, although speech can be recognized without difficulty in a scene where people speak one at a time in turns, speech cannot be recognized in a scene where a plurality of people intensely quarrel with one another. A news show is an example of the former, and a variety show is an example of the latter. This is part of the reason that recognition technology for variety shows is immature today while news speech recognition is being put to practical use. When some measures can be taken at the stage of recording, a method that prepares a plurality of microphones and, as a general rule, records the speech of one speaker per microphone may be used. If the speech of one speaker is recorded by one microphone, then even when a plurality of speakers speak at the same time, each recorded speech includes only the speech of one speaker, so that the issue can be avoided.
A second feature of the dialogue speech is that a speaker speaks in a manner that a human listener can hear, without consideration of the existence of a speech recognition system. This leads to degradation of the recognition accuracy of the speech recognition system.
When a speaker takes the existence of the speech recognition system into consideration, it is expected that the content of utterance can be controlled so that the system can easily recognize it. For example, when extremely rapid speech, a small voice, a muffled voice or the like is input, the speaker can be prompted to speak again, and it is relatively easy to predict the next content of utterance as in the technique of Patent Literature 1. The system can gain recognition accuracy by placing a condition specialized to such utterance “controlled to fit the system”.
On the other hand, in “speech for a person” such as dialogue speech, it is only necessary that the human listener can understand, so utterances that are unfavorable for the speech recognition system are often made. When an utterance is also unfavorable for the human listener, such as the rapid speech or small voice described above, the speaker is prompted to speak again. However, phonological distortion due to the speaker's feeling, and distortion, abbreviation or the like of a phrase that is unnecessary for communicating the main intention, often do not matter to a human listener, and they are input to the dialogue speech recognition system unchanged. As an example of the phonological distortion due to the speaker's feeling, the frequency of utterance in an excited state is higher than that of utterance in a usual state. Further, as an example of the distortion and abbreviation of a phrase that is unnecessary for communicating the main intention, “ . . . desu” is abbreviated as “ . . . su”, uttered very weakly and quickly, or omitted altogether.
Further, in the field of linguistics, a dialogue between two speakers is described as “a sequence of talks like A-B-A-B-A-B between two participants, where a participant A speaks and finishes speaking, and then another participant B speaks and finishes speaking” (cf. Non Patent Literature 3). Thus, a dialogue is considered to have a basic structure of repeated “turn-shifting” or “turn-taking”. This structure extends directly to the case where there are three or more dialogue participants.
Although the person who mainly speaks in a speech dialogue is the speaker who has the turn to speak, there is a possibility that a speaker who does not have the turn to speak also speaks. Sacks states that “in transitions from one turn to next turn, neither gap nor overlap usually occurs, and if any, it is short, and there is a general rule that basically one participant takes one turn and speaks” (cf. Non Patent Literature 3).